Project

movieDB

Perform Data Analysis on IMDB Movies
 Project Readme

MovieDB

MovieDB is a multi-threaded Ruby wrapper for performing advanced statistical computation and high-level data analysis on movie data from IMDb. Its goal is to help producers, directors, and writers make informed business decisions that generate a profitable return on investment.

Join the chat at https://gitter.im/keeperofthenecklace/movieDB

Technology

  • SciRuby is used for all statistical and scientific computations.
  • Redis is used to store all data.
  • IMDb and TMDb are the sources for all film data.
  • BoxOfficeMojo will be scraped for upcoming films.
  • Celluloid is used to build fault-tolerant concurrent programs. Note that MRI (whose virtual machine is YARV) has a Global Interpreter Lock (GIL), so its threads cannot run Ruby code in parallel, although they can still overlap while waiting on I/O. JRuby and Rubinius do not have a GIL and support real parallel threading.
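To illustrate the threading note, here is a minimal sketch that uses only Ruby's standard Thread class. It is not Celluloid and not movieDB's own code: slow_lookup is a hypothetical helper, and sleep stands in for a network request. Because waiting releases the interpreter lock, the three waits overlap even on MRI, so the total time stays close to one wait rather than three.

```ruby
# Hypothetical stand-in for a network call. On MRI, sleep releases the GIL,
# much like waiting on a socket does.
def slow_lookup(id)
  sleep(0.2)
  "movie-#{id}"
end

ids = %w[0369610 3079380 0478970]

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# One thread per id; Thread#value joins the thread and returns its result.
results = ids.map { |id| Thread.new { slow_lookup(id) } }.map(&:value)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts results.inspect
puts format('elapsed: %.2fs', elapsed)
```

CPU-bound work would not speed up this way on MRI; that is the case where JRuby or Rubinius makes a difference.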

Requirements

ruby-2.2.2 or higher

jruby-9.0.0.0

Installation

Redis Installation

This tutorial doesn't cover Redis installation. You will find that information at: http://redis.io/topics/quickstart

movieDB is available through RubyGems and can be installed via your Gemfile.

gem 'movieDB'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install movieDB

Console - loading the libraries

$ irb

Require the gem

require 'movieDB'

Initialize MovieDB (multi-thread setup)

m = MovieDB::Movie.pool(size: 2)

Step Process

Fetching and analysing movie data using movieDB is a simple two-step process.

First, fetch the data from IMDb.

Next, run the statistic of your choice.

That's it! It is that simple.

Part 1 - Fetch Data from IMDb

There are 3 ways to find IMDb ids.

  • Search IMDb id via API

  • Search IMDb id via Website

  • Generate random IMDb ids.

Search IMDb id via API

You can read the documentation for the IMDb API to see everything you can do with this gem.

i = Imdb::Search.new("Star Trek")

i.movies.size  #=> 97

This returns 97 objects related to 'Star Trek'.

To collect all the IMDb ids:

ids = i.movies.collect(&:id).uniq

#=> ["0796366", "0060028", "0079945" ...]

Search IMDb id via Website

To find the IMDb id for a specific movie, go to:

http://www.imdb.com

Search for your movie of choice. Once you do, IMDb redirects you to the movie's page.

The URL of the page you are redirected to includes the IMDb id:

http://www.imdb.com/title/tt0369610/

0369610 is the IMDb id.
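If you have several such URLs, the id can be pulled out with a regular expression. This is a small sketch; imdb_id_from_url is a hypothetical helper, not part of movieDB.

```ruby
# Hypothetical helper: extracts the digits that follow "tt" in an IMDb
# title URL. Returns nil when the URL contains no such identifier.
def imdb_id_from_url(url)
  match = url.match(%r{/title/tt(\d+)})
  match && match[1]
end

puts imdb_id_from_url('http://www.imdb.com/title/tt0369610/') # 0369610
```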

Generate random IMDb ids (multi-thread setup)

You can also fetch random IMDb ids. This approach will probably run you into some problems; see the Disclaimer.

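r = Random.new

39.times do
  # Zero-pad the random number to the 7-digit form of an IMDb id.
  m.async.fetch(format('%07d', r.rand(300_000)))
  sleep(4) # pause between requests to stay under the rate limit
end

sleep(10) # give the last asynchronous fetch time to finish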

Note: IMDb has a rate limit of 40 requests every 10 seconds. Requests are limited by IP address, not API key. If you exceed the limit, you will receive a 429 HTTP status with a 'Retry-After' header. As soon as your cool-down period expires, you are free to continue making requests.

Also, movieDB will raise a NameError if the randomly generated IMDb id is invalid.
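One way to respect the 429 response is to wait for the number of seconds in the 'Retry-After' header before trying again. The sketch below shows the idea with a simulated server. Response and with_retry_after are hypothetical names, movieDB is not known to expose anything like them, and the header is assumed to carry a plain number of seconds.

```ruby
# Hypothetical response shape; a real HTTP client supplies its own object.
Response = Struct.new(:status, :headers, :body)

# Calls the block until it returns a non-429 response or attempts run out.
# On 429, waits for the number of seconds given in the Retry-After header
# (assumed here to be an integer number of seconds, not an HTTP date).
def with_retry_after(max_attempts: 3, sleeper: ->(seconds) { sleep(seconds) })
  max_attempts.times do
    response = yield
    return response unless response.status == 429

    wait = Integer(response.headers.fetch('Retry-After', '1'))
    sleeper.call(wait)
  end
  raise 'rate limit: gave up after repeated 429 responses'
end

# Simulated server: the first call is rate limited, the second succeeds.
replies = [
  Response.new(429, { 'Retry-After' => '7' }, ''),
  Response.new(200, {}, 'ok')
]
waits = []
# The sleeper is replaced so the demonstration records the wait instead of sleeping.
result = with_retry_after(sleeper: ->(s) { waits << s }) { replies.shift }

puts result.body   # ok
puts waits.inspect # [7]
```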

Get Movie Data

m.async.fetch("0369610", "3079380", "0478970")

Calling m.async tells Celluloid to invoke the given method asynchronously. Rather than waiting for the responses from IMDb and TMDb, the caller sends a message to the concurrent object asking it to invoke the method, then proceeds immediately. The concurrent object processes the method call in the background.

Asynchronous calls never raise an exception in the caller, even if an exception occurs while the receiver is processing the call.
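The fire-and-forget behaviour can be pictured with a plain thread and a queue. This is only an illustration of the calling pattern and not how Celluloid is implemented; TinyAsyncWorker is a hypothetical class.

```ruby
# Hypothetical worker that runs jobs on a background thread. The caller
# enqueues a job and moves on without waiting for the result.
class TinyAsyncWorker
  attr_reader :completed, :errors

  def initialize
    @queue = Queue.new
    @completed = []
    @errors = []
    @thread = Thread.new { run }
  end

  # Returns immediately; the block runs later on the worker thread.
  def async(&job)
    @queue << job
    nil
  end

  # Signals the worker to stop and waits for queued jobs to finish.
  def shutdown
    @queue << :stop
    @thread.join
  end

  private

  def run
    while (job = @queue.pop) != :stop
      begin
        @completed << job.call
      rescue StandardError => e
        # The failure stays on the worker side; the caller never sees it.
        @errors << e.message
      end
    end
  end
end

worker = TinyAsyncWorker.new
worker.async { 'fetched 0369610' }
worker.async { raise 'invalid id' } # does not raise in the caller
worker.async { 'fetched 0478970' }
worker.shutdown

puts worker.completed.inspect
puts worker.errors.inspect
```

Because failures are invisible to the caller, any error reporting has to happen on the receiving side, for example through logging.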

Redis - caching objects

By default, any movie fetched from IMDb is stored in Redis with an expiration time of 1800 seconds (30 minutes).

But you can change this expiration time.

m.async.fetch("0369610", "3079380", expire: 86400)

Here, I set the expiration time to 86400 seconds, which is equivalent to 24 hours.
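The effect of an expiration time can be sketched without a Redis server. ExpiringCache below is a hypothetical in-memory stand-in, not movieDB's storage code; its clock is injectable so that the passage of time can be simulated.

```ruby
# Hypothetical in-memory cache with per-entry expiration, mirroring the
# idea of a Redis key that disappears after a number of seconds.
class ExpiringCache
  def initialize(default_ttl: 1800, clock: -> { Time.now.to_f })
    @default_ttl = default_ttl
    @clock = clock
    @entries = {}
  end

  # Stores a value that expires `expire` seconds from now.
  def store(key, value, expire: @default_ttl)
    @entries[key] = [value, @clock.call + expire]
    value
  end

  # Returns the value, or nil when the key is missing or has expired.
  def fetch(key)
    value, expires_at = @entries[key]
    return nil if expires_at.nil?

    if @clock.call >= expires_at
      @entries.delete(key)
      return nil
    end
    value
  end
end

now = 0.0
cache = ExpiringCache.new(clock: -> { now })
cache.store('0369610', 'first movie')                  # default: 1800 seconds
cache.store('3079380', 'second movie', expire: 86_400) # 24 hours

now = 1800.0 # simulate 30 minutes passing
puts cache.fetch('0369610').inspect # nil
puts cache.fetch('3079380')         # second movie
```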

Part 2 - Run the statistic

Below, we've collected 3 specific IMDb ids to analyze.

  • Ant-Man - 0478970
  • Jurassic World - 0369610
  • Spy - 3079380

Finding the mean value:

m.mean

Below is the result generated.

                             mean
       ant-man  576.8444444444444
jurassic_world  512.5111111111111
           spy 369.73333333333335
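For reference, the mean is the sum of a movie's numeric field values divided by the number of fields. The sketch below is plain Ruby rather than movieDB's implementation. The five sample values are taken from the ant-man column of the worksheet shown later in this README, so the result differs from the table above, which covers all fields.

```ruby
# Arithmetic mean of a list of numbers; returns nil for an empty list.
def mean(values)
  return nil if values.empty?

  values.reduce(0.0, :+) / values.size
end

# A handful of field values for one movie (subset chosen for illustration).
fields = { production: 177, company: 14, title: 7, director: 15, countries: 7 }

puts mean(fields.values) # 44.0
```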

Below are more statistical methods you can invoke on your objects.

Feel free to try them out.

  • std
  • sum
  • count
  • max
  • min
  • product
  • standardize
  • describe
  • covariance
  • correlation
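As a rough guide to two of these, std measures the spread of the values around the mean, and standardize rescales each value to a z-score, (value - mean) / std. The sketch below is plain Ruby under one assumption: it uses the sample standard deviation (dividing by n - 1). This README does not say which form movieDB uses.

```ruby
def mean(values)
  values.reduce(0.0, :+) / values.size
end

# Sample standard deviation (divides by n - 1). Whether movieDB uses the
# sample or the population form is not stated in this README.
def sample_std(values)
  m = mean(values)
  squared = values.map { |v| (v - m)**2 }
  Math.sqrt(squared.reduce(0.0, :+) / (values.size - 1))
end

# z-score of every value: (value - mean) / standard deviation.
def standardize(values)
  m = mean(values)
  s = sample_std(values)
  values.map { |v| (v - m) / s }
end

puts sample_std([1, 2, 3])          # 1.0
puts standardize([1, 2, 3]).inspect # [-1.0, 0.0, 1.0]
```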

Layout and Template

movieDB allows you to view all your data fields in a worksheet-style layout.

m.worksheet

A total of 45 fields are printed, but the result is truncated here for ease of reading.

              ant-man jurassic_w        spy
production        177        128         40
belongs_to          0        151          0
plot_synop       9083          0       9629
   company         14         18         21
     title          7         14          3
filming_lo        267       1037        530
cast_chara       4094       5894       1001
trailer_ur          0         46         45
cast_membe       2833       3452        939
     votes          5          6          5
     adult          5          5          5
also_known        928       1601       1195
  director         15         19         13
plot_summa        373        298        311
 countries          7         16          7
       ...        ...        ...        ...

Filters

When performing statistics on an object, movieDB processes all fields by default.

However, you can choose which fields are processed by using the following filters:

  • only
  • except

'only' analyzes just the fields you provide.

'except' is the inverse of 'only'. It analyzes all the fields you did not provide.

m.standardize only: [:budget, :revenue, :length, :vote_average]

Processes only budget, revenue, length and vote_average values.

              ant-man jurassic_w        spy
    budget 1.49999999 -0.3616594 1.49999999
   revenue -0.5000006 1.49304559 -0.5000013
    length -0.4999988 -0.5656929 -0.4999976
vote_avera -0.5000005 -0.5656931 -0.5000010
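The two filters behave like selecting or rejecting keys of a hash. The sketch below shows that idea in plain Ruby. filter_fields is a hypothetical helper, the numbers are made up for illustration, and movieDB's own filtering may work differently internally.

```ruby
# Hypothetical field filter. `only` keeps the listed fields;
# `except` drops the listed fields and keeps everything else.
def filter_fields(fields, only: nil, except: nil)
  raise ArgumentError, 'pass only: or except:, not both' if only && except
  return fields.select { |name, _| only.include?(name) } if only
  return fields.reject { |name, _| except.include?(name) } if except

  fields
end

# Made-up values for a single movie.
movie = { budget: 130, revenue: 519, length: 117, vote_average: 7, title: 7 }

puts filter_fields(movie, only: [:budget, :revenue]).keys.inspect
# [:budget, :revenue]
puts filter_fields(movie, except: [:title, :length]).keys.inspect
# [:budget, :revenue, :vote_average]
```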

Commands

movieDB comes with commands to help you query or manipulate the objects stored in Redis.

  • HGETALL key: get all the fields and values in the hash of a movie.
m.hgetall(["0369610"])
# => {"production_companies"=>"[{\"name\"=>\"Universal Studios\", \"id\"=>13},...}

  • HKEYS key: get all the fields in the hash of a movie.
m.hkeys
# => ["production_companies", "belongs_to_collection", "plot_synopsis", "company", "title",...]

  • HVALS key: get all the values in the hash of a movie.
m.hvals
# => ["[{\"name\"=>\"Universal Studios\", \"id\"=>13}, {\"name\"=>\"Amblin Entertainment\",...]

  • ALL_IDS key: get the ids of all stored movies.
m.all_ids
# => ["0369610", "3079380"...]

  • TTL key: get the remaining time to live of a movie.
m.ttl("0369610")
# => 120

  • DELETE key: delete a single movie object stored in Redis.
m.del("0369610")
# => ["3079380"...]

  • DELETE_ALL key: delete all movie objects stored in Redis.
m.delete_all
# => []

Contact me

If you'd like to collaborate, please feel free to fork the source code on GitHub.

You can also contact me at albertmck@gmail.com

Disclaimer

This software is provided “as is” and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose. Neither I, nor any developer who contributed to this project, accept any kind of liability for your use of this library.

IMDb does not permit use of its data by third parties without its consent.

Using this library for anything other than limited personal use may result in an IP ban from the IMDb website.

Copyright (c) 2013 - 2016 Albert McKeever, released under MIT license