Neural Network Predicting Movie Box Office Performance
Alex Larson, ECE 539, Fall 2013
Abstract

The movie industry is a large part of modern-day culture. With the rise of services like Netflix, where people can watch hundreds of movies at any time, film's place in our culture is evident. Movie studios are always trying to come up with the next big thing to make the largest profit: they adapt books, plays, and comic books to cash in on already popular intellectual property, and they remake older films in the hope of matching a predecessor's success. Making a movie is an expensive endeavor, and people want to know whether a remake, an adaptation, or an entirely new idea will be successful. Some current prediction approaches use data from sites like Google and Wikipedia; studies have used the number of Google searches for a movie, or the number of hits its Wikipedia page receives, to predict its box office success. These methods have been shown to work well, but I believe a movie's success can also be predicted from its own features, such as genre, budget, release date, the studio making the movie, whether or not the movie is a new intellectual property, the actors involved, MPAA rating (PG, PG-13, etc.), and more. Using these features, one should be able to make a prediction of a movie's potential box office success. I propose to use artificial neural network methods to classify and predict a movie's potential box office success. Using some of the features described above, I will create a data set of movies released within the past few years. After a good set of features and class labels has been established, I will experiment with various pattern recognition classifiers, such as the multi-layer perceptron (MLP) and the k-nearest neighbor classifier, to predict the potential box office success of a movie.
Introduction and Motivation

The movie industry is a large part of modern-day culture, and many companies look to profit from the success of a movie. The distributor gains the profit from ticket sales, while many other companies advertise and promote their products by featuring them in movies or by associating a movie with their own products to boost revenue. One major motivation behind this project is to help investors choose the movies that could have the highest possible return: movies are very expensive to make, and investors want to know whether the payoff will be worth the investment. Movies are also something I enjoy very much; like many people, I think they are a wonderful form of entertainment. It was my hope that this project would be a fun and interesting way to look deeper into movies and the box office performance behind them.

Related Work

A few recent projects have dealt with predicting movie box office performance. One study was based on the activity around a movie's Wikipedia page: the researchers analyzed the activity of editors on the online encyclopedia, and based on this data they built a minimalistic predictive model for a movie's box office success [1]. Google also performed research on movie box office success, using trailer-related searches for a particular movie, together with the movie's franchise status and release season, to predict a movie's opening weekend with 94% accuracy [2].

Problem Statement

The goal of this project is to predict the potential box office success of a given movie based only on the characteristics known at its release.

Data

The data for the project was acquired from the-numbers.com [4], a website that tabulates many movie characteristics and statistics. Movie data was obtained for the years 2008 through 2012, plus an incomplete version of 2013. This project was performed late in 2013, so although the 2013 data was incomplete, it was still a good representation of movies released earlier that year. The features extracted from the data were: the movie's release month, distributor, genre, MPAA rating, and whether or not the movie was a sequel. Numeric values were assigned to the distributor, genre, and MPAA rating categories. For each year, a subset of movies was selected at random from the top-performing movies of that year. Based on a movie's yearly gross, I chose to divide the data into 3 classes: movies grossing less than $49 million, between $49 million and $91 million, and more than $91 million. This data was then translated into machine-readable text files that were used by various MATLAB programs to run the experiments for this project.

Experiments

Using the MATLAB programs from the ECE 539 website, experiments were run with the k-nearest neighbor classifier, the maximum likelihood classifier, and the multi-layer perceptron. The initial results of experimentation were not promising: each classifier achieved around a 30% classification rate on average. This value is unacceptable because it is essentially the same as random guessing when there are 3 possible class labels. From here the data was reevaluated. I plotted histograms of each feature for each class label and found many outliers in the distributor and genre features. Some smaller distribution studios had a successful movie in one of the years where data was collected but not in others; similarly, some genres, such as western and musical, are simply not represented enough in the data. These outliers were then removed from the data, and the values assigned to the features were reorganized: the distributor with the most successful movies was given a higher value, and the same was done with genre and MPAA rating.

Results

For all classifiers, cross validation was used: I would leave one year out of the training data, train the classifier with the data from the remaining years, and then test the trained classifier on the data from the held-out year. The k-nearest neighbor classifier was the fastest of the 3 classifiers used. For the kNN classifier I tested many different values of k; the best results I achieved were with 14 nearest neighbors, which gave an average classification rate around 48%, an improvement over the first implementation.

KNN Classifier
Testing year   2008  2009  2010  2011  2012  2013  Average
C Rate (%)       48    64    52    56    32    36       48
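The leave-one-year-out cross-validation procedure described above can be sketched as follows. The original experiments used MATLAB programs from the course website; this Python version, with a from-scratch k-nearest-neighbor classifier and made-up feature values, is only an illustration of the procedure, not the project code.

```python
# Sketch of leave-one-year-out cross-validation with a simple kNN
# classifier. Feature vectors here stand in for the encoded movie
# features (release month, distributor, genre, MPAA rating, sequel
# flag); the actual project data is not reproduced.
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def leave_one_year_out(data, k=3):
    """data maps year -> (X, y). Hold each year out in turn, train on
    the rest, and report the classification rate on the held-out year."""
    rates = {}
    for test_year in data:
        train_X, train_y = [], []
        for year, (X, y) in data.items():
            if year != test_year:
                train_X.extend(X)
                train_y.extend(y)
        X_test, y_test = data[test_year]
        correct = sum(
            knn_predict(train_X, train_y, x, k) == yi
            for x, yi in zip(X_test, y_test)
        )
        rates[test_year] = correct / len(y_test)
    return rates
```

In the project, k = 14 gave the best held-out classification rate; the choice of k trades off noise sensitivity (small k) against over-smoothing (large k).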
Confusion Matrix
    31  12   7
    24  14  12
    15   8  27

I then performed classification of the data using the maximum likelihood classifier, which also computes its results very quickly. The results of the maximum likelihood classifier do not change between runs, so this classifier only had to be run once. On average it performed about as well as the kNN model.

Maximum Likelihood Classifier
Testing year   2008  2009  2010  2011  2012  2013  Average
C Rate (%)       48    56    52    56    48    24     47.3

Confusion Matrix
    34  10   6
    25  10  15
    11  12  27

Finally, classification was done using the multi-layer perceptron. Many different perceptron networks were experimented with. This program took the longest of the three classifiers to run, and it was run over multiple trials because the results change from trial to trial. The MLP training showed promise, classifying around 60% of samples correctly during training, but on the actual testing data it performed similarly to the kNN and maximum likelihood classifiers, with an average classification rate around 47.3%.
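The report does not state which class-conditional density model the maximum likelihood classifier used; a common choice in the course toolkit is a Gaussian fit per class. The sketch below assumes diagonal-covariance Gaussians estimated from the training data, and all numbers in it are illustrative.

```python
# Minimal maximum likelihood classifier sketch, assuming (not stated
# in the report) Gaussian class-conditional densities with diagonal
# covariance estimated from the training set.
import math

def fit_gaussians(X, y):
    """Return per-class (means, variances) for each feature."""
    params = {}
    for label in set(y):
        cols = list(zip(*[x for x, yi in zip(X, y) if yi == label]))
        means = [sum(c) / len(c) for c in cols]
        vars_ = [
            max(sum((v - m) ** 2 for v in c) / len(c), 1e-6)  # floor avoids zero variance
            for c, m in zip(cols, means)
        ]
        params[label] = (means, vars_)
    return params

def log_likelihood(x, means, vars_):
    """Log density of x under an independent-feature Gaussian model."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, vars_)
    )

def ml_classify(params, x):
    """Pick the class whose fitted density makes x most likely."""
    return max(params, key=lambda c: log_likelihood(x, *params[c]))
```

Because the fitted means and variances are deterministic functions of the training data, this classifier gives the same result on every run, which matches the behavior noted above.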
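The kind of one-hidden-layer MLP with error back-propagation used in these experiments can be sketched as below. The report does not specify the network configurations that were tried, so the architecture, learning rate, and epoch count here are assumptions for illustration only.

```python
# Sketch of a 1-hidden-layer sigmoid MLP trained by online
# back-propagation with one-hot targets. Hyperparameters are
# illustrative assumptions, not the project's settings.
import math
import random

def train_mlp(X, y, n_hidden=4, n_classes=2, lr=0.5, epochs=2000, seed=0):
    """Train the network and return a predict(x) function."""
    rng = random.Random(seed)
    n_in = len(X[0])
    # small random initial weights; bias folded in as the last weight
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_classes)]
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    for _ in range(epochs):
        for x, t in zip(X, y):
            xb = list(x) + [1.0]                                   # input + bias
            h = [sig(sum(w * v for w, v in zip(row, xb))) for row in W1]
            hb = h + [1.0]                                         # hidden + bias
            o = [sig(sum(w * v for w, v in zip(row, hb))) for row in W2]
            target = [1.0 if c == t else 0.0 for c in range(n_classes)]
            # output-layer deltas (sigmoid derivative is o*(1-o))
            d_o = [(target[c] - o[c]) * o[c] * (1 - o[c]) for c in range(n_classes)]
            # hidden-layer deltas back-propagated through W2
            d_h = [
                h[j] * (1 - h[j]) * sum(d_o[c] * W2[c][j] for c in range(n_classes))
                for j in range(n_hidden)
            ]
            for c in range(n_classes):
                for j in range(n_hidden + 1):
                    W2[c][j] += lr * d_o[c] * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    W1[j][i] += lr * d_h[j] * xb[i]
    def predict(x):
        xb = list(x) + [1.0]
        hb = [sig(sum(w * v for w, v in zip(row, xb))) for row in W1] + [1.0]
        o = [sig(sum(w * v for w, v in zip(row, hb))) for row in W2]
        return max(range(n_classes), key=o.__getitem__)
    return predict
```

The random weight initialization is why MLP results vary from trial to trial, unlike the maximum likelihood classifier; averaging over multiple trials, as done in the experiments, accounts for that variance.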
MLP Back Propagation
Testing year   2008  2009  2010  2011  2012  2013  Average
C Rate (%)       52    48    48    52    40    44     47.3

Confusion Matrix
    23  14  13
    15  18  17
    13   8  29

Discussion

The results of these experiments were not superb, but they were an improvement over my preliminary classification runs. Among the interesting predictions of the MLP model for 2013, it correctly placed Iron Man 3, The Hunger Games: Catching Fire, and Oblivion in the most successful class. Some interesting misclassifications: Gravity belonged in the most successful category but was classified in the worst, while After Earth and The Internship both did poorly but were predicted to do well. All three classifiers tended to do better on the low and high classes; for the middle class they seldom chose correctly. There may not be enough correlation between this set of feature vectors and the chosen class labels. Movie performance can be erratic, as shown in the preliminary testing: every so often an outlier from a lesser-known studio comes out of nowhere and does extremely well, while on the other hand huge flops sometimes come from studios that normally put out great movies. In the end, this classifier did not perform as well as the Google or Wikipedia predictors. One improvement to this data set would be to increase the sample size of the movies, which may lessen the effect outliers had on classification. Adding more features to the feature vectors could also improve performance; other characteristics, such as a movie's budget, leading actor, or director, could also have an effect on the classification.

References

[1] Mestyán M, Yasseri T, Kertész J (2013) Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data. PLoS ONE 8(8): e71226. doi:10.1371/journal.pone.0071226
[2] Chen A, Panaligan R (2013) Quantifying Movie Magic with Google Search.
[3] http://boxofficemojo.com
[4] http://www.the-numbers.com