Data Analysis of Movies and TV Shows on Netflix
Emma Finkel, Lauren Walton — SI 330
Motivation
Our project analyzes Netflix data in conjunction with review data from Rotten Tomatoes and IMDb. We chose this topic because streaming content takes up a large portion of our free time; by examining the data from several angles, we hope to get the most out of the service, saving both time and money.
Every user has different criteria for which movies and TV shows are worth watching, so we chose to address the topic by analyzing Netflix data through several different lenses. The specific questions we set out to answer:
- Which directors on Netflix receive the highest ratings for their movies and/or TV shows, on average?
- Which month in the 2010s (2010–2019 inclusive) had content added to Netflix with the highest average ratings? And how did average ratings for monthly content added to Netflix change over time?
- What month of the year does Netflix add content with the highest average ratings?
- Do TV shows with a greater number of seasons receive higher ratings, on average, than those with fewer seasons?
- What are the average ratings for Netflix’s movies and TV shows, across different target audience age groups? Among each of those groups, which movie or TV show has the highest rating? Additionally, how is content on Netflix distributed across target audience age groups?
- Is there a significant difference between the average ratings of movies and TV shows?
By answering these questions, we hope to help others find the content that interests them, and determine whether or not the streaming service is worth their money. Additionally, we hope the insights provided by our analysis could be used by Netflix to better achieve their key business objectives.
Data Sources
Source 1: “Netflix Movies and TV Shows”
This dataset was found on Kaggle. We extracted the information from a zip file into a CSV file and loaded it into a Pandas DataFrame. We chose this dataset because it provides detailed information about Netflix’s movies and TV shows, such as titles (string), director names (string), the date each title was added to Netflix (datetime), and the duration in minutes or number of seasons (string). We retrieved 7,787 records for use in the DataFrame, with date-added entries ranging from January 1, 2008 to January 16, 2021.
Source 2: “Movies on Netflix, Hulu, Prime Video, and Disney+”
This dataset, also found on Kaggle, contains movies available on Netflix, Hulu, Prime Video, and Disney+. We extracted the information from a zip file and stored it in a PostgreSQL database. Unlike Source 1, this dataset contains the IMDb and Rotten Tomatoes scores for each movie, which is the main reason we wanted to use it. We identified the most important columns as title (string), IMDb rating (float), Rotten Tomatoes score (string), Netflix (integer), and Type (integer); the Netflix and Type columns contain 0s and 1s as boolean flags. We retrieved 16,744 records for use in the DataFrame, with release years ranging from 1902 to 2020.
Source 3: “TV shows on Netflix, Hulu, Prime Video, and Disney+”
Our third dataset is very similar in structure to Source 2, except that it contains only the TV shows found on Netflix, Hulu, Prime Video, and Disney+. We found this dataset on Kaggle as well, extracted the information from a zip file, and stored it in the same PostgreSQL database. The most important columns are title (string), IMDb rating (float), Rotten Tomatoes score (string), Netflix (integer), and Type (integer); as in Source 2, the Netflix and Type columns contain 0s and 1s as boolean flags. We retrieved 5,610 records for use in the database, with release years ranging from 1901 to 2020.
Data Processing
The first step in the data cleaning process was to clean the Movies data (Source 2) and TV data (Source 3) using SQL queries. In our database, the table for Source 2 is called “movies” and the table for Source 3 is called “tv_shows”. First, we converted all of the column names in both tables to lowercase so that the two tables would have matching column names for a union. We also renamed the column “Type” to “is_tv” so that the name would better describe its values (1 for TV shows, 0 for movies). In both tables, the necessary columns were “title”, “age”, “year”, “is_tv”, “imdb”, “tomatoes” (Rotten Tomatoes score), and “netflix” (1 for on Netflix, 0 for not on Netflix). Because we selected the same columns from both tables, we could use a SQL UNION to combine the movies and tv_shows tables. Once the two tables were combined, we read the resulting table into a Pandas DataFrame, “mtv_df”.
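The exact SQL is not reproduced here; as a rough sketch, the same lowercase/rename/union pipeline can be expressed in pandas (the toy rows and the subset of columns below are illustrative, not taken from the real tables):

```python
import pandas as pd

# Hypothetical miniature versions of the "movies" and "tv_shows" tables
movies = pd.DataFrame({"Title": ["Roma"], "IMDb": [7.7], "Type": [0], "Netflix": [1]})
tv_shows = pd.DataFrame({"Title": ["Dark"], "IMDb": [8.8], "Type": [1], "Netflix": [1]})

# Lowercase every column name so the two tables line up for a union
for df in (movies, tv_shows):
    df.columns = [c.lower() for c in df.columns]

# "Type" -> "is_tv": 1 for TV shows, 0 for movies
movies = movies.rename(columns={"type": "is_tv"})
tv_shows = tv_shows.rename(columns={"type": "is_tv"})

# pandas equivalent of the SQL UNION of the two tables
mtv_df = pd.concat([movies, tv_shows], ignore_index=True)
```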
The Netflix data from Source 1 was read directly into a Pandas DataFrame, “netflix_df”. To perform our analysis, we needed a “master” DataFrame containing all of the content found in both “mtv_df” and “netflix_df”, so that all of the information about Netflix’s content lived in one source DataFrame.
Before creating the larger DataFrame, “mtv_df” needed additional cleaning. We converted the Rotten Tomatoes scores from strings into floats (80% → 8.0) using a function we created called “tomatoes_to_float”. The Rotten Tomatoes scores needed to be in the same format as the IMDb scores so that we could eventually average the two ratings.
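The original tomatoes_to_float is not shown in the report; a minimal version consistent with the description (80% → 8.0, with missing scores passed through) might look like:

```python
import pandas as pd

def tomatoes_to_float(score):
    """Convert a Rotten Tomatoes string such as '80%' to the 0-10 IMDb scale."""
    if pd.isna(score):               # some titles have no Rotten Tomatoes score
        return None
    return float(score.rstrip("%")) / 10.0
```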
Next, we performed an inner join on the content-title columns of “netflix_df” and “mtv_df”, keeping only the entries present in both DataFrames. From this larger DataFrame, we dropped several columns that we knew we would not use, such as “show_id”, “cast”, and “release_year”, as well as columns carrying duplicate information, such as the “type” column in “netflix_df”, which repeats the “is_tv” column from “mtv_df”. Dropping these columns gave us a more concise DataFrame for further analysis. Additional cleaning steps included renaming the “rating” column to “maturity_rating” to better describe its values, and converting the “date_added” and “year” columns to datetime.
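The join-and-clean step can be sketched as follows (the toy rows and the subset of columns are illustrative stand-ins for the real DataFrames):

```python
import pandas as pd

# Hypothetical miniature versions of the two DataFrames
netflix_df = pd.DataFrame({
    "title": ["Dark", "Roma"],
    "type": ["TV Show", "Movie"],
    "show_id": ["s1", "s2"],
    "rating": ["TV-MA", "R"],
    "date_added": ["June 27, 2017", "December 14, 2018"],
})
mtv_df = pd.DataFrame({"title": ["Dark"], "imdb": [8.8], "is_tv": [1]})

# Inner join keeps only titles present in both sources
master = netflix_df.merge(mtv_df, on="title", how="inner")
master = master.drop(columns=["show_id", "type"])   # "type" duplicates "is_tv"
master = master.rename(columns={"rating": "maturity_rating"})
master["date_added"] = pd.to_datetime(master["date_added"])
```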
Lastly, we created a new column called “avg_rating”. The average rating for any given movie or tv show in the DataFrame is simply the average between its IMDb score and its Rotten Tomatoes score (now matching the same format as IMDb). This column was created by applying a function we created called “get_avg_rating”. We wanted to get the average rating between Rotten Tomatoes scores and IMDb scores so we could take into account both user and critic reviews of the movies/TV shows.
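A minimal sketch of this step, assuming the IMDb and converted Rotten Tomatoes scores live in columns named “imdb” and “tomatoes”:

```python
import pandas as pd

def get_avg_rating(row):
    """Unweighted mean of the IMDb score and the rescaled Rotten Tomatoes score."""
    return (row["imdb"] + row["tomatoes"]) / 2

# Toy one-row DataFrame to illustrate applying the function row-wise
master = pd.DataFrame({"title": ["Dark"], "imdb": [8.8], "tomatoes": [9.3]})
master["avg_rating"] = master.apply(get_avg_rating, axis=1)
```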
Analysis & Visualization
Which directors on Netflix receive the highest ratings for their movies and/or TV shows, on average?
As part of our goal to optimize time spent on the Netflix platform, we wanted to determine which directors on Netflix had the highest average ratings. After grouping by the director name(s) stored in the “director” column, we aggregated the mean of the average ratings for each of the director’s projects on Netflix and counted the amount of content each director had on Netflix.
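The group-and-aggregate step can be sketched as below; the director names and ratings are hypothetical, and the column names follow the ones described above:

```python
import pandas as pd

# Toy stand-in for the master DataFrame
master = pd.DataFrame({
    "director": ["Director A", "Director A", "Director B"],
    "avg_rating": [9.0, 6.0, 8.5],
})

# Mean rating and title count per director, highest-rated first
director_stats = (
    master.groupby("director")["avg_rating"]
          .agg(mean_rating="mean", n_titles="count")
          .sort_values("mean_rating", ascending=False)
)
```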
Originally, we expected more well-known directors to have higher average ratings. However, as seen in the visualization, many of the directors with the highest average ratings had only one title on Netflix’s streaming service. Since more prominent directors had more titles with a greater variety of ratings, they ranked lower: their lower-rated projects weighed down their overall average.
Which month in the 2010s (2010–2019 inclusive) had content added to Netflix with the highest average ratings? And how did average ratings for monthly content added to Netflix change over time?
We wanted to determine which month between January 2010 and December 2019 had the highest average rating for content added to Netflix. The date added to Netflix was indicated in the “date_added” column in DateTime format. After setting this column as the index and resampling by month, the average rating was determined using the previously calculated “avg_rating” column for each month within the specified timeframe. After running this analysis on the data, it was determined that August 2013 had the highest average rating for content added with an average rating of 9.55. Meanwhile, February 2016 displayed the lowest average rating at 5.75.
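The monthly aggregation can be sketched as below (shown with per-month period grouping, equivalent to resampling by month; the dates and ratings are toy values chosen to echo the months discussed):

```python
import pandas as pd

# Toy data: two titles added in August 2013, one in February 2016
master = pd.DataFrame({
    "date_added": pd.to_datetime(["2013-08-02", "2013-08-20", "2016-02-05"]),
    "avg_rating": [9.5, 9.6, 5.75],
})

indexed = master.set_index("date_added")
# One mean rating per calendar month
monthly = indexed.groupby(indexed.index.to_period("M"))["avg_rating"].mean()
# Restrict to the 2010s (2010-2019 inclusive)
monthly = monthly[(monthly.index.year >= 2010) & (monthly.index.year <= 2019)]
best_month = monthly.idxmax()   # month with the highest average rating
```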
As seen in the visualization above, average ratings of content per month steadily decreased over the course of the decade. One possible explanation for these results is that in the earlier years of the decade Netflix inconsistently added content to their platform. As such, values during these years had greater influence over the month’s average rating since fewer titles were added. In later years, a greater number of titles were added every month weighing down the average scores.
What month of the year does Netflix add content with the highest Rotten Tomatoes/IMDb scores?
Another aspect of Netflix we wanted to analyze was the month of the year in which Netflix typically adds its highest-rated content. Using the “date_added” column as an index and grouping by calendar month, we calculated the mean monthly Rotten Tomatoes/IMDb score of added content across the entire time period.
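Unlike the previous question, here every June (for example) is pooled across all years. A sketch with hypothetical rows:

```python
import pandas as pd

# Toy data spanning multiple years
master = pd.DataFrame({
    "date_added": pd.to_datetime(["2018-06-01", "2019-06-15", "2018-08-10"]),
    "avg_rating": [7.0, 7.2, 6.5],
})

indexed = master.set_index("date_added")
# Group all Junes together, all Augusts together, etc., across every year
by_month = indexed.groupby(indexed.index.month)["avg_rating"].mean()
```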
This analysis showed that June typically had the highest-rated content added, with an average rating of 7.067, while August had the lowest at 6.541. This result didn’t surprise us, since June marks the beginning of summer vacation: many students have more time to spend on Netflix, so it follows that this is an optimal time to add high-quality content. In contrast, August marks the start of the school year, when students have drastically less time to watch content, making it a sensible time to add lower-rated titles.
Do TV shows with a greater number of seasons receive higher ratings, on average, than those with fewer seasons?
To determine whether the number of seasons plays a role in a title’s average rating, we first created a new DataFrame containing only TV shows. We then generated a new column called “duration_bin” by applying a function we created called “get_duration_bin”. This function determines which of three bins a given title’s duration falls into: less than 5 seasons, between 5–10 seasons, or greater than 10 seasons. Finally, we grouped on the values in the “duration_bin” column and aggregated the mean rating.
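The original get_duration_bin is not shown in the report; assuming TV durations in Source 1 are strings like "3 Seasons", a version consistent with the description could be:

```python
def get_duration_bin(duration):
    """Bin a duration string such as '3 Seasons' into one of three ranges."""
    n_seasons = int(duration.split()[0])   # assumes the '<n> Seasons' format
    if n_seasons < 5:
        return "Less than 5 seasons"
    if n_seasons <= 10:
        return "Between 5-10 seasons"
    return "Greater than 10 seasons"
```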
It was unsurprising to us that shows running for 5–10 seasons had the highest average rating. Oftentimes, TV shows with low ratings are cancelled before reaching five seasons, while shows that run longer than ten seasons begin to receive low ratings in their later years as viewers tire of the plot and stop watching. As such, 5–10 seasons appears to be the optimal duration. This is useful information for Netflix users deciding between committing to a show with many seasons and watching multiple shows with only one season.
What are the average ratings for Netflix’s movies and TV shows, across different target audience age groups? Among each of those groups, which movie or TV show has the highest rating?
For each target audience age group, we wanted to find the movie and TV show with the highest ratings. Target audience age groups were indicated by the “maturity_rating” column in the DataFrame. Movies and TV shows aimed at the same target audience sometimes differed in their values within this column: for example, a movie might have “G” as its maturity rating while a TV show has “TV-PG”, yet the target audience age group is the same. On Netflix’s website, target audience age groups are broken down into Adults, Teens, and Kids. Using Netflix’s maturity-rating guide, we created an additional column in the DataFrame (“aud_age_group”) to reflect those groups. Once we had this column, we could find the movie and TV show with the highest Rotten Tomatoes/IMDb score for each target audience age group.
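Our exact mapping is not reproduced in the report; a sketch along the lines of Netflix's three published audience groups might look like the following (the specific rating strings and their grouping are assumptions for illustration):

```python
# Assumed mapping from maturity ratings to Netflix's three audience groups
RATING_TO_AGE_GROUP = {
    "TV-Y": "Kids", "TV-Y7": "Kids", "G": "Kids",
    "TV-G": "Kids", "PG": "Kids", "TV-PG": "Kids",
    "PG-13": "Teens", "TV-14": "Teens",
    "R": "Adults", "NC-17": "Adults", "TV-MA": "Adults",
}

def get_age_group(maturity_rating):
    """Map a maturity rating to Adults / Teens / Kids (None if unrecognized)."""
    return RATING_TO_AGE_GROUP.get(maturity_rating)
```

The new column would then be produced with something like `master["aud_age_group"] = master["maturity_rating"].map(get_age_group)`.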
These results were not unexpected, as Breaking Bad and Avatar: The Last Airbender are incredibly popular shows. Interestingly, as seen in the DataFrame above, all of the TV shows had higher ratings than the movies. To explore this further, we looked at the average rating of all movies and TV shows in each target audience age group, as well as the number of entries falling into each category.
This resulting DataFrame aligns well with our previous result: on average, TV shows for adults, teens, and kids on Netflix have higher ratings than movies. As seen in the image above, one contributing factor is that Netflix’s catalog contains far more movies than TV shows. Kids’ TV shows have the highest ratings, but they also make up a relatively small portion of Netflix’s content, meaning that a single very highly rated kids’ show could skew the average upwards, and a poorly rated one would have the opposite effect. The visualization below reflects the distribution of movies and TV shows on Netflix for each age group.
Finally, is there a significant difference between the average ratings of movies and TV shows?
Our results from the last question showed that TV shows on Netflix tend to have higher Rotten Tomatoes/IMDb scores than movies on Netflix. To gain insight into whether there is a significant difference between the mean rating of Netflix’s movies and the mean rating of Netflix’s TV shows, we performed a t-test on these two groups, with the null hypothesis that the difference in the means is 0.
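The report does not show the test code; a sketch using scipy's two-sample t-test, with a hypothetical miniature master DataFrame, could be:

```python
import pandas as pd
from scipy.stats import ttest_ind

# Toy stand-in for the master DataFrame (ratings are illustrative)
master = pd.DataFrame({
    "is_tv":      [1,   1,   1,   0,   0,   0],
    "avg_rating": [8.5, 8.0, 9.0, 6.0, 6.5, 5.5],
})

tv_ratings = master.loc[master["is_tv"] == 1, "avg_rating"]
movie_ratings = master.loc[master["is_tv"] == 0, "avg_rating"]

# Two-sample t-test; H0: the two group means are equal
t_stat, p_value = ttest_ind(tv_ratings, movie_ratings)
```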
The resulting p-value from the t-test was about 4.19e-33. Because the p-value is well below 0.05, we reject the null hypothesis that the difference in means is 0. In other words, our initial assumption that there was no significant difference between the two samples (movies and TV shows) was incorrect.
Testing
In order to complete our analysis, we created several functions to clean, organize, and transform the initial datasets into the information we needed. Using assert statements, we verified that these functions transform the data as intended, which helped ensure a proper analysis.
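As an illustration, such checks look like the following (tomatoes_to_float is re-implemented here, without missing-value handling, so the snippet is self-contained):

```python
def tomatoes_to_float(score):
    """Convert '80%' -> 8.0; re-implemented here for illustration."""
    return float(score.rstrip("%")) / 10.0

# Assert-based spot checks of the helper's behavior
assert tomatoes_to_float("80%") == 8.0
assert tomatoes_to_float("100%") == 10.0
assert tomatoes_to_float("7%") == 0.7
```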
Conclusion
While working on this project, we were able to sanity-check our results against prior knowledge of the streaming platform and its content, and against our outputs for related questions. For example, we discovered that Breaking Bad was the highest-rated TV show for adults on Netflix; since Breaking Bad ran for 5 seasons, this supports our conclusion that TV shows with 5–10 seasons tend to have higher ratings on average.
Throughout our analysis, it was important to consider how the average ratings of Netflix content were affected by sample size, and how a title’s production year complicates the use of Rotten Tomatoes and IMDb scores. The initial dataset includes content produced in the early 1900s, and several of those rows were missing either an IMDb or a Rotten Tomatoes score. One reason may be that content produced during that era differed drastically from what we see on TV and in theaters today, making it difficult for Rotten Tomatoes/IMDb to rate those titles using the same criteria as modern movies and TV shows.
Source Code
Source code can be found on GitHub: https://github.com/emmafinkel/si330-finalproject