In this story, I investigate the TMDB movies dataset, which spans 1960 to 2015 and includes each movie's title, budget, revenue, cast, director, genres, release date, release year, runtime, and more.
The primary goal of the project is to perform an exploratory data analysis using the numpy, pandas, seaborn, and matplotlib libraries. For this, we need to clean the data first. Before cleaning, we should pose the questions we want this dataset to answer, because those questions guide the cleaning process.
The original data source comes from Kaggle.
Which movies have the all-time highest and lowest profit?
Which are the all-time top 10 movies by profit?
What are the highest-profit movie and the total profit for each year?
Which movies have the all-time highest and lowest budget?
Which are the all-time top 10 movies by budget?
What are the highest-budget movie and the total budget for each year?
Which movies have the all-time highest and lowest revenue?
Which are the all-time top 10 movies by revenue?
What are the highest-revenue movie and the total revenue for each year?
Which genres were most common from 1960 to 2015?
Which cast members appeared in the most movies?
Which director made the most movies?
How many movies were released in each month? What is the total profit by month?
Before going into the exploratory data analysis, we will create a few functions that make answering these questions easier.
The find_min_max function finds the minimum and maximum values of any given column, so we can use it on budget, revenue, and profit to find the movies with the highest and lowest values.
The top_10 function finds the all-time top 10 movies for any given column and also plots them in a bar chart.
If we want the total or the highest value of a given column (budget, revenue, or profit) for each year separately, we can use the each_year_best function. By default, it plots the total and highest value of the given column for the last 15 years.
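The body of find_min_max is not shown in this story, so here is a minimal sketch of how it could look. The sample rows, the original_title column, and the definition of profit as revenue minus budget are assumptions made for illustration, not the dataset's real values.

```python
import pandas as pd

# Hypothetical sample rows; in the real analysis df is the cleaned TMDB DataFrame.
df = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget": [100, 50, 200],
    "revenue": [300, 20, 250],
})
# Assumed definition of profit: revenue minus budget.
df["profit"] = df["revenue"] - df["budget"]


def find_min_max(column):
    """Return the movies with the highest and lowest value of `column`."""
    highest = df.loc[df[column].idxmax()]  # row with the maximum value
    lowest = df.loc[df[column].idxmin()]   # row with the minimum value
    # Show the two movies side by side, one column per movie.
    return pd.concat([highest, lowest], axis=1, keys=["highest", "lowest"])


result = find_min_max("profit")
print(result)
```

Returning both rows in one small table makes it easy to compare the two movies field by field.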
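A possible shape for top_10, again as a sketch under assumptions: the sample data and the original_title column are made up, and the chart is left as a comment so the example stays small.

```python
import pandas as pd

# Hypothetical sample data with twelve movies.
df = pd.DataFrame({
    "original_title": [f"Movie {i}" for i in range(12)],
    "profit": [5, 40, 10, 90, 20, 70, 30, 60, 80, 50, 100, 0],
})


def top_10(column):
    """Return the ten largest values of `column`, indexed by movie title."""
    return df.nlargest(10, column).set_index("original_title")[column]


top = top_10("profit")
print(top)
# For the bar chart, with matplotlib installed:
# top.sort_values().plot(kind="barh")
```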
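The per-year logic could be sketched with a groupby on release_year. The parameter name last_n_years, the sample rows, and the output column names are assumptions for illustration; plotting is omitted here.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "release_year": [2013, 2013, 2014, 2015, 2015],
    "profit": [10, 30, 25, 5, 40],
})


def each_year_best(column, last_n_years=15):
    """Per year: the best movie, its value, and the yearly total of `column`."""
    # Row label of the movie with the highest value in each year.
    best_idx = df.groupby("release_year")[column].idxmax()
    best = df.loc[best_idx, ["release_year", "original_title", column]]
    best = best.set_index("release_year").rename(
        columns={"original_title": "best_movie", column: f"highest_{column}"}
    )
    # Yearly totals align with `best` on the release_year index.
    best[f"total_{column}"] = df.groupby("release_year")[column].sum()
    # Keep only the most recent years present in the data.
    return best.sort_index().tail(last_n_years)


summary = each_year_best("profit")
print(summary)
```

Note that tail keeps the last years that actually appear in the data, which may differ from the last 15 calendar years if some years have no movies.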
find_min_max('profit')
top_10('profit')
each_year_best('profit')
find_min_max('budget')
top_10('budget')
each_year_best('budget')
Let's write a function to find the most frequent genres, cast members, or directors.
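The function body is not included in this story. A minimal sketch of split_count_data, assuming that each of these columns stores several values in one string separated by "|" (for example "Action|Drama"), could look like this. The sample rows are made up.

```python
import pandas as pd

# Hypothetical sample data; the "|" separator is an assumption about the column format.
df = pd.DataFrame({
    "genres": ["Action|Drama", "Drama", "Comedy|Drama|Action", None],
})


def split_count_data(column):
    """Count how often each individual value appears in a '|'-separated column."""
    # Drop missing rows, split each string into a list, then give every item its own row.
    values = df[column].dropna().str.split("|").explode()
    return values.value_counts()


counts = split_count_data("genres")
print(counts)
```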
split_count_data("genres")
split_count_data("cast")
split_count_data("director")
Let's find out whether there is any correlation between 'profit', 'budget', 'revenue', 'runtime', 'vote_count', 'popularity', and 'release_year'.
df_related = df[['profit','budget','revenue','runtime','vote_count','popularity','release_year']]
sns.pairplot(df_related, kind='reg')
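To put numbers next to what the pair plot shows, a Pearson correlation matrix can be computed with pandas. This is a small sketch on invented values, not results from the TMDB data; only a few of the columns are used to keep it short.

```python
import pandas as pd

# Invented values for illustration only.
df_related = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "revenue": [25, 45, 65, 85],
    "runtime": [120, 90, 110, 95],
})
# Assumed definition of profit: revenue minus budget.
df_related["profit"] = df_related["revenue"] - df_related["budget"]

# Pairwise Pearson correlation between all numeric columns.
corr = df_related.corr()
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.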
We analyzed the TMDB dataset, which spans 1960 to 2015. Our goal was to answer the questions above using this dataset. The results of the analysis can be summarized in the following items.
You can also find this project on my Medium and on my Kaggle.