Introduction¶
This Jupyter Notebook aims to conduct a comprehensive statistical analysis of the Netflix Dataset. The focus will be on exploring content trends, distribution, and characteristics such as genres, languages, and IMDb ratings over time. This will help us understand how different factors might influence the popularity and ratings of shows and movies on Netflix.
Data Loading and Preparation¶
We begin by loading the necessary libraries and the dataset. Then, we will convert the 'Premiere' column to a datetime format for easier analysis, and inspect the data to understand its structure and content.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from datetime import datetime
# Load the data
df = pd.read_csv('../data/netflix/netflix.csv')
# Convert 'Premiere' to datetime format
df['premiere'] = pd.to_datetime(df['premiere'])
# Check initial data
df.head()
| | title | genre | language | imdb_score | premiere | runtime | year |
|---|---|---|---|---|---|---|---|
| 0 | Notes for My Son | Drama | Spanish | 6.3 | 2020-11-24 | 83 | 2020 |
| 1 | To Each, Her Own | Romantic comedy | French | 5.3 | 2018-06-24 | 95 | 2018 |
| 2 | The Lovebirds | Romantic comedy | English | 6.1 | 2020-05-22 | 87 | 2020 |
| 3 | The Perfection | Horror-thriller | English | 6.1 | 2019-05-24 | 90 | 2019 |
| 4 | Happy Anniversary | Romantic comedy | English | 5.8 | 2018-03-30 | 78 | 2018 |
Data Cleaning¶
In this section, we will ensure the quality of our dataset by checking for and handling missing values and duplicates. This is crucial for maintaining accuracy in our analysis.
# Check for missing values
print(df.isnull().sum())
# Handling missing values (if any)
df.dropna(inplace=True) # or other methods depending on the context
# Removing duplicates
df.drop_duplicates(inplace=True)
# Confirm changes
df.info()
title         0
genre         0
language      0
imdb_score    0
premiere      0
runtime       0
year          0
dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   title       583 non-null    object
 1   genre       583 non-null    object
 2   language    583 non-null    object
 3   imdb_score  583 non-null    float64
 4   premiere    583 non-null    datetime64[ns]
 5   runtime     583 non-null    int64
 6   year        583 non-null    int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 32.0+ KB
Exploratory Data Analysis (EDA)¶
We will now dive into the dataset to uncover patterns, detect outliers, and get a sense of the data distributions. This will involve both numeric and categorical analyses.
A. Overview of Numeric and Categorical Data¶
We start by looking at the basic statistics of numeric features and the distribution of categorical features such as genres and languages.
# Descriptive statistics for numeric columns
print(df.describe())
# Frequency of categories in 'Genre' and 'Language'
print(df['genre'].value_counts())
print(df['language'].value_counts())
# Distribution of IMDb Scores
sns.histplot(df['imdb_score'], kde=True)
plt.title('Distribution of IMDb Scores')
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.show()
       imdb_score                       premiere     runtime         year
count  583.000000                            583  583.000000   583.000000
mean     6.275129  2019-06-19 17:02:34.373927936   93.490566  2018.934820
min      2.500000            2014-12-13 00:00:00    4.000000  2014.000000
25%      5.700000            2018-06-26 12:00:00   86.000000  2018.000000
50%      6.400000            2019-10-16 00:00:00   97.000000  2019.000000
75%      7.000000            2020-09-19 12:00:00  107.500000  2020.000000
max      9.000000            2021-05-27 00:00:00  209.000000  2021.000000
std      0.976678                            NaN   27.706665     1.474598

genre
Documentary                    159
Drama                           77
Comedy                          49
Romantic comedy                 39
Thriller                        33
                              ...
Political thriller               1
Fantasy                          1
Romantic comedy-drama            1
Animation/Musical/Adventure      1
Supernatural drama               1
Name: count, Length: 114, dtype: int64

language
English                       401
Hindi                          32
Spanish                        31
French                         20
Italian                        14
Portuguese                     12
Indonesian                      9
Korean                          6
Japanese                        6
English/Spanish                 5
German                          5
Turkish                         5
Polish                          3
Dutch                           3
Marathi                         3
Filipino                        2
Thai                            2
English/Japanese                2
English/Hindi                   2
English/Mandarin                2
English/Korean                  1
Khmer/English/French            1
English/Akan                    1
Bengali                         1
English/Swedish                 1
English/Arabic                  1
English/Taiwanese/Mandarin      1
Norwegian                       1
Tamil                           1
English/Ukranian/Russian        1
Spanish/Catalan                 1
English/Russian                 1
Georgian                        1
Spanish/English                 1
Swedish                         1
Malay                           1
Thia/English                    1
Spanish/Basque                  1
Name: count, dtype: int64
Correlation Analysis¶
To understand relationships between numeric features, we will compute the correlation matrix. This can highlight potential associations between variables such as IMDb score and runtime.
# Correlation matrix
corr_matrix = df[['imdb_score', 'runtime', 'year']].corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Trend Analysis¶
Examining trends over time can provide insights into how content characteristics have evolved. We will look at trends in IMDb score and runtime across years.
# Trends in IMDb Score over years
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='year', y='imdb_score', marker='o')
plt.title('Trend of IMDb Scores Over Years')
plt.xlabel('Year')
plt.ylabel('Average IMDb Score')
plt.grid(True)
plt.show()
# Trends in Runtime over years
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='year', y='runtime', marker='o', color='red')
plt.title('Trend of Runtime Over Years')
plt.xlabel('Year')
plt.ylabel('Average Runtime (minutes)')
plt.grid(True)
plt.show()
Data Integration¶
To enhance our dataset with additional information, we will integrate a second Netflix dataset that includes a 'type' column (categorizing content as 'Movie' or 'TV Show') and a 'rating' column (content maturity rating). This allows us to conduct more targeted analyses, such as comparing IMDb scores between movies and TV shows or analyzing content distribution across different ratings.
Challenges¶
One key challenge in integrating datasets is ensuring that the join key (in this case, the title of the show or movie) matches exactly between datasets. Any discrepancy in titles (such as spelling errors, or extra information like a year appearing in one title but not the other) can lead to mismatches or missing data. We will perform a left join so that all titles in our original dataset are retained, adding 'type' and 'rating' information where available.
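Before merging, a light normalization of the join key can remove avoidable mismatches caused by stray whitespace or inconsistent casing. The sketch below is illustrative only: the title_key column and the commented merge are hypothetical and are not applied in this notebook, which joins directly on title.
import pandas as pd
# Hedged sketch: normalize titles before joining to reduce trivial mismatches.
# Illustrative only; 'df' is not modified in this notebook.
def normalize_title(s: pd.Series) -> pd.Series:
    return s.str.strip().str.lower()
# Hypothetical usage: build a normalized join key on each frame, then merge on it.
# df['title_key'] = normalize_title(df['title'])
# new_df['title_key'] = normalize_title(new_df['title'])
# merged = pd.merge(df, new_df[['title_key', 'type', 'rating']], on='title_key', how='left')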
# Load the second dataset into 'new_df'; it includes 'title', 'type', and 'rating' columns
new_df = pd.read_csv('../data/netflix/netflix_titles.csv')
# Merging the original dataset 'df' with 'new_df'
# We are using a left join to keep all entries from the original dataset and only add matching entries from 'new_df'
merged_df = pd.merge(df, new_df[['title', 'type', 'rating']], on='title', how='left')
# Check the first few rows and info to confirm the merge
print(merged_df.head())
merged_df.info()
               title            genre language  imdb_score   premiere  \
0   Notes for My Son            Drama  Spanish         6.3 2020-11-24
1   To Each, Her Own  Romantic comedy   French         5.3 2018-06-24
2      The Lovebirds  Romantic comedy  English         6.1 2020-05-22
3     The Perfection  Horror-thriller  English         6.1 2019-05-24
4  Happy Anniversary  Romantic comedy  English         5.8 2018-03-30

   runtime  year   type rating
0       83  2020  Movie  TV-MA
1       95  2018  Movie  TV-MA
2       87  2020  Movie      R
3       90  2019  Movie  TV-MA
4       78  2018  Movie  TV-MA

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   title       583 non-null    object
 1   genre       583 non-null    object
 2   language    583 non-null    object
 3   imdb_score  583 non-null    float64
 4   premiere    583 non-null    datetime64[ns]
 5   runtime     583 non-null    int64
 6   year        583 non-null    int64
 7   type        505 non-null    object
 8   rating      505 non-null    object
dtypes: datetime64[ns](1), float64(1), int64(2), object(5)
memory usage: 41.1+ KB
Handling Missing Data and Ensuring Data Sufficiency¶
Before conducting any statistical tests, it's essential to address missing data and verify that we have enough data points for a reliable analysis. This section will outline the steps taken to clean the data and ensure the integrity of our results.
Addressing NaN Values¶
NaN (Not a Number) values can significantly impact the outcome of statistical tests by distorting the actual distribution of data. We need to carefully handle these by either removing them or imputing them, depending on the nature and volume of the missing data.
Checking Data Sufficiency¶
The reliability of statistical tests also hinges on having a sufficient number of observations in each group being compared. This is crucial to avoid errors in statistical inference.
Steps to Address Missing Data and Check Data Sufficiency¶
- Identify NaN Values: We first identify and count NaN values in columns critical to our analysis, such as imdb_score.
- Remove or Impute NaN Values: Based on the amount and nature of the missing data, we either remove these entries or impute them with an appropriate statistic (median, mean, or mode); a hedged imputation sketch appears after this list.
- Verify Sufficient Data Points: After cleaning, we confirm that each category (movies and TV shows) retains enough entries to conduct meaningful statistical tests.
By meticulously preparing our data, we lay a strong foundation for accurate and reliable statistical analysis.
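For completeness, here is a hedged sketch of the imputation option mentioned in step 2. It is not used in this notebook (rows are dropped instead), and imdb_score turns out to have no missing values, so the lines below are shown for illustration only.
# Hedged sketch: median imputation as an alternative to dropping rows.
median_score = merged_df['imdb_score'].median()
imputed_scores = merged_df['imdb_score'].fillna(median_score)
print(imputed_scores.isnull().sum())  # 0 remaining NaNs after imputation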
# Identifying NaN values
print(merged_df.isnull().sum())
# Removing NaN values from 'IMDb Score'
merged_df.dropna(subset=['imdb_score'], inplace=True)
# Verifying the removal of NaN values
print(merged_df.isnull().sum())
# Checking the number of data points for each group
movies_count = merged_df[merged_df['type'] == 'Movie'].shape[0]
tv_shows_count = merged_df[merged_df['type'] == 'TV Show'].shape[0]
print(f"Number of movies with scores: {movies_count}")
print(f"Number of TV shows with scores: {tv_shows_count}")
# Ensure both groups have sufficient data points
if movies_count > 30 and tv_shows_count > 30:
print("Both groups have sufficient data points for analysis.")
else:
print("One or both groups do not have sufficient data points for reliable statistical analysis.")
title          0
genre          0
language       0
imdb_score     0
premiere       0
runtime        0
year           0
type          78
rating        78
dtype: int64

title          0
genre          0
language       0
imdb_score     0
premiere       0
runtime        0
year           0
type          78
rating        78
dtype: int64

Number of movies with scores: 505
Number of TV shows with scores: 0
One or both groups do not have sufficient data points for reliable statistical analysis.
Revising the Analysis Objective¶
During the initial stages of our analysis, we aimed to compare IMDb scores between movies and TV shows. However, upon closer inspection and data cleaning, we discovered that our dataset exclusively contains movies. This realization necessitates a shift in our analysis focus.
New Analysis Direction¶
Given that the dataset contains only movies, we can explore different aspects of this data. One potential avenue is to investigate how movies of different ratings (e.g., PG, PG-13, R) compare in terms of their IMDb scores. This will provide insights into whether the content rating affects the perceived quality of the movies.
New Hypothesis¶
- Null Hypothesis (H0): There is no significant difference in IMDb scores across different movie ratings.
- Alternative Hypothesis (H1): There is a significant difference in IMDb scores across different movie ratings.
We will perform an ANOVA test to evaluate this hypothesis, as it is suitable for comparing the means of three or more groups.
# Note: value_counts() ignores NaN ratings, and the isin() filter below excludes
# unrated titles from the ANOVA, so no rows need to be dropped here
# Check the number of movies in each rating category
rating_counts = merged_df['rating'].value_counts()
print(rating_counts)
# Filtering data for the most common ratings to ensure sufficient sample size
common_ratings = rating_counts[rating_counts > 30].index # Filter ratings with more than 30 movies
filtered_data = merged_df[merged_df['rating'].isin(common_ratings)]
# Performing ANOVA
anova_result = stats.f_oneway(
*[filtered_data[filtered_data['rating'] == rating]['imdb_score'] for rating in common_ratings]
)
print(f"F-statistic: {anova_result.statistic:.2f}")
print(f"P-value: {anova_result.pvalue:.4f}")
# Interpretation
if anova_result.pvalue < 0.05:
print("We reject the null hypothesis: There is a significant difference in IMDb Scores across movie ratings.")
else:
print("We fail to reject the null hypothesis: There is no significant difference in IMDb Scores across movie ratings.")
rating
TV-MA    253
TV-14     91
TV-PG     55
R         47
PG-13     23
TV-G      15
PG        11
TV-Y       5
TV-Y7      5
Name: count, dtype: int64

F-statistic: 2.42
P-value: 0.0654
We fail to reject the null hypothesis: There is no significant difference in IMDb Scores across movie ratings.
Analysis Results¶
The results of the ANOVA test aimed at exploring the differences in IMDb scores across different movie ratings are now in. The F-statistic and the P-value from this test help us understand whether movie ratings significantly affect IMDb scores.
Findings¶
- F-statistic: 2.42
- P-value: 0.0654
Interpretation¶
Based on the p-value obtained from the ANOVA test, we can draw the following conclusions about our hypothesis:
"If the p-value is less than 0.05, it indicates statistical significance; however, in this case, the p-value is 0.0654: "We fail to reject the null hypothesis: There is no significant difference in IMDb Scores across movie ratings. This implies that the content rating does not significantly influence how movies are rated on IMDb. Therefore, viewers' perception of quality does not appear to be strongly associated with the official content ratings."
This outcome suggests that other factors beyond the content rating may play a more substantial role in influencing the IMDb scores of movies. Further investigation could focus on variables such as genre, director, or cast to explore other potential influences on viewer ratings.
Understanding the P-Value of 0.0654¶
The p-value in a statistical test helps us decide whether to reject the null hypothesis. Typically, a threshold (alpha level) is set at 0.05, meaning:
- Less than 0.05: There is strong evidence against the null hypothesis, so we reject it.
- Greater than or equal to 0.05: We do not have enough evidence to reject the null hypothesis.
What Does a P-Value of 0.0654 Indicate?¶
A p-value of 0.0654 is slightly above the common alpha cutoff of 0.05. This result is often considered to be "marginally significant" or indicating a "trend towards significance." Here's what it implies:
- Statistical Significance: The p-value does not meet the conventional threshold for statistical significance, suggesting that we cannot conclusively reject the null hypothesis based on the 0.05 criterion.
- Practical Significance: Despite not achieving statistical significance, a p-value close to the threshold like 0.0654 might still hold practical significance, especially in social sciences and applied fields. It suggests a possible effect that could be explored further with a larger sample size or additional variables.
- Effect Size: It's also crucial to consider the effect size, which describes the magnitude of the difference between groups. Even if the p-value is not below 0.05, a large effect size can indicate that the differences, while not statistically significant, are meaningful in practical terms.
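As a hedged sketch of the effect-size point above, eta-squared for the one-way ANOVA can be computed as the between-group sum of squares divided by the total sum of squares, reusing filtered_data from the ANOVA cell:
# Sketch: eta-squared (SS_between / SS_total) for the one-way ANOVA above
grand_mean = filtered_data['imdb_score'].mean()
ss_total = ((filtered_data['imdb_score'] - grand_mean) ** 2).sum()
group_stats = filtered_data.groupby('rating')['imdb_score'].agg(['mean', 'count'])
ss_between = (group_stats['count'] * (group_stats['mean'] - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total
print(f"Eta-squared: {eta_squared:.3f}")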
Further Steps¶
- Re-evaluate the Sample Size: Increasing the sample size can help to decrease the p-value if the observed effect remains constant. More data provides a better estimate of the true effect size and can help push the p-value below the significance threshold.
- Review Assumptions: Check if all assumptions for the ANOVA were met, including homogeneity of variance and normality. Violations of these assumptions can affect the p-value (a hedged check is sketched after this list).
- Consider Adjusting Alpha: In some research contexts, adjusting the alpha level to a value slightly higher than 0.05 (e.g., 0.10) might be justified, especially in exploratory analyses or pilot studies where the aim is to identify potential patterns rather than confirm definitive effects.
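The hedged sketch referenced in the assumptions point above runs Levene's test for homogeneity of variance and a per-group Shapiro-Wilk normality check, reusing filtered_data and common_ratings from the ANOVA cell:
# Sketch: assumption checks for the one-way ANOVA above
groups = [filtered_data.loc[filtered_data['rating'] == r, 'imdb_score'] for r in common_ratings]
# Homogeneity of variance across rating groups (Levene's test)
levene_stat, levene_p = stats.levene(*groups)
print(f"Levene's test: statistic={levene_stat:.3f}, p-value={levene_p:.4f}")
# Approximate normality within each group (Shapiro-Wilk)
for rating, scores in zip(common_ratings, groups):
    shapiro_stat, shapiro_p = stats.shapiro(scores)
    print(f"Shapiro-Wilk for {rating}: statistic={shapiro_stat:.3f}, p-value={shapiro_p:.4f}")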
Conclusion¶
In conclusion, a p-value of 0.0654 does not provide enough evidence to statistically reject the null hypothesis at the conventional 0.05 level. However, it suggests a potential effect that merits further investigation, possibly with a revised approach or additional data. This kind of nuanced interpretation helps in understanding the limits and potential of your statistical analysis.
Distribution Analysis¶
Analyzing the distribution of IMDb scores and other related variables helps in understanding the data's underlying patterns, identifying outliers, and observing the spread and central tendency of the data. We will focus on the distribution of IMDb scores and explore how these scores vary across different movie ratings.
Objectives¶
- Explore the distribution of IMDb scores: We aim to understand the spread, skewness, and kurtosis of IMDb scores (a short numeric check follows this list).
- Examine variations across ratings: Analyze how the distribution of scores differs among various movie ratings to identify any patterns or anomalies.
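The quick numeric check referenced above, as a hedged complement to the plots that follow:
# Sketch: numeric skewness and excess kurtosis of the IMDb scores
print(f"Skewness: {merged_df['imdb_score'].skew():.3f}")
print(f"Excess kurtosis: {merged_df['imdb_score'].kurtosis():.3f}")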
# Overall Distribution of IMDb Scores
plt.figure(figsize=(10, 5))
sns.histplot(merged_df['imdb_score'], kde=True, color='blue')
plt.title('Distribution of IMDb Scores')
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.show()
# Boxplot to Show Distribution Across Ratings
plt.figure(figsize=(12, 6))
sns.boxplot(x='rating', y='imdb_score', data=merged_df)
plt.title('IMDb Scores by Movie Rating')
plt.xlabel('Movie Rating')
plt.ylabel('IMDb Score')
plt.show()
# Detailed Distribution for Each Rating
ratings = merged_df['rating'].unique()
for rating in ratings:
plt.figure(figsize=(10, 5))
sns.histplot(merged_df[merged_df['rating'] == rating]['imdb_score'], kde=True)
plt.title(f'Distribution of IMDb Scores for {rating} Rated Movies')
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.show()
Interpretation¶
- Histogram and KDE Plot: The overall histogram and KDE plot of IMDb scores provide a visual representation of the data's distribution, highlighting any skewness or potential outliers. This is crucial for assessing the data's normality.
- Boxplot Analysis: The boxplots for different ratings allow us to compare the median and range of IMDb scores across categories, helping to identify ratings with higher or lower variability.
- Rating-Specific Distributions: By examining the distribution of scores within each rating category, we can detect any peculiarities like bimodal distributions or unusually wide spreads, which might suggest varying content quality within the same rating category.
This distribution analysis is essential for informing subsequent analyses, such as regression modeling or cluster analysis, by ensuring that we understand the data's fundamental characteristics.
Regression Analysis¶
Regression analysis will allow us to understand the relationship between IMDb scores and other variables such as runtime and movie ratings. We will use linear regression to predict IMDb scores based on these features, identifying significant predictors of movie success.
Objectives¶
- Build a Linear Regression Model: To predict IMDb scores using runtime and movie ratings as predictors.
- Evaluate the Model: Assess the model's performance through R-squared and RMSE (Root Mean Square Error) metrics.
import statsmodels.api as sm
# Preparing data for regression analysis
# Create dummy variables for the categorical 'rating' column
# (titles without a matched rating have all dummy columns equal to zero and act as the implicit baseline)
rating_dummies = pd.get_dummies(merged_df['rating'])
# Other categorical variables could be encoded the same way;
# 'runtime' is the only additional (numeric) predictor here
X = pd.concat([merged_df[['runtime']], rating_dummies], axis=1)
variable_names = ['const'] + list(X.columns)
X = np.array(X.astype(int))
# Adding a constant for the intercept term
X = sm.add_constant(X)
# Response variable
Y = merged_df[['imdb_score']]
Y = np.array(Y.astype(float))
# Building the model
model = sm.OLS(Y, X).fit()
print(model.summary(xname=variable_names))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.060
Model:                            OLS   Adj. R-squared:                  0.043
Method:                 Least Squares   F-statistic:                     3.631
Date:                Sun, 21 Apr 2024   Prob (F-statistic):           0.000105
Time:                        12:12:39   Log-Likelihood:                -795.04
No. Observations:                 583   AIC:                             1612.
Df Residuals:                     572   BIC:                             1660.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.7706      0.170     39.835      0.000       6.437       7.104
runtime       -0.0003      0.002     -0.215      0.830      -0.003       0.003
PG             0.0600      0.308      0.195      0.846      -0.545       0.665
PG-13         -0.3489      0.229     -1.526      0.128      -0.798       0.100
R             -0.2344      0.185     -1.265      0.206      -0.598       0.130
TV-14         -0.6032      0.149     -4.061      0.000      -0.895      -0.312
TV-G          -0.5237      0.269     -1.944      0.052      -1.053       0.005
TV-MA         -0.6323      0.126     -5.015      0.000      -0.880      -0.385
TV-PG         -0.4857      0.169     -2.875      0.004      -0.817      -0.154
TV-Y           0.0257      0.444      0.058      0.954      -0.847       0.898
TV-Y7         -0.5041      0.441     -1.144      0.253      -1.370       0.362
==============================================================================
Omnibus:                       29.696   Durbin-Watson:                   2.060
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               34.248
Skew:                          -0.514   Prob(JB):                     3.66e-08
Kurtosis:                       3.594   Cond. No.                     1.18e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.18e+03. This might indicate that there are strong multicollinearity or other numerical problems.
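The objectives above list RMSE alongside R-squared, but the summary table reports only the latter. A minimal, hedged follow-up computes the in-sample RMSE from the fitted model's residuals:
# Sketch: in-sample RMSE of the fitted OLS model (in IMDb-score units)
rmse = np.sqrt(np.mean(model.resid ** 2))
print(f"RMSE: {rmse:.3f}")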
Detailed Interpretation of OLS Regression Results¶
The updated Ordinary Least Squares (OLS) regression model provides insights into how different movie ratings and runtime impact IMDb scores. The model includes variables for each rating category and the runtime of the movies.
Key Findings from the Regression Output:¶
- Constant (Intercept): The constant value is approximately 6.77, suggesting that if all other variables were zero (which is not practically possible), the expected IMDb score would be around 6.77.
- Runtime: The coefficient for runtime is -0.0003, which is not statistically significant (p-value = 0.830). This indicates that the length of the movies does not significantly affect their IMDb scores within this dataset.
- Ratings: Among the rating categories, the coefficients for TV-14, TV-MA, and TV-PG are statistically significant:
  - TV-14 has a coefficient of -0.6032 with a p-value < 0.001, indicating a significant negative association with IMDb scores relative to the baseline category.
  - TV-MA shows a similar negative association, with a coefficient of -0.6323 and a p-value < 0.001.
  - TV-PG has a coefficient of -0.4857 with a p-value of 0.004, suggesting a moderately negative association with scores.
- Other Ratings: PG, PG-13, R, TV-G, TV-Y, and TV-Y7 do not show statistically significant effects, indicating their influence on IMDb scores is not distinguishable from the baseline in this model.
Model Diagnostics and Considerations:¶
- Fit of the Model: The R-squared value is 0.060, and the adjusted R-squared is 0.043, indicating that only about 6% of the variance in IMDb scores is explained by this model. This suggests that other factors not included in the model might be influencing the IMDb scores.
- Multicollinearity: The condition number is quite high (1.18e+03), suggesting potential multicollinearity issues. This might be due to the high correlation between different rating categories or other included variables. It is advisable to check the variance inflation factor (VIF) for these variables to assess multicollinearity more definitively.
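As a hedged illustration of the VIF check suggested above, statsmodels exposes variance_inflation_factor; the sketch below applies it to the design matrix X and the variable_names list defined in the regression cell (the constant's VIF is typically ignored):
# Sketch: variance inflation factors for the OLS design matrix above
from statsmodels.stats.outliers_influence import variance_inflation_factor
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
for name, vif in zip(variable_names, vifs):
    print(f"{name}: VIF = {vif:.2f}")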
Conclusions and Further Steps:¶
The analysis indicates that while certain TV ratings (TV-14, TV-MA, and TV-PG) are significant predictors of IMDb scores, many other factors are not captured by this model. Further research could include additional variables, such as genre, director, cast, or viewer demographics, which may provide more insight into the factors affecting IMDb scores. Additionally, addressing the multicollinearity and exploring non-linear relationships or interactions between variables could improve the model's explanatory power.
Diagnostic Tests¶
- Durbin-Watson: The Durbin-Watson statistic is approximately 2.06, suggesting that there is no substantial autocorrelation in the residuals.
- Condition Number: The high condition number (1.18e+03) suggests potential issues with multicollinearity. This can affect the reliability of the coefficients and might require further investigation to adjust the model or consider dimensionality reduction techniques.
Conclusions and Further Analysis¶
Given the low explanatory power of the model and issues such as multicollinearity, further analysis may be needed to identify other potential predictors or to refine the model. Consideration of additional variables, interaction terms, or non-linear models might provide better insights. Moreover, examining the residuals and fitting diagnostic plots could offer further clues on model adequacy and the need for transformations or different modeling approaches.
This regression analysis highlights the complexity of predicting IMDb scores and underscores the need for a robust selection of predictors and careful model diagnostics.
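As a hedged example of the residual checks mentioned above, the sketch below plots residuals against fitted values and a normal Q-Q plot for the fitted model:
# Sketch: basic residual diagnostics for the OLS model above
fitted = model.fittedvalues
residuals = model.resid
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(fitted, residuals, alpha=0.5)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')
sm.qqplot(residuals, line='45', fit=True, ax=axes[1])
axes[1].set_title('Normal Q-Q Plot of Residuals')
plt.tight_layout()
plt.show()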
Cluster Analysis¶
To further explore the data, we'll perform cluster analysis to segment the movies into groups based on their features like runtime, IMDb score, and others. This helps in identifying patterns or segments within the movies that share similar characteristics. We'll start with just 2 clusters to see what our data looks like.
Objectives¶
- Identify Natural Groupings: Discover how movies cluster together based on their attributes.
- Interpret Clusters: Analyze the characteristics that define each cluster.
from sklearn.cluster import KMeans
import numpy as np
# Standardizing the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features = ['imdb_score', 'runtime']
X_scaled = scaler.fit_transform(merged_df[features])
# Running KMeans clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X_scaled)
clusters = kmeans.labels_
# Adding cluster labels to the original data
merged_df['Cluster'] = clusters
# Plotting clusters
sns.scatterplot(x='runtime', y='imdb_score', hue='Cluster', data=merged_df, palette='viridis')
plt.title('Cluster of Movies by Runtime and IMDb Score')
plt.show()
Evaluating the Optimal Number of Clusters: Why Multiple Methods Matter¶
When conducting cluster analysis, finding the optimal number of clusters is critical as it directly influences the quality and interpretability of the results. Different methods can provide insights into how data points group together under various clustering scenarios, offering a more nuanced understanding of the underlying structure.
Observations from Initial Analysis with Two Clusters¶
Our initial analysis using two clusters shows no natural division within the data and does not provide meaningful insight into the finer structure of the dataset. With only two clusters, the split is too coarse, potentially obscuring subtler patterns that could be valuable for deeper analysis.
Need for Comprehensive Evaluation Methods¶
Given the limitations observed with an initial two-cluster solution, it's prudent to explore other methods to determine the most appropriate number of clusters. This involves not just identifying an elbow in the inertia plot but also assessing cluster quality and separation through other statistical techniques. By employing multiple evaluation methods, we can cross-verify the robustness of potential clustering solutions and ensure that the chosen number of clusters reflects both statistical validity and practical relevance.
Elbow Method¶
The Elbow Method is one of the most popular methods to determine the optimal number of clusters. It involves plotting the sum of squared distances of samples to their closest cluster center as a function of the number of clusters. We look for the 'elbow point,' where the rate of decrease sharply shifts, indicating that additional clusters beyond this point have diminishing returns.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Note: X here is the regression design matrix built earlier (runtime, rating dummies, and a constant); it has not been rescaled
# Let's create a range of values for k
ks = range(1, 11)
inertias = []
for k in ks:
model = KMeans(n_clusters=k, random_state=42)
model.fit(X)
inertias.append(model.inertia_)
# Plotting the elbow curve
plt.figure(figsize=(8, 4))
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.title('Elbow Method For Optimal k')
plt.xticks(ks)
plt.show()
Silhouette Method¶
The Silhouette Method measures the quality of clustering by determining how well each data point lies within its cluster. A high silhouette value indicates that the point is well matched to its own cluster and poorly matched to neighboring clusters. If the plot shows a peak at a certain number of clusters, this suggests that this number is optimal.
from sklearn.metrics import silhouette_score
# Silhouette scores over the same feature matrix X used for the elbow method above
silhouette_scores = []
ks = range(2, 11)
for k in ks:
model = KMeans(n_clusters=k, random_state=42)
labels = model.fit_predict(X)
score = silhouette_score(X, labels)
silhouette_scores.append(score)
# Plotting the silhouette scores
plt.figure(figsize=(8, 4))
plt.plot(ks, silhouette_scores, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('silhouette score')
plt.title('Silhouette Method For Optimal k')
plt.xticks(ks)
plt.show()
Final Decision on Optimal Number of Clusters¶
After employing both the Elbow and Silhouette methods to determine the most suitable number of clusters, our analysis shows a clear preference.
Elbow Method Results¶
Using the Elbow Method, we observed potential elbow points at 3 and 4 clusters. This suggests that either could be a viable choice based on the rate of decrease in inertia.
Silhouette Method Validation¶
To further refine our choice, we applied the Silhouette Method. This method assesses the quality of clustering by measuring how similar each data point is to its own cluster compared to other clusters. A higher silhouette score indicates better-defined clusters.
Silhouette Scores:¶
- 3 Clusters: Score > 0.6, indicating a strong structure and well-separated clusters.
- 4 Clusters: Score around 0.5, suggesting less separation and cohesion compared to 3 clusters.
Conclusion¶
Combining these findings, we conclude that 3 clusters provide the best balance of separation and cohesion, making it the optimal choice for our dataset. This decision is based on higher silhouette scores indicating clearer and more meaningful distinctions between clusters.
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
# Using the same feature matrix X as above
ks = range(2, 6) # Testing from 2 up to 5 clusters
silhouette_scores = []
for k in ks:
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
silhouette_scores.append(score)
# Plotting silhouette scores for different cluster counts
plt.figure(figsize=(8, 4))
plt.plot(ks, silhouette_scores, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores by Number of Clusters')
plt.xticks(ks)
plt.show()
# Running KMeans clustering with the 3 clusters chosen above
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_scaled)
clusters = kmeans.labels_
# Adding cluster labels to the original data
merged_df['Cluster'] = clusters
# Plotting clusters
sns.scatterplot(x='runtime', y='imdb_score', hue='Cluster', data=merged_df, palette='viridis')
plt.title('Cluster of Movies by Runtime and IMDb Score')
plt.show()
Principal Component Analysis (PCA)¶
Principal Component Analysis (PCA) is a statistical technique that simplifies the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, which act as summaries of features, called principal components. These components capture the most variance (information) in the data, making it easier to explore and visualize.
Objectives of PCA:¶
- Reduction of Dimensionality: Reduce the number of variables in the dataset while preserving as much information as possible.
- Visualization: Help in visualizing the data by reducing dimensions to 2D or 3D.
- Improved Insight: Facilitate better understanding and identification of patterns in the data.
Benefits of Using PCA:¶
- Reduces computational costs by decreasing the number of dimensions.
- Minimizes the complexity of the model, which can improve algorithm performance.
- Helps in identifying hidden patterns that are not observable in high-dimensional space.
We will apply PCA to our dataset to analyze how data points are grouped or separated in the lower-dimensional space.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Standardize the regression design matrix X before applying PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Applying PCA
pca = PCA(n_components=2) # Reduce data into 2 dimensions for visualization
X_pca = pca.fit_transform(X_scaled)
# Explained variance ratio
print("Explained variance ratio:", pca.explained_variance_ratio_)
# Plotting the PCA-transformed version of the data
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Dataset')
plt.grid(True)
plt.show()
Explained variance ratio: [0.15441856 0.13754093]
# Create a DataFrame for the PCA results
pca_scores_df = pd.DataFrame(data=X_pca, columns=['Principal Component 1', 'Principal Component 2'])
# Save the PCA scores to a CSV file
# pca_scores_df.to_csv('../data/netflix/pca_scores.csv', index=False)
# Save the explained variance to a text file
# explained_variance = pca.explained_variance_ratio_
# with open('../data/netflix/explained_variance.txt', 'w') as file:
# file.write('Explained variance by component: {}\n'.format(explained_variance))
# Provide the paths to the saved files
# print("PCA scores saved to: pca_scores.csv")
# print("Explained variance saved to: explained_variance.txt")
Analysis of PCA Results¶
In the PCA plot:
- Principal Component 1 (PC1): Represents the direction of maximum variance in the data. Data points spread along PC1 indicate the presence of variance or diversity concerning the features summarized by this component.
- Principal Component 2 (PC2): Captures the second most significant variance, orthogonal to PC1.
The scatter plot of the PCA scores shows how the data points are distributed across the first two principal components. The plot provides a visual representation of the data in the transformed feature space.
Regarding the explained variance, the first two components explain approximately 29.20% of the total variance in the data. While this is a significant amount, it suggests that there may be other important dimensions that contribute to the data's structure, as over 70% of the variance remains unexplained by these two components.
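To gauge how many components would be needed to capture most of the variance, a hedged sketch below refits PCA with all components on the same standardized matrix and plots the cumulative explained variance:
# Sketch: cumulative explained variance across all principal components
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
plt.figure(figsize=(8, 4))
plt.plot(range(1, len(cumulative) + 1), cumulative, '-o')
plt.axhline(0.9, color='red', linestyle='--', label='90% of variance')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Cumulative Explained Variance by Component')
plt.legend()
plt.grid(True)
plt.show()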
With the current PCA reduction, we could now consider clustering the data in this two-dimensional space, which may reveal more about the data's structure or any natural groupings.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
pca_scores_df = pd.DataFrame(data=X_pca, columns=['Principal Component 1', 'Principal Component 2'])
# Determining the silhouette score for different numbers of clusters
range_n_clusters = [2, 3, 4, 5, 6]
silhouette_avg_scores = []
for n_clusters in range_n_clusters:
# Initialize KMeans with the current number of clusters and fit to PCA scores
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(pca_scores_df)
# Calculate silhouette score and append to list
silhouette_avg = silhouette_score(pca_scores_df, cluster_labels)
silhouette_avg_scores.append(silhouette_avg)
print(f"For n_clusters = {n_clusters}, the average silhouette_score is: {silhouette_avg:.4f}")
# Plotting the silhouette scores
plt.figure(figsize=(8, 6))
plt.plot(range_n_clusters, silhouette_avg_scores, marker='o')
plt.title('Silhouette Scores for Different Numbers of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Average Silhouette Score')
plt.xticks(range_n_clusters)
plt.grid(True)
plt.show()
# Using the silhouette score, choose the optimal number of clusters and perform final clustering
optimal_n_clusters = range_n_clusters[silhouette_avg_scores.index(max(silhouette_avg_scores))]
print(f"Optimal number of clusters based on silhouette score: {optimal_n_clusters}")
# Perform KMeans with the optimal number of clusters
kmeans_optimal = KMeans(n_clusters=optimal_n_clusters, random_state=42)
kmeans_optimal.fit(pca_scores_df)
# Plotting final clusters
plt.figure(figsize=(10, 8))
plt.scatter(pca_scores_df['Principal Component 1'], pca_scores_df['Principal Component 2'],
c=kmeans_optimal.labels_, cmap='viridis', alpha=0.7)
plt.title(f'PCA Scores Clustered into {optimal_n_clusters} Groups')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
For n_clusters = 2, the average silhouette_score is: 0.5271
For n_clusters = 3, the average silhouette_score is: 0.6203
For n_clusters = 4, the average silhouette_score is: 0.6432
For n_clusters = 5, the average silhouette_score is: 0.6242
For n_clusters = 6, the average silhouette_score is: 0.6308
Optimal number of clusters based on silhouette score: 4
Summary of PCA and Clustering Analysis¶
PCA Results¶
We performed Principal Component Analysis (PCA) to reduce the dimensionality of our dataset, aiming to capture the most significant variance with fewer dimensions. The first two principal components explained approximately 29.20% of the total variance. While this amount is notable, it also suggests other dimensions may hold additional important variance.
Clustering with Silhouette Scores¶
To discover natural groupings within the PCA-reduced data, we utilized the silhouette score to determine the optimal number of clusters. The silhouette score measures how similar an object is to its own cluster compared to other clusters. A higher silhouette score indicates better-defined clusters.
The silhouette scores for different cluster counts were as follows:
- 2 clusters: Average score of 0.5271
- 3 clusters: Average score of 0.6203
- 4 clusters: Average score of 0.6432 (optimal)
- 5 clusters: Average score of 0.6242
- 6 clusters: Average score of 0.6308
Based on these scores, we chose to segment the data into 4 clusters, as it yielded the highest silhouette score, indicating a robust and meaningful cluster structure.
Final Clustering Visualization¶
The final clustering, with the data segmented into 4 groups, is visually represented below. Each data point is colored according to its cluster assignment, illustrating clear divisions among the groups within the PCA-reduced feature space.
This clustering provides us with a basis for further investigation into the characteristics of each group, potentially revealing insights into the dataset's underlying patterns and informing future analysis or decision-making processes.
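As a hedged starting point for that investigation, the sketch below attaches the final cluster labels to merged_df under an illustrative column name (pca_cluster) and summarizes IMDb score and runtime per cluster, assuming the rows of merged_df align one-to-one with the matrix used for PCA:
# Sketch: profile each PCA-based cluster on the original numeric features
# Assumes merged_df rows align with the rows used for PCA; 'pca_cluster' is an illustrative column name
merged_df['pca_cluster'] = kmeans_optimal.labels_
cluster_profile = merged_df.groupby('pca_cluster')[['imdb_score', 'runtime']].agg(['mean', 'median', 'count'])
print(cluster_profile)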
Conclusion¶
The PCA plot enables us to visually assess the structure and distribution of the data in a lower-dimensional space. By observing how data points are positioned relative to each other along principal components, we can gain insights into the underlying patterns and relationships that may not be apparent in the high-dimensional space.