Importing Libraries
import pandas as pd
import numpy as np
Explanation: Import the essential libraries.
Loading the Dataset
df = pd.read_csv('/path_to_your_dataset.csv')
Explanation: Load the dataset into a Pandas DataFrame.
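If you do not have a CSV file at hand, the commands in this guide can still be tried on a small in-memory dataset. The sketch below is only an illustration: the column names and values are made up, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; StringIO lets read_csv treat a string like a file.
csv_text = """name,city,score
Asha,Pune,82
Ben,,75
Chen,Delhi,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # number of rows and columns
print(df.isnull().sum())  # missing values per column
```

Empty fields in the CSV text are read as missing values, so the null checks later in this guide have something to find.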
Display First Few Rows
df.head()
Explanation: Display the first five rows to understand the structure.
Display Last Few Rows
df.tail()
Explanation: Display the last five rows of the dataset.
Dataset Information
df.info()
Explanation: Get an overview, including each column's data type and non-null count.
Descriptive Statistics
df.describe()
Explanation: Get summary statistics (count, mean, standard deviation, min, quartiles, and max) for each numeric column. The 50% row is the median.
Column Names
df.columns
Explanation: List all column names in the dataset.
Shape of the Dataset
df.shape
Explanation: Get the number of rows and columns.
Check for Null Values
df.isnull().sum()
Explanation: Count null values in each column.
Drop Rows with Null Values
df_cleaned = df.dropna()
Explanation: Remove rows with null values for a cleaner dataset.
Fill Null Values
df.fillna(value='Unknown', inplace=True)
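Explanation: Fill null values with a placeholder. Applied to the whole DataFrame, this also writes the string 'Unknown' into numeric columns, so limit it to text columns where that matters.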
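A single placeholder rarely suits every column. One possible approach, sketched here with made-up data and hypothetical column names, is to pass a dictionary so that each column gets its own fill value.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "city": ["Pune", None, "Delhi"],
    "score": [80.0, np.nan, 90.0],
})

# A dict maps each column to its own fill value:
# text gets a placeholder, numbers get the column median.
fill_values = {"city": "Unknown", "score": df["score"].median()}
df_filled = df.fillna(value=fill_values)

print(df_filled)
```

Using the median for the numeric column is just one choice; the mean or a fixed value may fit your data better.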
Unique Values in a Column
df['column_name'].unique()
Explanation: Display unique values in a specific column.
Value Counts
df['column_name'].value_counts()
Explanation: Count the occurrences of each unique value in a column.
Filter Rows by Condition
df_filtered = df[df['column_name'] > some_value]
Explanation: Filter rows based on a condition.
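Filters often need more than one condition. The following is a minimal sketch on made-up data showing how conditions can be combined; the column names are placeholders.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "score": [82, 75, 64, 91],
})

# Each condition needs its own parentheses; use & for AND, | for OR.
high_pune = df[(df["city"] == "Pune") & (df["score"] > 70)]

# isin() checks membership in a list of allowed values.
metros = df[df["city"].isin(["Delhi", "Mumbai"])]

print(high_pune)
print(metros)
```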
Selecting Multiple Columns
df[['column1', 'column2']]
Explanation: Select and display specific columns.
Add a New Column
df['new_column'] = df['column1'] + df['column2']
Explanation: Add a new column by combining values from other columns.
Rename Columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
Explanation: Rename columns for better readability.
Sorting Values
df.sort_values(by='column_name', ascending=False)
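Explanation: Sort the dataset by a specific column, in descending order here. The result is a new DataFrame; df itself stays unchanged unless you assign the result back.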
Drop a Column
df.drop('column_name', axis=1, inplace=True)
Explanation: Remove a specific column.
Group By and Aggregate
df.groupby('column_name').sum(numeric_only=True)
Explanation: Group by a column and apply an aggregate function such as sum. numeric_only=True keeps text columns out of the result.
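Several aggregates can be computed in one pass with `agg`. This is a small sketch on made-up data; `team` and `points` are hypothetical column names.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 20, 5, 15, 10],
})

# Select the column to aggregate, then list the functions to apply per group.
summary = df.groupby("team")["points"].agg(["sum", "mean", "count"])

print(summary)
```

The result has one row per group and one column per aggregate function.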
Calculate Mean of a Column
df['column_name'].mean()
Explanation: Calculate the mean of a specific column.
Calculate Median of a Column
df['column_name'].median()
Explanation: Calculate the median of a specific column.
Standard Deviation of a Column
df['column_name'].std()
Explanation: Calculate the standard deviation of a specific column.
Detecting Outliers
df[(df['column_name'] > upper_limit) | (df['column_name'] < lower_limit)]
Explanation: Detect outliers by specifying upper and lower limits. upper_limit and lower_limit are thresholds you define beforehand.
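The limits have to come from somewhere. One common convention, which is an assumption here and not something the filter itself prescribes, is the 1.5 x IQR rule. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up values with one obvious extreme.
df = pd.DataFrame({"column_name": [10, 12, 11, 13, 12, 11, 95]})

# Interquartile range: distance between the 25th and 75th percentiles.
q1 = df["column_name"].quantile(0.25)
q3 = df["column_name"].quantile(0.75)
iqr = q3 - q1

# Anything more than 1.5 * IQR beyond the quartiles is flagged.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = df[(df["column_name"] > upper_limit) | (df["column_name"] < lower_limit)]
print(outliers)
```

Other thresholds, such as a fixed number of standard deviations from the mean, may suit your data better.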
Apply Custom Function
df['new_column'] = df['column_name'].apply(lambda x: x * 2)
Explanation: Apply a custom function to each value in a column.
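For simple arithmetic, the same result can usually be obtained without `apply` by operating on the whole column at once. A small sketch on made-up data comparing the two forms:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"column_name": [1, 2, 3]})

# Element by element, calling the lambda once per value.
doubled_apply = df["column_name"].apply(lambda x: x * 2)

# Vectorized: the multiplication is applied to the whole column.
doubled_vector = df["column_name"] * 2

print(doubled_apply.equals(doubled_vector))
```

`apply` remains useful when the logic cannot be expressed with built-in column operations.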
Pivot Table
df.pivot_table(values='value_column', index='index_column', columns='column_name')
Explanation: Create a pivot table that summarizes value_column for each combination of index_column and column_name. The default aggregation is the mean.
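To see what a pivot table produces, here is a minimal sketch on made-up sales data; `region`, `product`, and `sales` are hypothetical column names.

```python
import pandas as pd

# Made-up data: two regions, two products.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["Pen", "Ink", "Pen", "Ink", "Pen"],
    "sales": [10, 4, 7, 3, 20],
})

# Rows are regions, columns are products, cells hold the mean of sales.
table = df.pivot_table(values="sales", index="region", columns="product", aggfunc="mean")

print(table)
```

Each cell aggregates all rows that share the same region and product, so the two North/Pen rows are averaged into one value.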
Correlation Matrix
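df.corr(numeric_only=True)
Explanation: Calculate the correlation matrix for numeric columns. In pandas 2.0 and later, numeric_only=True is needed when the DataFrame also contains text columns.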
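Correlation is only defined for numeric data, so text columns have to be left out. A minimal sketch on made-up data, assuming pandas 1.5 or later, where `corr` accepts the `numeric_only` argument:

```python
import pandas as pd

# Made-up data: one text column and three numeric columns.
df = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # rises with x
    "z": [4, 3, 2, 1],   # falls as x rises
})

# numeric_only=True skips the text column instead of raising an error.
corr = df.corr(numeric_only=True)

print(corr)
```

Values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate a weak one.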
Visualizing with Histograms
df['column_name'].hist()
Explanation: Plot a histogram for a column to view its distribution. Plotting requires Matplotlib to be installed.
Scatter Plot
df.plot.scatter(x='column_x', y='column_y')
Explanation: Create a scatter plot to see relationships between two columns.
Box Plot
df.boxplot(column='column_name')
Explanation: Generate a box plot to identify the spread and outliers.
Live Example with the Attached Dataset
import pandas as pd
import numpy as np
Explanation: Load Pandas and NumPy libraries for data manipulation.
df = pd.read_csv('/path_to_your_dataset.csv')
Explanation: Load your dataset into a DataFrame.
df.head(10)
Explanation: Display the first 10 rows for an initial look at the data.
df.info()
Explanation: Get information on data types and missing values for each column.
df.describe()
Explanation: Get a summary of statistics for numerical columns.
df['Rank'].nunique()
Explanation: Count unique values in the Rank column.
df[df.duplicated(['Rank'])]
Explanation: Check for duplicate ranks, if any.
df.nsmallest(5, 'Rank')
Explanation: Show the 5 top-ranked movies, that is, the rows with the lowest Rank values.
df['Title'].unique()
Explanation: List all unique movie titles.
df[df.duplicated(['Title'])]
Explanation: Check for duplicate movie titles, if any.
df['Genre'].unique()
Explanation: List all unique values in the Genre column. If a movie lists several genres in one field, each combination counts as its own value.
df['Genre'].value_counts()
Explanation: Count the frequency of each value in the Genre column.
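If the Genre column holds several genres per movie in a single comma-separated string (an assumption here, suggested by the `str.contains('Sci-Fi')` filter used later), counting individual genres takes an extra step. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; the comma-separated Genre format is an assumption.
df = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Action,Sci-Fi", "Drama", "Action,Drama"],
})

genre_counts = (
    df["Genre"]
    .str.split(",")   # "Action,Sci-Fi" -> ["Action", "Sci-Fi"]
    .explode()        # one row per individual genre
    .str.strip()      # drop stray spaces around names
    .value_counts()
)

print(genre_counts)
```

A movie with two genres is counted once under each of them, so the totals can exceed the number of movies.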
df.groupby('Genre')['Rating'].mean().nlargest(3)
Explanation: Find genres with the highest average ratings.
df['Description'].str.len().mean()
Explanation: Calculate the average length of movie descriptions in characters. Unlike apply(len), .str.len() does not fail on missing descriptions.
df.loc[df['Description'].str.len().nlargest(5).index, ['Title', 'Description']]
Explanation: Find the 5 movies with the longest descriptions.
df['Director'].nunique()
Explanation: Count unique directors in the dataset.
df['Director'].value_counts().head(5)
Explanation: Identify directors with the most movies in the dataset.
df.groupby('Director')['Revenue (Millions)'].mean().nlargest(5)
Explanation: Find directors with the highest average revenue.
df['Actors'].nunique()
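Explanation: Count unique values in the Actors column. Each value is a movie's full cast list, so this counts distinct casts, not individual actors.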
from collections import Counter
actor_counts = Counter(", ".join(df['Actors']).split(", "))
actor_counts.most_common(5)
Explanation: Split the cast lists into individual names and identify the 5 actors who appear most frequently.
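Joining the column with `", ".join(...)` raises a TypeError if any cast list is missing, and inconsistent spacing after commas can split one name into two spellings. The sketch below is one possible variant on made-up names that skips missing rows and trims each name; it is an illustration, not a replacement for checking your own data.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Made-up cast lists, including one missing value.
df = pd.DataFrame({"Actors": ["Ann Lee, Bo Park", np.nan, "Bo Park,Cy Diaz"]})

actor_counts = Counter(
    name.strip()                        # remove spaces around each name
    for cast in df["Actors"].dropna()   # skip movies without a cast list
    for name in cast.split(",")
)

top = actor_counts.most_common(1)
print(top)
```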
df['Year'].value_counts().sort_index()
Explanation: See how many movies were released each year.
df.groupby('Year')['Rating'].mean()
Explanation: Track average movie ratings over the years.
df[df['Year'] == 2016]
Explanation: List all movies released in a specific year (2016 here).
df['Runtime (Minutes)'].mean()
Explanation: Calculate the average runtime of movies.
df[df['Runtime (Minutes)'] > 150]
Explanation: List movies with a runtime longer than 150 minutes.
df['Runtime (Minutes)'].plot(kind='hist', title='Runtime Distribution')
Explanation: Plot a histogram to see the distribution of movie runtimes.
df['Rating'].mean()
Explanation: Find the average rating across all movies.
df.nlargest(10, 'Rating')[['Title', 'Rating']]
Explanation: Display the 10 highest-rated movies.
df['Rating'].plot(kind='hist', title='Rating Distribution')
Explanation: Plot a histogram to visualize the distribution of ratings.
df[df['Votes'] > 500000]
Explanation: List movies with more than 500,000 votes.
df[['Votes', 'Rating']].corr()
Explanation: Calculate the correlation between votes and ratings.
df['Revenue (Millions)'].sum()
Explanation: Calculate the total revenue for all movies.
df.nlargest(10, 'Revenue (Millions)')[['Title', 'Revenue (Millions)']]
Explanation: Display the top 10 highest revenue-generating movies.
df['Revenue (Millions)'].plot(kind='hist', title='Revenue Distribution')
Explanation: Plot a histogram to visualize revenue distribution.
df['Metascore'].mean()
Explanation: Find the average Metascore across all movies.
df.nlargest(5, 'Metascore')[['Title', 'Metascore']]
Explanation: Display the 5 movies with the highest Metascore.
df.plot.scatter(x='Metascore', y='Revenue (Millions)', title='Revenue vs. Metascore')
Explanation: Create a scatter plot to observe the relationship between revenue and Metascore.
df.pivot_table(values='Rating', index='Genre', columns='Year', aggfunc='mean')
Explanation: Create a pivot table showing average ratings by genre and year.
sci_fi_movies = df[df['Genre'].str.contains('Sci-Fi', na=False)]
sci_fi_movies
Explanation: List all movies whose Genre includes Sci-Fi. na=False treats missing genres as non-matches.
df[['Rating', 'Revenue (Millions)']].corr()
Explanation: Calculate the correlation between rating and revenue.
df['Runtime (Minutes)'].plot(kind='box', title='Runtime Outliers')
Explanation: Use a box plot to identify runtime outliers.
df.groupby('Director')['Metascore'].mean().nlargest(5)
Explanation: Find directors with the highest average Metascores.
df.groupby('Genre')['Votes'].mean().nlargest(5)
Explanation: Find the 5 genres with the highest average number of votes per movie.
df.groupby('Genre')['Revenue (Millions)'].mean().sort_values().plot(kind='bar', title='Average Revenue by Genre')
Explanation: Plot average revenue per genre as a bar chart.
df.groupby('Year')['Revenue (Millions)'].sum().plot(kind='line', title='Total Revenue by Year')
Explanation: Plot total revenue over the years.
df['Decade'] = (df['Year'] // 10) * 10
df.groupby('Decade').size()
Explanation: Add a Decade column and count the movies in each decade.
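The decade calculation relies on integer division: dividing the year by 10 drops the last digit, and multiplying by 10 restores the scale. A short sketch on made-up years:

```python
import pandas as pd

# Made-up release years.
df = pd.DataFrame({"Year": [1999, 2006, 2009, 2014, 2016]})

# 2006 // 10 is 200, and 200 * 10 is 2000.
df["Decade"] = (df["Year"] // 10) * 10

# size() counts the rows in each group.
per_decade = df.groupby("Decade").size()
print(per_decade)
```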
I’m a DevOps/SRE/DevSecOps/Cloud Expert passionate about sharing knowledge and experiences. I am working at Cotocus. I blog tech insights at DevOps School, travel stories at Holiday Landmark, stock market tips at Stocks Mantra, health and fitness guidance at My Medic Plus, product reviews at I reviewed, and SEO strategies at Wizbrand.