Importing Libraries

import pandas as pd
import numpy as np

Explanation: Import the essential libraries.

Loading the Dataset

df = pd.read_csv('/path_to_your_dataset.csv')

Explanation: Load the dataset into a Pandas DataFrame.
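Because the path above is a placeholder, the loading step can be tried without a file on disk. The sketch below reads CSV text from memory; the column names and values are invented for illustration and are not taken from any real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """Title,Year,Rating
Movie A,2015,7.1
Movie B,2016,8.3
Movie C,2016,
"""

# read_csv accepts a file path or any file-like object.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)    # (rows, columns)
print(df_demo.dtypes)   # one inferred dtype per column
```

The empty Rating field in the last row is read as a missing value, which is why the null-handling steps further down matter.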

Display First Few Rows

df.head()

Explanation: Display the first five rows to understand the structure.

Display Last Few Rows

df.tail()

Explanation: Display the last five rows of the dataset.

Dataset Information

df.info()

Explanation: Get an overview of the DataFrame, including each column's data type and non-null count.

Descriptive Statistics

df.describe()

Explanation: Get summary statistics (count, mean, standard deviation, min, quartiles, and max) for each numeric column. The 50% row is the median.

Column Names

df.columns

Explanation: List all column names in the dataset.

Shape of the Dataset

df.shape

Explanation: Get the number of rows and columns.

Check for Null Values

df.isnull().sum()

Explanation: Count null values in each column.

Drop Rows with Null Values

df_cleaned = df.dropna()

Explanation: Return a new DataFrame without any row that contains a null value. The original df is left unchanged.

Fill Null Values

df.fillna(value='Unknown', inplace=True)

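Explanation: Fill null values in every column with a placeholder. A string placeholder suits text columns; numeric columns usually need a numeric fill value.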
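As a rough illustration of the null-handling steps, the sketch below counts missing values and then fills each column with its own value by passing a dict to fillna. The DataFrame and its column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing title and one missing rating.
df_demo = pd.DataFrame({
    "Title": ["Movie A", None, "Movie C"],
    "Rating": [7.1, np.nan, 8.3],
})

# Missing values per column.
null_counts = df_demo.isnull().sum()

# A dict maps each column to its own fill value:
# a text placeholder for Title, the column median for Rating.
filled = df_demo.fillna({
    "Title": "Unknown",
    "Rating": df_demo["Rating"].median(),
})

print(null_counts)
print(filled)
```

Filling per column keeps Rating numeric, which a single string placeholder for the whole DataFrame would not.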

Unique Values in a Column

df['column_name'].unique()

Explanation: Display unique values in a specific column.

Value Counts

df['column_name'].value_counts()

Explanation: Count the occurrences of each unique value in a column.

Filter Rows by Condition

df_filtered = df[df['column_name'] > some_value]

Explanation: Keep only the rows where the condition is true. Here some_value is a placeholder for your own threshold.
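Conditions can also be combined. The sketch below, on invented data, uses & (and) and | (or); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical data for illustration.
df_demo = pd.DataFrame({
    "Year": [2014, 2016, 2016, 2015],
    "Rating": [6.5, 8.1, 7.2, 8.8],
})

some_value = 8.0  # placeholder threshold, chosen arbitrarily

# Single condition.
high_rated = df_demo[df_demo["Rating"] > some_value]

# Both conditions must hold.
high_rated_2016 = df_demo[(df_demo["Year"] == 2016) & (df_demo["Rating"] > some_value)]

# Either condition may hold.
either = df_demo[(df_demo["Year"] == 2014) | (df_demo["Rating"] > some_value)]

print(high_rated_2016)
```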

Selecting Multiple Columns

df[['column1', 'column2']]

Explanation: Select and display specific columns.

Add a New Column

df['new_column'] = df['column1'] + df['column2']

Explanation: Add a new column by combining values from other columns.

Rename Columns

df.rename(columns={'old_name': 'new_name'}, inplace=True)

Explanation: Rename columns for better readability.

Sorting Values

df.sort_values(by='column_name', ascending=False)

Explanation: Return a copy of the dataset sorted by a specific column in descending order. Assign the result to a variable to keep it.

Drop a Column

df.drop('column_name', axis=1, inplace=True)

Explanation: Remove a specific column.

Group By and Aggregate

df.groupby('column_name').sum(numeric_only=True)

Explanation: Group by a column and apply an aggregate function such as sum to the numeric columns.
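When different columns need different aggregates, named aggregation is one option. This is a small sketch on invented data; the output column names mean_rating and total_votes are arbitrary choices.

```python
import pandas as pd

# Hypothetical data for illustration.
df_demo = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Comedy"],
    "Rating": [8.0, 7.0, 6.5],
    "Votes": [100, 300, 50],
})

# Each keyword names an output column and maps it to (input column, function).
summary = df_demo.groupby("Genre").agg(
    mean_rating=("Rating", "mean"),
    total_votes=("Votes", "sum"),
)

print(summary)
```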

Calculate Mean of a Column

df['column_name'].mean()

Explanation: Calculate the mean of a specific column.

Calculate Median of a Column

df['column_name'].median()

Explanation: Calculate the median of a specific column.

Standard Deviation of a Column

df['column_name'].std()

Explanation: Calculate the standard deviation of a specific column.

Detecting Outliers

df[(df['column_name'] > upper_limit) | (df['column_name'] < lower_limit)]

Explanation: Select rows whose value falls outside limits you define. Both upper_limit and lower_limit are placeholders.
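The snippet does not say where the limits come from. One common convention, used here purely as an assumption, is the 1.5 × IQR rule: anything more than 1.5 interquartile ranges below the first quartile or above the third quartile is flagged. The data is invented.

```python
import pandas as pd

# Hypothetical runtimes with one obviously extreme value.
df_demo = pd.DataFrame({"Runtime": [90, 95, 100, 105, 110, 240]})

q1 = df_demo["Runtime"].quantile(0.25)
q3 = df_demo["Runtime"].quantile(0.75)
iqr = q3 - q1

# Limits derived from the 1.5 * IQR convention (an assumption, not a fixed rule).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = df_demo[(df_demo["Runtime"] > upper_limit) | (df_demo["Runtime"] < lower_limit)]
print(outliers)
```

Other choices, such as a fixed number of standard deviations from the mean, fit the same filter; only the two limit values change.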

Apply Custom Function

df['new_column'] = df['column_name'].apply(lambda x: x * 2)

Explanation: Apply a custom function to each value in a column.
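For simple arithmetic, a vectorized expression gives the same result as apply and is usually faster; apply is more useful when the logic needs a real function. A small sketch on invented data, with label_runtime as a hypothetical helper:

```python
import pandas as pd

# Hypothetical data for illustration.
df_demo = pd.DataFrame({"Runtime": [90, 120, 150]})

# Same result two ways: element-wise apply vs. a vectorized operation.
doubled_apply = df_demo["Runtime"].apply(lambda x: x * 2)
doubled_vectorized = df_demo["Runtime"] * 2

def label_runtime(minutes):
    """Hypothetical helper: classify a runtime as 'long' or 'short'."""
    return "long" if minutes > 100 else "short"

# apply calls the function once per value in the column.
df_demo["Length"] = df_demo["Runtime"].apply(label_runtime)
print(df_demo)
```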

Pivot Table

df.pivot_table(values='value_column', index='index_column', columns='column_name')

Explanation: Create a pivot table to analyze relationships.
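To make the shape of the result concrete, here is a minimal sketch on invented data. Rows come from index, columns from columns, and each cell holds the aggregate (the mean by default) of values. Combinations with no data appear as NaN.

```python
import pandas as pd

# Hypothetical data for illustration.
df_demo = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Comedy", "Comedy"],
    "Year": [2015, 2016, 2015, 2015],
    "Rating": [8.0, 7.0, 6.0, 7.0],
})

# One row per Genre, one column per Year, mean Rating in each cell.
table = df_demo.pivot_table(values="Rating", index="Genre", columns="Year", aggfunc="mean")
print(table)
```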

Correlation Matrix

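df.corr(numeric_only=True)

Explanation: Calculate the pairwise correlation matrix for numeric columns. Since pandas 2.0, numeric_only defaults to False, so text columns must be excluded explicitly or the call raises an error.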
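A quick sketch on invented data shows the effect of numeric_only=True: the text column is left out and only numeric columns appear in the matrix.

```python
import pandas as pd

# Hypothetical data: one text column and two perfectly correlated numeric columns.
df_demo = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Votes": [10, 20, 30, 40],
    "Rating": [5.0, 6.0, 7.0, 8.0],
})

# numeric_only=True skips the Title column instead of failing on it.
corr = df_demo.corr(numeric_only=True)
print(corr)
```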

Visualizing with Histograms

df['column_name'].hist()

Explanation: Plot a histogram for a column to view the distribution.

Scatter Plot

df.plot.scatter(x='column_x', y='column_y')

Explanation: Create a scatter plot to see relationships between two columns.

Box Plot

df.boxplot(column='column_name')

Explanation: Generate a box plot to identify the spread and outliers.

Exploring Dataset with Pandas and NumPy in Jupyter Notebook

1. Load Required Libraries

import pandas as pd
import numpy as np

Explanation: Load Pandas and NumPy libraries for data manipulation.

2. Load the Dataset

df = pd.read_csv('/path_to_your_dataset.csv')

Explanation: Load your dataset into a DataFrame.

3. Display the First 10 Rows

df.head(10)

Explanation: Display the first 10 rows for an initial look at the data.

4. Dataset Structure

df.info()

Explanation: Get information on data types and missing values for each column.

5. Descriptive Statistics

df.describe()

Explanation: Get a summary of statistics for numerical columns.

Exploring Individual Columns

6. Unique Ranks

df['Rank'].nunique()

Explanation: Count unique values in the Rank column.

7. Movies with Duplicate Ranks

df[df.duplicated(['Rank'])]

Explanation: Check for duplicate ranks, if any.

8. Top 5 Movies by Rank

df.nsmallest(5, 'Rank')

Explanation: Show the five best-ranked movies, that is, the rows with the five smallest Rank values.

9. Unique Titles

df['Title'].unique()

Explanation: List all unique movie titles.

10. Movies with Duplicate Titles

df[df.duplicated(['Title'])]

Explanation: Check for duplicate movie titles, if any.
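By default, duplicated marks only the second and later occurrences, so the first copy of a repeated title is not shown. Passing keep=False marks every occurrence. A small sketch on invented data:

```python
import pandas as pd

# Hypothetical data: title "A" appears twice.
df_demo = pd.DataFrame({
    "Title": ["A", "B", "A"],
    "Year": [2010, 2011, 2016],
})

# Default keep='first': only the later occurrence is flagged.
later_copies = df_demo[df_demo.duplicated(["Title"])]

# keep=False: all rows sharing a title are flagged.
all_copies = df_demo[df_demo.duplicated(["Title"], keep=False)]

print(all_copies)
```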

11. Unique Genres

df['Genre'].unique()

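Explanation: List all unique values of the Genre column. If a movie lists several genres in one field, each distinct combination counts as a separate value.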

12. Genre Frequency

df['Genre'].value_counts()

Explanation: Count the frequency of each genre.

13. Top 3 Genres by Average Rating

df.groupby('Genre')['Rating'].mean().nlargest(3)

Explanation: Find genres with the highest average ratings.
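Grouping on the raw Genre column treats each full value as one group. If Genre holds several comma-separated genres per movie (an assumption here, suggested by the substring match used in step 39), the values can be split first so each genre is averaged on its own. A sketch on invented data:

```python
import pandas as pd

# Hypothetical data with comma-separated genres (assumed format).
df_demo = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genre": ["Action,Sci-Fi", "Drama", "Action,Drama"],
    "Rating": [8.0, 6.0, 7.0],
})

# Split each Genre string into a list, then give every genre its own row.
per_genre = df_demo.assign(Genre=df_demo["Genre"].str.split(",")).explode("Genre")

# Average rating per individual genre, highest first.
avg_rating = per_genre.groupby("Genre")["Rating"].mean().sort_values(ascending=False)
print(avg_rating)
```

A movie with two genres contributes its rating to both groups, which is usually the intent when comparing genres.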

14. Average Description Length

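df['Description'].str.len().mean()

Explanation: Calculate the average length, in characters, of movie descriptions. Unlike apply(len), str.len() skips missing descriptions instead of raising an error.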

15. Top 5 Longest Descriptions

df.loc[df['Description'].str.len().nlargest(5).index, ['Title', 'Description']]

Explanation: Find movies with the longest descriptions.
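The two description steps can be checked on a tiny invented DataFrame that includes a missing description. The sketch uses the .str.len() accessor, which returns NaN for missing text so that mean and nlargest simply skip it.

```python
import pandas as pd

# Hypothetical data; the second description is missing.
df_demo = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Description": ["short", None, "a much longer text"],
})

# Character count per description; missing values stay missing.
lengths = df_demo["Description"].str.len()

average_length = lengths.mean()  # NaN entries are ignored

# Title of the single longest description.
longest = df_demo.loc[lengths.nlargest(1).index, "Title"]
print(average_length, longest.tolist())
```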

16. Unique Directors

df['Director'].nunique()

Explanation: Count unique directors in the dataset.

17. Top 5 Directors by Number of Movies

df['Director'].value_counts().head(5)

Explanation: Identify directors with the most movies in the dataset.

18. Directors with Highest Average Revenue

df.groupby('Director')['Revenue (Millions)'].mean().nlargest(5)

Explanation: Find directors with the highest average revenue.

19. Unique Actors

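df['Actors'].str.split(',').explode().str.strip().nunique()

Explanation: Count unique actors in the dataset. Each Actors value is a comma-separated cast list, so the names are split apart before counting; calling nunique() on the raw column would count unique cast lists instead.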

20. Most Frequent Actor Appearances

from collections import Counter
# Drop missing cast lists first so join() only receives strings.
actor_counts = Counter(", ".join(df['Actors'].dropna()).split(", "))
actor_counts.most_common(5)

Explanation: Identify the actors who appear most frequently.
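A pandas-only way to get individual names, sketched on invented data, is to split each cast list and explode it into one row per name. This assumes names are separated by commas; strip removes any space left after each comma.

```python
from collections import Counter
import pandas as pd

# Hypothetical cast lists; one row is missing.
df_demo = pd.DataFrame({"Actors": ["Ann Lee, Bob Ray", "Bob Ray, Cy Dow", None]})

# One row per actor name, with surrounding whitespace removed.
names = df_demo["Actors"].dropna().str.split(",").explode().str.strip()

actor_counts = Counter(names)
unique_actors = names.nunique()

print(actor_counts.most_common(1))
print(unique_actors)
```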

21. Number of Movies by Year

df['Year'].value_counts().sort_index()

Explanation: See how many movies were released each year.

22. Average Rating by Year

df.groupby('Year')['Rating'].mean()

Explanation: Track average movie ratings over the years.

23. Movies Released in 2016

df[df['Year'] == 2016]

Explanation: List all movies released in a specific year.

24. Average Runtime of Movies

df['Runtime (Minutes)'].mean()

Explanation: Calculate the average runtime of movies.

25. Movies with Runtime Above 150 Minutes

df[df['Runtime (Minutes)'] > 150]

Explanation: List movies with a runtime longer than 150 minutes.

26. Distribution of Runtime

df['Runtime (Minutes)'].plot(kind='hist', title='Runtime Distribution')

Explanation: Plot a histogram to see the distribution of movie runtimes.

27. Average Rating of Movies

df['Rating'].mean()

Explanation: Find the average rating across all movies.

28. Top 10 Rated Movies

df.nlargest(10, 'Rating')[['Title', 'Rating']]

Explanation: Display the 10 highest-rated movies.

29. Rating Distribution

df['Rating'].plot(kind='hist', title='Rating Distribution')

Explanation: Plot a histogram to visualize the distribution of ratings.

30. Movies with Over 500,000 Votes

df[df['Votes'] > 500000]

Explanation: List movies with a high number of votes.

31. Correlation between Votes and Rating

df[['Votes', 'Rating']].corr()

Explanation: Calculate the correlation between votes and ratings.

32. Total Revenue in Millions

df['Revenue (Millions)'].sum()

Explanation: Calculate the total revenue for all movies.

33. Top 10 Movies by Revenue

df.nlargest(10, 'Revenue (Millions)')[['Title', 'Revenue (Millions)']]

Explanation: Display the top 10 highest revenue-generating movies.

34. Revenue Distribution

df['Revenue (Millions)'].plot(kind='hist', title='Revenue Distribution')

Explanation: Plot a histogram to visualize revenue distribution.

35. Average Metascore

df['Metascore'].mean()

Explanation: Find the average Metascore across all movies.

36. Top 5 Movies by Metascore

df.nlargest(5, 'Metascore')[['Title', 'Metascore']]

Explanation: Display the movies with the highest Metascore.

37. Scatter Plot: Revenue vs. Metascore

df.plot.scatter(x='Metascore', y='Revenue (Millions)', title='Revenue vs. Metascore')

Explanation: Create a scatter plot to observe the relationship between revenue and Metascore.

Advanced Analysis

38. Pivot Table: Average Rating per Genre and Year

df.pivot_table(values='Rating', index='Genre', columns='Year', aggfunc='mean')

Explanation: Create a pivot table showing average ratings by genre and year.

39. Movies with Genre "Sci-Fi"

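sci_fi_movies = df[df['Genre'].str.contains('Sci-Fi', na=False)]
sci_fi_movies

Explanation: List all movies whose Genre value includes Sci-Fi. Setting na=False treats a missing genre as a non-match instead of producing an invalid mask.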

40. Rating vs. Revenue Correlation

df[['Rating', 'Revenue (Millions)']].corr()

Explanation: Calculate the correlation between rating and revenue.

41. Find Outliers in Runtime

df['Runtime (Minutes)'].plot(kind='box', title='Runtime Outliers')

Explanation: Use a box plot to identify runtime outliers.

42. Directors with High Average Metascores

df.groupby('Director')['Metascore'].mean().nlargest(5)

Explanation: Find directors with the highest average Metascores.

43. Genres with Highest Average Votes

df.groupby('Genre')['Votes'].mean().nlargest(5)

Explanation: Find the five genres with the highest average vote counts.

44. Bar Chart of Average Revenue by Genre

df.groupby('Genre')['Revenue (Millions)'].mean().sort_values().plot(kind='bar', title='Average Revenue by Genre')

Explanation: Plot average revenue per genre as a bar chart.

45. Revenue per Year

df.groupby('Year')['Revenue (Millions)'].sum().plot(kind='line', title='Total Revenue by Year')

Explanation: Plot total revenue over the years.

46. Movies per Decade

df['Decade'] = (df['Year'] // 10) * 10
df.groupby('Decade').size()

Explanation: Derive each movie's decade from its year with integer division, then count the movies in each decade.
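The integer-division step is easy to verify on a few invented years: dividing by 10 with // drops the last digit, and multiplying by 10 restores the decade.

```python
import pandas as pd

# Hypothetical release years for illustration.
df_demo = pd.DataFrame({"Year": [1999, 2006, 2009, 2016]})

# 1999 // 10 = 199, then 199 * 10 = 1990, and so on.
df_demo["Decade"] = (df_demo["Year"] // 10) * 10

# Number of movies in each decade.
movies_per_decade = df_demo.groupby("Decade").size()
print(movies_per_decade)
```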

Rajesh Kumar

I’m a DevOps/SRE/DevSecOps/Cloud Expert passionate about sharing knowledge and experiences. I am working at Cotocus. I blog tech insights at DevOps School, travel stories at Holiday Landmark, stock market tips at Stocks Mantra, health and fitness guidance at My Medic Plus, product reviews at I reviewed, and SEO strategies at Wizbrand.
