Jupyter Notebook – Lab Session 1 – Exploring a Dataset with Pandas and NumPy

Importing Libraries

import pandas as pd
import numpy as np

Explanation: Import pandas for tabular data handling and NumPy for numerical operations.

Loading the Dataset

df = pd.read_csv('/path_to_your_dataset.csv')

Explanation: Load the dataset into a Pandas DataFrame.
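To try the loading step without a file on disk, read_csv can also read from a file-like object. The following is a minimal sketch; the CSV text, column names, and the sample_df variable are made up for illustration and are not part of the lab dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,score,city
Asha,81,Pune
Ravi,,Delhi
Meera,92,
"""

# read_csv accepts a path or any file-like object; empty fields become NaN.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)           # (rows, columns)
print(sample_df.isnull().sum())  # missing values per column
```

Replacing io.StringIO(csv_text) with a real path gives the same kind of DataFrame.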

Display First Few Rows

df.head()

Explanation: Display the first five rows to understand the structure.

Display Last Few Rows

df.tail()

Explanation: Display the last five rows of the dataset.

Dataset Information

df.info()

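Explanation: Get an overview of the DataFrame: the row count, each column's data type and non-null count, and memory usage. A non-null count below the row count means the column has missing values.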

Descriptive Statistics

df.describe()

Explanation: Get summary statistics (count, mean, standard deviation, min, quartiles, and max) for each numeric column. The median appears as the 50% row.
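As a small illustration of how to read the describe() output, the sketch below uses a made-up DataFrame (the demo name and its columns are assumptions) to show that the 50% row matches the median and that include="all" adds text columns to the summary.

```python
import pandas as pd

# Hypothetical data: one numeric and one text column.
demo = pd.DataFrame({"score": [10, 20, 30, 40, 100],
                     "grade": ["C", "B", "B", "A", "A"]})

stats = demo.describe()               # numeric columns only by default
print(stats.loc["50%", "score"])      # the 50% row is the median
print(demo["score"].median())         # same value, computed directly

all_stats = demo.describe(include="all")  # also summarizes text columns
print(all_stats.loc["unique", "grade"])   # number of distinct grades
```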

Column Names

df.columns

Explanation: List all column names in the dataset.

Shape of the Dataset

df.shape

Explanation: Get the number of rows and columns.

Check for Null Values

df.isnull().sum()

Explanation: Count null values in each column.

Drop Rows with Null Values

df_cleaned = df.dropna()

Explanation: Remove every row that contains at least one null value. The result is stored in df_cleaned; df itself is unchanged.

Fill Null Values

df.fillna(value='Unknown', inplace=True)

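Explanation: Fill null values in every column with the placeholder 'Unknown', modifying df in place. Note that a numeric column containing nulls will then hold a mix of numbers and text.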
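When numeric columns should stay numeric, one option is to give each column its own fill value. This is a sketch only; the people DataFrame, its column names, and the choice of the median as a fill value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a text column.
people = pd.DataFrame({"age": [25.0, np.nan, 35.0],
                       "city": ["Pune", None, "Delhi"]})

# A dict maps each column to its own fill value, so 'age' stays numeric.
filled = people.fillna({"age": people["age"].median(), "city": "Unknown"})
print(filled)
```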

Unique Values in a Column

df['column_name'].unique()

Explanation: Display unique values in a specific column.

Value Counts

df['column_name'].value_counts()

Explanation: Count the occurrences of each unique value in a column.

Filter Rows by Condition

df_filtered = df[df['column_name'] > some_value]

Explanation: Keep only the rows where the condition is true. Here some_value is a placeholder for a threshold you define.
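Several conditions can be combined in one filter. The sketch below assumes a made-up items DataFrame and a hypothetical threshold; each condition is wrapped in parentheses and joined with & (and) or | (or).

```python
import pandas as pd

# Hypothetical data; 'price' and 'category' are illustrative column names.
items = pd.DataFrame({"price": [5, 15, 25, 35],
                      "category": ["a", "b", "a", "b"]})
some_value = 10  # hypothetical threshold

# Wrap each condition in parentheses and combine with & (and) or | (or).
both = items[(items["price"] > some_value) & (items["category"] == "a")]
either = items[(items["price"] > 30) | (items["category"] == "a")]
print(both)
print(either)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.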

Selecting Multiple Columns

df[['column1', 'column2']]

Explanation: Select and display specific columns.

Add a New Column

df['new_column'] = df['column1'] + df['column2']

Explanation: Add a new column holding the row-by-row sum of two existing columns.

Rename Columns

df.rename(columns={'old_name': 'new_name'}, inplace=True)

Explanation: Rename columns for better readability.

Sorting Values

df.sort_values(by='column_name', ascending=False)

Explanation: Sort the dataset by a specific column in descending order. The call returns a new sorted DataFrame and leaves df unchanged.

Drop a Column

df.drop('column_name', axis=1, inplace=True)

Explanation: Remove a specific column.

Group By and Aggregate

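df.groupby('column_name').sum(numeric_only=True)

Explanation: Group by a column and apply an aggregate function like sum. numeric_only=True restricts the sum to numeric columns; without it, recent pandas versions also try to sum text columns.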
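Several aggregates can be computed in one pass with agg. This is a minimal sketch on a made-up sales DataFrame; the column names are assumptions.

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame({"region": ["N", "N", "S", "S", "S"],
                      "units": [1, 3, 2, 4, 6],
                      "note": ["x", "y", "z", "x", "y"]})

# Select the numeric column explicitly, then apply several aggregates at once.
summary = sales.groupby("region")["units"].agg(["sum", "mean", "count"])
print(summary)
```

Selecting the column before aggregating keeps the text column out of the calculation.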

Calculate Mean of a Column

df['column_name'].mean()

Explanation: Calculate the mean of a specific column.

Calculate Median of a Column

df['column_name'].median()

Explanation: Calculate the median of a specific column.

Standard Deviation of a Column

df['column_name'].std()

Explanation: Calculate the standard deviation of a specific column.

Detecting Outliers

df[(df['column_name'] > upper_limit) | (df['column_name'] < lower_limit)]

Explanation: Detect outliers by selecting rows above upper_limit or below lower_limit. Both limits are placeholders that must be defined before this line runs.
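One common way to choose the limits is the 1.5 × IQR rule. The lab does not say how the limits should be set, so treat this as one possible convention rather than the intended method; the values DataFrame is made up.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value.
values = pd.DataFrame({"column_name": [10, 12, 11, 13, 12, 95]})

q1 = values["column_name"].quantile(0.25)
q3 = values["column_name"].quantile(0.75)
iqr = q3 - q1                    # interquartile range
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = values[(values["column_name"] > upper_limit) |
                  (values["column_name"] < lower_limit)]
print(outliers)
```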

Apply Custom Function

df['new_column'] = df['column_name'].apply(lambda x: x * 2)

Explanation: Apply a custom function to each value in a column.
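For simple arithmetic, a vectorized expression gives the same result as apply and is usually faster. The sketch below compares the two and then uses apply for branching logic; the nums DataFrame and the label helper are hypothetical.

```python
import pandas as pd

nums = pd.DataFrame({"column_name": [1, 2, 3]})

doubled_apply = nums["column_name"].apply(lambda x: x * 2)
doubled_vector = nums["column_name"] * 2   # same result, no Python-level loop

def label(x):
    """Hypothetical helper: apply suits logic that is awkward to vectorize."""
    return "high" if x >= 2 else "low"

nums["level"] = nums["column_name"].apply(label)
print(nums)
```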

Pivot Table

df.pivot_table(values='value_column', index='index_column', columns='column_name')

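Explanation: Create a pivot table with one row per index_column value and one column per column_name value. Each cell holds the mean of value_column for that combination, since mean is the default aggregation.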
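A small worked case may make the reshaping easier to follow. This sketch reuses the placeholder column names from the snippet above on made-up data.

```python
import pandas as pd

# Hypothetical long-format data using the placeholder names from above.
long_df = pd.DataFrame({
    "index_column": ["a", "a", "b", "b", "b"],
    "column_name":  ["x", "y", "x", "x", "y"],
    "value_column": [1.0, 2.0, 3.0, 5.0, 7.0],
})

# aggfunc defaults to 'mean'; naming it makes the intent explicit.
table = long_df.pivot_table(values="value_column", index="index_column",
                            columns="column_name", aggfunc="mean")
print(table)
```

The two rows for ("b", "x") are averaged into a single cell.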

Correlation Matrix

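df.corr(numeric_only=True)

Explanation: Calculate the correlation matrix for numeric columns. Passing numeric_only=True is needed in recent pandas versions, where df.corr() raises an error if the DataFrame contains text columns.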

Visualizing with Histograms

df['column_name'].hist()

Explanation: Plot a histogram for a column to view the distribution.

Scatter Plot

df.plot.scatter(x='column_x', y='column_y')

Explanation: Create a scatter plot to see relationships between two columns.

Box Plot

df.boxplot(column='column_name')

Explanation: Generate a box plot to identify the spread and outliers.

Live Example with the Attached Dataset


Exploring Dataset with Pandas and NumPy in Jupyter Notebook

1. Load Required Libraries

import pandas as pd
import numpy as np

Explanation: Load Pandas and NumPy libraries for data manipulation.

2. Load the Dataset

df = pd.read_csv('/path_to_your_dataset.csv')

Explanation: Load your dataset into a DataFrame.

3. Display the First 10 Rows

df.head(10)

Explanation: Display the first 10 rows for an initial look at the data.

4. Dataset Structure

df.info()

Explanation: Get each column's data type and non-null count; a count below the total number of rows indicates missing values.

5. Descriptive Statistics

df.describe()

Explanation: Get a summary of statistics for numerical columns.


Exploring Individual Columns

6. Unique Ranks

df['Rank'].nunique()

Explanation: Count unique values in the Rank column.

7. Movies with Duplicate Ranks

df[df.duplicated(['Rank'])]

Explanation: Show rows whose Rank already appeared in an earlier row. An empty result means there are no duplicate ranks.
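By default, duplicated() leaves the first occurrence unmarked. To see every row that shares a value, keep=False can be passed. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows; 'Rank' 2 appears twice.
ranks = pd.DataFrame({"Rank": [1, 2, 2, 3],
                      "Title": ["A", "B", "C", "D"]})

later_copies = ranks[ranks.duplicated(["Rank"])]            # default keep='first'
all_copies = ranks[ranks.duplicated(["Rank"], keep=False)]  # every row sharing a Rank
print(later_copies)
print(all_copies)
```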

8. Top 5 Movies by Rank

df.nsmallest(5, 'Rank')

Explanation: Show the five movies with the lowest Rank values, where rank 1 is the best.


9. Unique Titles

df['Title'].unique()

Explanation: List all unique movie titles.

10. Movies with Duplicate Titles

df[df.duplicated(['Title'])]

Explanation: Show rows whose Title already appeared in an earlier row. An empty result means there are no duplicate titles.


11. Unique Genres

df['Genre'].unique()

Explanation: List all unique genres in the dataset.

12. Genre Frequency

df['Genre'].value_counts()

Explanation: Count the frequency of each genre.
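If a Genre cell holds several genres in one string (the Sci-Fi filter later in this lab suggests it does), value_counts() counts whole combinations rather than individual genres. The sketch below assumes a comma separator, which is a guess about the dataset, and uses made-up rows.

```python
import pandas as pd

# Hypothetical rows; assumes genres are comma-separated within one cell.
movies = pd.DataFrame({"Title": ["A", "B", "C"],
                       "Genre": ["Action,Sci-Fi", "Drama", "Action, Drama"]})

# Split each cell into a list, give every genre its own row, trim spaces.
single_genres = movies["Genre"].str.split(",").explode().str.strip()
genre_counts = single_genres.value_counts()
print(genre_counts)
```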

13. Top 3 Genres by Average Rating

df.groupby('Genre')['Rating'].mean().nlargest(3)

Explanation: Find genres with the highest average ratings.


14. Average Description Length

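df['Description'].str.len().mean()

Explanation: Calculate the average length of movie descriptions, in characters. Using .str.len() skips missing descriptions, whereas apply(len) raises an error on them.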

15. Top 5 Longest Descriptions

df.loc[df['Description'].str.len().nlargest(5).index, ['Title', 'Description']]

Explanation: Find the five movies with the longest descriptions. Using .str.len() skips missing descriptions instead of raising an error.


16. Unique Directors

df['Director'].nunique()

Explanation: Count unique directors in the dataset.

17. Top 5 Directors by Number of Movies

df['Director'].value_counts().head(5)

Explanation: Identify directors with the most movies in the dataset.

18. Directors with Highest Average Revenue

df.groupby('Director')['Revenue (Millions)'].mean().nlargest(5)

Explanation: Find directors with the highest average revenue.


19. Unique Actors

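df['Actors'].str.split(',').explode().str.strip().nunique()

Explanation: Count unique actors in the dataset. Each Actors cell holds a comma-separated cast list, so the cell is split into individual names first; calling nunique() on the raw column would count unique cast lists instead.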

20. Most Frequent Actor Appearances

from collections import Counter
actor_counts = Counter(", ".join(df['Actors']).split(", "))
actor_counts.most_common(5)

Explanation: Identify the actors who appear most frequently.
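The same count can be done with pandas alone, which also tolerates rows where the cast is missing. This is a sketch on made-up names, assuming the same comma-separated layout as the Counter example.

```python
import numpy as np
import pandas as pd

# Hypothetical cast lists; one row is missing.
cast = pd.DataFrame({"Actors": ["Ann Lee, Bo Chen", np.nan, "Bo Chen, Cy Diaz"]})

# Drop missing rows, split on commas, one name per row, trim spaces.
actor_series = cast["Actors"].dropna().str.split(",").explode().str.strip()
top_actors = actor_series.value_counts().head(5)
print(top_actors)
print(actor_series.nunique())
```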


21. Number of Movies by Year

df['Year'].value_counts().sort_index()

Explanation: See how many movies were released each year.

22. Average Rating by Year

df.groupby('Year')['Rating'].mean()

Explanation: Track average movie ratings over the years.

23. Movies Released in 2016

df[df['Year'] == 2016]

Explanation: List all movies released in 2016. Change the value to filter for another year.


24. Average Runtime of Movies

df['Runtime (Minutes)'].mean()

Explanation: Calculate the average runtime of movies.

25. Movies with Runtime Above 150 Minutes

df[df['Runtime (Minutes)'] > 150]

Explanation: List movies with a runtime longer than 150 minutes.

26. Distribution of Runtime

df['Runtime (Minutes)'].plot(kind='hist', title='Runtime Distribution')

Explanation: Plot a histogram to see the distribution of movie runtimes.


27. Average Rating of Movies

df['Rating'].mean()

Explanation: Find the average rating across all movies.

28. Top 10 Rated Movies

df.nlargest(10, 'Rating')[['Title', 'Rating']]

Explanation: Display the 10 highest-rated movies.

29. Rating Distribution

df['Rating'].plot(kind='hist', title='Rating Distribution')

Explanation: Plot a histogram to visualize the distribution of ratings.


30. Movies with Over 500,000 Votes

df[df['Votes'] > 500000]

Explanation: List movies with more than 500,000 votes.

31. Correlation between Votes and Rating

df[['Votes', 'Rating']].corr()

Explanation: Calculate the correlation between votes and ratings.
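When only the single coefficient is needed rather than the 2×2 matrix, Series.corr returns it directly. A minimal sketch on made-up values chosen to be perfectly linear:

```python
import pandas as pd

# Hypothetical values where Rating rises exactly linearly with Votes.
sample = pd.DataFrame({"Votes": [100, 200, 300, 400],
                       "Rating": [6.0, 6.5, 7.0, 7.5]})

matrix = sample[["Votes", "Rating"]].corr()   # 2x2 table, 1.0 on the diagonal
r = sample["Votes"].corr(sample["Rating"])    # single Pearson coefficient
print(matrix)
print(round(r, 3))
```

The off-diagonal entry of the matrix is the same number that Series.corr returns.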


32. Total Revenue in Millions

df['Revenue (Millions)'].sum()

Explanation: Calculate the total revenue for all movies.

33. Top 10 Movies by Revenue

df.nlargest(10, 'Revenue (Millions)')[['Title', 'Revenue (Millions)']]

Explanation: Display the top 10 highest revenue-generating movies.

34. Revenue Distribution

df['Revenue (Millions)'].plot(kind='hist', title='Revenue Distribution')

Explanation: Plot a histogram to visualize revenue distribution.


35. Average Metascore

df['Metascore'].mean()

Explanation: Find the average Metascore across all movies.

36. Top 5 Movies by Metascore

df.nlargest(5, 'Metascore')[['Title', 'Metascore']]

Explanation: Display the five movies with the highest Metascore.

37. Scatter Plot: Revenue vs. Metascore

df.plot.scatter(x='Metascore', y='Revenue (Millions)', title='Revenue vs. Metascore')

Explanation: Create a scatter plot to observe the relationship between revenue and Metascore.


Advanced Analysis

38. Pivot Table: Average Rating per Genre and Year

df.pivot_table(values='Rating', index='Genre', columns='Year', aggfunc='mean')

Explanation: Create a pivot table showing average ratings by genre and year.

39. Movies with Genre "Sci-Fi"

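sci_fi_movies = df[df['Genre'].str.contains('Sci-Fi', na=False)]
sci_fi_movies

Explanation: List all movies whose Genre contains 'Sci-Fi'. Passing na=False treats missing genres as non-matches, so the filter does not fail on null values.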

40. Rating vs. Revenue Correlation

df[['Rating', 'Revenue (Millions)']].corr()

Explanation: Calculate the correlation between rating and revenue.

41. Find Outliers in Runtime

df['Runtime (Minutes)'].plot(kind='box', title='Runtime Outliers')

Explanation: Use a box plot to identify runtime outliers, which are drawn as individual points beyond the whiskers.

42. Directors with High Average Metascores

df.groupby('Director')['Metascore'].mean().nlargest(5)

Explanation: Find directors with the highest average Metascores.

43. Top 5 Genres by Average Votes

df.groupby('Genre')['Votes'].mean().nlargest(5)

Explanation: Find the five genres with the highest average number of votes.

44. Bar Chart of Average Revenue by Genre

df.groupby('Genre')['Revenue (Millions)'].mean().sort_values().plot(kind='bar', title='Average Revenue by Genre')

Explanation: Plot average revenue per genre as a bar chart.

45. Revenue per Year

df.groupby('Year')['Revenue (Millions)'].sum().plot(kind='line', title='Total Revenue by Year')

Explanation: Plot total revenue over the years.

46. Movies per Decade

df['Decade'] = (df['Year'] // 10) * 10
df.groupby('Decade').size()

Explanation: Derive a Decade column from Year, then count the movies in each decade.
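The decade arithmetic can be checked on a few sample years. In this sketch the years DataFrame is made up; floor division by 10 drops the last digit, and multiplying by 10 restores the scale, so 2016 becomes 2010.

```python
import pandas as pd

years = pd.DataFrame({"Year": [1999, 2006, 2010, 2016]})

# Floor division drops the last digit; multiplying by 10 restores the scale.
years["Decade"] = (years["Year"] // 10) * 10
per_decade = years.groupby("Decade").size()
print(per_decade)
```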
