Pandas 101: The Ultimate Data Science Companion | by Shaibal Das Gupta | May, 2023 | Medium

Shaibal Das Gupta
6 min read · May 1, 2023


A quick and easy guide for every data science enthusiast.

Credit: Bing Chat and Dall.E

Pandas is a popular open-source library for handling and analyzing data. It offers effective tools for manipulating tabular data: reading and writing data in a variety of formats, cleaning and transforming data, selecting and filtering data based on a variety of criteria, aggregating and summarizing data, and visualizing data. This blog walks through a few of pandas' core capabilities. To explore further, check out the official pandas documentation.

Install Pandas

pip install pandas

Load File

Pandas supports different types of file formats including XML, HTML, JSON, XLSX, CSV, ZIP, TXT.

import pandas as pd
dataframe = pd.read_csv('filename.csv') # Loading data from a CSV file
dataframe = pd.read_excel('filename.xlsx') # Loading data from an Excel file
dataframe = pd.read_json('filename.json') # Loading data from a JSON file
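The intro mentioned that pandas can write data as well as read it. As a quick sketch (the filenames and columns here are hypothetical), each `read_*` function has a matching `to_*` method:

```python
import pandas as pd

# A small example frame, just for illustration
dataframe = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})

dataframe.to_csv('filename.csv', index=False)  # Write to a CSV file, without the index
dataframe.to_json('filename.json')             # Write to a JSON file
# dataframe.to_excel('filename.xlsx')          # Write to an Excel file (needs openpyxl installed)
```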

Viewing Data

Use the head method to view the first few rows of the DataFrame, and the tail method to view the last few. The info method shows each column’s dtype and other characteristics like its non-null count, while the describe method reports the mean, min, max, standard deviation, and percentiles. nlargest and nsmallest return the top n rows ranked by a specified column.

dataframe.head() # Displays the first 5 rows
dataframe.head(10) # Displays the first 10 rows
dataframe.tail() # Displays the last 5 rows
dataframe.tail(10) # Displays the last 10 rows
dataframe.nlargest(2, 'column_name') # Top 2 rows with the largest values in 'column_name'
dataframe.nsmallest(2, 'column_name') # Top 2 rows with the smallest values in 'column_name'
dataframe.info() # Displays the summary of the dataframe
dataframe.describe() # Generates descriptive statistics

To see the column and index names, use dataframe.columns and dataframe.index. dataframe['column_name'].value_counts() counts each unique value in a column, and dataframe['column_name'].tolist() returns the column values as a Python list.
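A minimal sketch of the methods above, using a made-up single-column frame:

```python
import pandas as pd

dataframe = pd.DataFrame({'color': ['red', 'blue', 'red', 'red']})

cols = list(dataframe.columns)              # ['color']
counts = dataframe['color'].value_counts()  # red appears 3 times, blue once
values = dataframe['color'].tolist()        # ['red', 'blue', 'red', 'red']
```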

Data Selection

You can select one or more columns without taking the entire DataFrame. The loc and iloc indexers select specific rows and columns: loc selects by label, so you refer to rows and columns by name, while iloc selects by integer position, so you refer to them by row and column index.

dataframe['column_name'] # Selecting a single column
dataframe[['column1', 'column2']] # Selecting multiple columns
dataframe.loc[row_index, 'column_name'] # Selecting a single cell using label indexing
dataframe.loc[dataframe['column_name'] == 'filter_value'] # Selecting a subset of Dataframe using label indexing
dataframe.loc[dataframe['column_name'] == 'filter_value', ['selected_column_name']] # Selecting a single column where condition matched
dataframe.iloc[row_index, column_index] # Selecting a single cell using integer indexing
dataframe.iloc[row_start:row_end, col_start:col_end] # Selecting multiple cells using integer indexing

Data Manipulation

Pandas makes many data manipulation operations simple: we can concatenate strings or do column-wise arithmetic with ease. The drop method removes rows or columns by label (and can remove rows selected by a filter, by passing their index labels), while drop_duplicates removes duplicate rows, optionally considering only a subset of columns. The replace method updates a column’s values when a match is found, and rename changes a column’s name. The optional inplace argument, when True, modifies the DataFrame in place instead of returning a copy. The axis argument indicates whether an operation applies to rows (axis=0) or columns (axis=1).

dataframe['new_column'] = dataframe['column1'] + dataframe['column2'] # Creating a new column by adding existing columns
dataframe.drop(['column1', 'column2'], axis=1, inplace=True) # Dropping columns
dataframe.drop_duplicates(subset=['column1', 'column2'], inplace=True) # Dropping duplicates
dataframe.rename(columns={'old_name': 'new_name'}, inplace=True) # Renaming columns
dataframe.replace(to_replace='old_value', value='new_value', inplace=True) # Replacing values
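As mentioned above, drop can also remove rows selected by a filter: pass the index labels of the matching rows. A small sketch with a hypothetical 'score' column:

```python
import pandas as pd

dataframe = pd.DataFrame({'score': [10, 95, 30]})

# Select the rows to discard, then drop them by their index labels
dataframe = dataframe.drop(dataframe[dataframe['score'] < 50].index)
```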

You can filter a DataFrame with as many conditions as necessary, whether through helper methods or conditional operators. The isin method filters rows whose values appear in a given list, and the ~ operator negates a boolean mask (a "not" operation).

dataframe[dataframe['column_name'] > value] # Filtering data based on a condition
dataframe[dataframe['column_name'].isin(['value1', 'value2'])] # Filtering data based on a list of values
dataframe[(dataframe['column1'] > value) & (dataframe['column2'] < value)] # Filtering data based on multiple conditions
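The ~ negation mentioned above pairs naturally with isin. A minimal sketch, with an invented 'city' column:

```python
import pandas as pd

dataframe = pd.DataFrame({'city': ['Dhaka', 'Paris', 'Rome']})

# ~ inverts the boolean mask: keep rows NOT in the list
others = dataframe[~dataframe['city'].isin(['Paris', 'Rome'])]
```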

The powerful groupby method divides data into groups based on one or more columns; a function can then be applied to each group and the results combined. This is frequently used to aggregate data and compute summary statistics for subsets of the data. With groupby, we can also transform the data within each group, filter groups based on particular criteria, and apply custom functions to each group, making it a core tool for data analysis in pandas. Here are some examples:

# Grouping the DataFrame by the 'category' column and finding the mean of the 'value' column for each group
dataframe.groupby('category')['value'].mean()

# Grouping the DataFrame by the 'category' and 'date' columns and finding the sum of the 'value' column for each group
dataframe.groupby(['category', 'date'])['value'].sum()

# Grouping the DataFrame by the 'category' column and finding the mean, sum, and count of the 'value' column for each group
dataframe.groupby('category')['value'].agg(['mean', 'sum', 'count'])

You can also apply a custom method to each group of Dataframe.

# Defining a custom function to apply to each group
def custom_function(group):
    # Do some operations on the group, e.g. the range of 'value'
    result = group['value'].max() - group['value'].min()
    return result

# Grouping the DataFrame by the 'category' column and applying the custom function to each group
dataframe.groupby('category').apply(custom_function)
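The transform and filter operations mentioned above deserve a sketch of their own (column names here are hypothetical): transform returns a result aligned to the original rows, and filter keeps only the groups that satisfy a condition.

```python
import pandas as pd

dataframe = pd.DataFrame({'category': ['a', 'a', 'b'],
                          'value': [1, 3, 10]})

# transform broadcasts each group's statistic back to the group's rows
dataframe['group_mean'] = dataframe.groupby('category')['value'].transform('mean')

# filter drops whole groups that fail the condition
big_groups = dataframe.groupby('category').filter(lambda g: g['value'].sum() > 5)
```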

You can sort Dataframe with its pre-built methods.

dataframe.sort_values('column_name', ascending=True) # Sorting data by a single column
dataframe.sort_values(['column1', 'column2'], ascending=[True, False]) # Sorting data by multiple columns
dataframe.sort_index(ascending=False) # Sorting data by index
dataframe.sort_index(key=lambda x: x.str.lower()) # Sorting a string index case-insensitively

Merge, concat, and join in pandas can each be done with a single line of code. merge and join are similar, but join sets different defaults: it performs a left join and matches on the index (or a specified column of the caller against the other frame's index), filling missing values with NaN. merge, also available as the standalone pd.merge function, combines two DataFrames based on the values of one or more columns and provides more control over the merge than join. concat lets you concatenate two or more DataFrames (or Series) either along rows (axis=0) or columns (axis=1).

# Inner join
dataframe1.merge(dataframe2, on='column_name', how='inner')
# join matches the caller's column against the other frame's index, hence set_index
dataframe1.join(dataframe2.set_index('column_name'), on='column_name', how='inner')

# Left join
dataframe1.merge(dataframe2, on='column_name', how='left')
dataframe1.join(dataframe2.set_index('column_name'), on='column_name', how='left')

# Right join
dataframe1.merge(dataframe2, on='column_name', how='right')
dataframe1.join(dataframe2.set_index('column_name'), on='column_name', how='right')

# Full outer join
dataframe1.merge(dataframe2, on='column_name', how='outer')
dataframe1.join(dataframe2.set_index('column_name'), on='column_name', how='outer')

# Combine two dataframes in either a horizontal or vertical direction
pd.concat([dataframe1, dataframe2], axis=0, join='outer', ignore_index=False)
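To make the axis argument of concat concrete, here is a minimal sketch with two invented two-row frames: axis=0 stacks them vertically, axis=1 places them side by side.

```python
import pandas as pd

dataframe1 = pd.DataFrame({'id': [1, 2], 'x': ['a', 'b']})
dataframe2 = pd.DataFrame({'id': [3, 4], 'x': ['c', 'd']})

stacked = pd.concat([dataframe1, dataframe2], axis=0, ignore_index=True)  # 4 rows, 2 columns
side_by_side = pd.concat([dataframe1, dataframe2], axis=1)                # 2 rows, 4 columns
```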

With pandas, you can perform many different string operations, including lower, upper, strip, and replace. Here are a few examples:

dataframe['column_name'].str.lower() # Converting text to lowercase
dataframe['column_name'].str.upper() # Converting text to uppercase
dataframe['column_name'].str.strip() # Removing leading and trailing whitespace
dataframe['column_name'].str.replace('old_value', 'new_value') # Replacing text
dataframe['column_name'].str.split('delimiter') # Splitting text into a list of parts (use expand=True for separate columns)
dataframe['column_name'].str.contains('matched_string', regex=False) # Checks whether each value contains the string
dataframe['column_name'].str.extract(r'(\w+)', expand=False) # Extracts the first matching substring
dataframe['column_name'].str.len() # Length of each string
dataframe['column_name'].str.startswith('matched_string') # Checks which values start with the string
dataframe['column_name'].str.endswith('matched_string') # Checks which values end with the string
dataframe['column_name1'].str.cat(dataframe['column_name2'], sep=' - ') # Concatenates two columns with a separator
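Since split returns lists by default, turning the parts into separate columns takes expand=True. A quick sketch with an invented 'full_name' column:

```python
import pandas as pd

dataframe = pd.DataFrame({'full_name': ['Ada Lovelace', 'Alan Turing']})

# expand=True spreads the split parts into separate integer-labeled columns
parts = dataframe['full_name'].str.split(' ', expand=True)
dataframe['first'] = parts[0]
dataframe['last'] = parts[1]
```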

Reshaping Data

Reshaping data in pandas means changing a DataFrame’s shape, for example through pivoting, melting, or stacking. This can help with organizing data, analyzing it, and preparing it for machine learning or visualization. The pivot methods transform a DataFrame from long to wide format by pivoting on one or more columns; melt does the reverse, reshaping a DataFrame from wide to long format; stack moves a DataFrame’s columns into a single column.

dataframe.pivot_table(values='value', index='index_column', columns='column_name') # Creating a pivot table
pd.melt(dataframe, id_vars='id_column', value_vars=['column1', 'column2']) # Melting data
dataframe.set_index(['column_name1', 'column_name2']).stack().reset_index(name='Value') # Stacking the columns into single column
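To see the wide-to-long reshape concretely, here is a minimal melt sketch (the 'id', 'math', and 'physics' columns are invented): each (id, subject) pair becomes its own row.

```python
import pandas as pd

wide = pd.DataFrame({'id': [1, 2], 'math': [90, 80], 'physics': [85, 70]})

# Wide -> long: one row per (id, subject) pair
melted = pd.melt(wide, id_vars='id', value_vars=['math', 'physics'],
                 var_name='subject', value_name='score')
```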

Handling Time Series Data

A time series is a collection of data recorded over time, widely used in fields like finance, economics, and weather forecasting. Pandas’ built-in datetime features make working with time series data straightforward. Several typical operations:

dataframe['date_column'] = pd.to_datetime(dataframe['date_column']) # Converting a column to a datetime object
dataframe.set_index('date_column', inplace=True) # Setting the index to a datetime object
dataframe.resample('D').mean() # Resampling data by day and calculating the mean
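Putting the three steps above together, here is a small end-to-end sketch with invented timestamps: after parsing dates and setting them as the index, resample('D') buckets the rows by calendar day.

```python
import pandas as pd

dataframe = pd.DataFrame({
    'date_column': ['2023-01-01 01:00', '2023-01-01 13:00', '2023-01-02 09:00'],
    'value': [10, 30, 50],
})
dataframe['date_column'] = pd.to_datetime(dataframe['date_column'])
dataframe.set_index('date_column', inplace=True)

daily = dataframe.resample('D').mean()  # one row per calendar day
```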

In conclusion, pandas is a very useful and versatile Python package for data processing and analysis. It includes effective capabilities for working with tabular data, such as reading and writing data in different formats, cleaning and transforming data, selecting and filtering data, aggregating and summarizing data, and visualizing data. Pandas has become a popular choice for data scientists, analysts, and researchers across many sectors because of its broad feature set, and it continues to expand and improve with each new release. If you work with data in Python, you should be familiar with the pandas library.


Shaibal Das Gupta

W3 Engineers Ltd. | MLOps | Vector Databases | Rag and LLM Agents | Big Data | Data Science | Machine Learning | Analytics | Google Marketing | AWS | ETL