Pandas 101: The Ultimate Data Science Companion | by Shaibal Das Gupta | May, 2023 | Medium
A quick and easy guide for every data science enthusiast.
The pandas module is a popular open-source tool for handling and analyzing data. It offers effective tools for manipulating tabular data: reading and writing data in a variety of formats, cleaning and transforming data, selecting and filtering data by various criteria, aggregating and summarizing data, and visualizing data. This blog covers a few of pandas' capabilities; to explore more, check out the official pandas documentation.
Install Pandas
pip install pandas
Load File
Pandas supports many file formats, including XML, HTML, JSON, XLSX, CSV, ZIP, and TXT.
import pandas as pd
dataframe = pd.read_csv('filename.csv') # Loading data from a CSV file
dataframe = pd.read_excel('filename.xlsx') # Loading data from an Excel file
dataframe = pd.read_json('filename.json') # Loading data from a Json file
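The intro also mentions that pandas can write data back out, not just read it. A minimal round-trip sketch (the DataFrame contents and file names below are made up for illustration):

```python
import pandas as pd

# Build a small DataFrame in memory (hypothetical example data)
df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [95, 90]})

# Write it out in a couple of formats; file names are placeholders
df.to_csv("scores.csv", index=False)  # CSV, without the index column
df.to_json("scores.json")             # JSON

# Reading the CSV back yields the same data
df2 = pd.read_csv("scores.csv")
```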
Viewing Data
Use the head method to view the first few rows of the DataFrame and the tail method to view the last few rows. The info method shows each column's type and other characteristics such as the non-null count, and the describe method reports the mean, min, max, standard deviation, and percentiles. nlargest and nsmallest return the top N rows by a specified column.
dataframe.head() # Displays the first 5 rows
dataframe.head(10) # Displays the first 10 rows
dataframe.tail() # Displays the last 5 rows
dataframe.tail(10) # Displays the last 10 rows
dataframe.nlargest(2, 'column_name') # Top 2 rows with the largest values in the column
dataframe.nsmallest(2, 'column_name') # Top 2 rows with the smallest values in the column
dataframe.info() # Displays the summary of the dataframe
dataframe.describe() # Generates descriptive statistics
To see the column and index names, you can use dataframe.columns and dataframe.index. dataframe['column_name'].value_counts() gives the count of each unique value in the specified column, and dataframe['column_name'].tolist() returns the column values as a list.
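The inspection methods above can be tried on a tiny DataFrame built in memory (the column names and values here are invented for the demo):

```python
import pandas as pd

# A small DataFrame to demonstrate the inspection methods above
df = pd.DataFrame({
    "city": ["Dhaka", "Dhaka", "Sylhet"],
    "temp": [30, 32, 25],
})

first_rows = df.head(2)             # first 2 rows
counts = df["city"].value_counts()  # count per unique city
hottest = df.nlargest(1, "temp")    # row with the highest temp
temps = df["temp"].tolist()         # column values as a plain list
```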
Data Selection
You can select one or more columns without taking the entire DataFrame. The loc and iloc methods select specific rows and columns: loc selects by label, so label names must be used, while the integer-based iloc selects by row and column position.
dataframe['column_name'] # Selecting a single column
dataframe[['column1', 'column2']] # Selecting multiple columns
dataframe.loc[row_index, 'column_name'] # Selecting a single cell using label indexing
dataframe.loc[dataframe['column_name'] == 'filter_value'] # Selecting a subset of Dataframe using label indexing
dataframe.loc[dataframe['column_name'] == 'filter_value', ['selected_column_name']] # Selecting a single column where condition matched
dataframe.iloc[row_index, column_index] # Selecting a single cell using integer indexing
dataframe.iloc[row_start:row_end, col_start:col_end] # Selecting multiple cells using integer indexing
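The loc/iloc distinction is easiest to see with string row labels; a quick sketch (labels and values are hypothetical):

```python
import pandas as pd

# Label-based vs integer-position-based selection
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6]},
    index=["x", "y", "z"],
)

by_label = df.loc["y", "b"]          # row labelled 'y', column 'b'
by_position = df.iloc[1, 1]          # second row, second column (same cell)
subset = df.loc[df["a"] > 1, ["b"]]  # rows where a > 1, column b only
```

Both selections point at the same cell here, since row 'y' is the second row.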
Data Manipulation
Pandas' methods make many data manipulation operations simple; we can concatenate strings or perform mathematical operations with ease. The drop method removes rows or columns by label, while drop_duplicates removes duplicate rows; both can also be combined with filtering. The replace method updates a column's values when a match is found, and the rename method changes a column's name. The optional inplace attribute, when True, modifies the DataFrame in place instead of returning a copy. axis indicates whether an operation applies to rows or columns, with 0 denoting rows and 1 denoting columns.
dataframe['new_column'] = dataframe['column1'] + dataframe['column2'] # Creating a new column by adding existing columns
dataframe.drop(['column1', 'column2'], axis=1, inplace=True) # Dropping columns
dataframe.drop_duplicates(subset=['column1', 'column2'], inplace=True) # Dropping duplicates
dataframe.rename(columns={'old_name': 'new_name'}, inplace=True) # Renaming columns
dataframe.replace(to_replace='old_value', value='new_value', inplace=True) # Replacing values
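The text above notes that drop can be combined with filtering; one common pattern is to pass the index labels of the rows that match a condition. A small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "a"], "qty": [1, 5, 1]})

# Drop rows by condition: pass the matching index labels to drop
df_small = df.drop(df[df["qty"] > 3].index)

# Drop duplicate rows, considering all columns
df_unique = df.drop_duplicates()
```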
You can filter a DataFrame with as many conditions as necessary, either through methods or conditional operators. The isin method filters rows that match a list of values, and the ~ operator performs logical NOT in pandas.
dataframe[dataframe['column_name'] > value] # Filtering data based on a condition
dataframe[dataframe['column_name'].isin(['value1', 'value2'])] # Filtering data based on a list of values
dataframe[(dataframe['column1'] > value) & (dataframe['column2'] < value)] # Filtering data based on multiple conditions
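The ~ operator mentioned above is not shown in the snippets, so here is a minimal demo (column name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# Keep rows whose color is in the list
in_list = df[df["color"].isin(["red", "blue"])]

# ~ negates the boolean mask: rows whose color is NOT in the list
not_in_list = df[~df["color"].isin(["red", "blue"])]
```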
The powerful groupby method divides data into groups based on one or more columns; a function can then be applied to each group and the results combined. This is frequently used to aggregate data and compute summary statistics for subsets of the data. With groupby we can also transform the data within each group, filter groups based on particular criteria, and apply custom functions to each group. It is a core tool for data analysis and manipulation in pandas. Here are some examples of using the groupby method:
# Grouping the DataFrame by the 'category' column and finding the mean of the 'value' column for each group
dataframe.groupby('category')['value'].mean()
# Grouping the DataFrame by the 'category' and 'date' columns and finding the sum of the 'value' column for each group
dataframe.groupby(['category', 'date'])['value'].sum()
# Grouping the DataFrame by the 'category' column and finding the mean, sum, and count of the 'value' column for each group
dataframe.groupby('category')['value'].agg(['mean', 'sum', 'count'])
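The transform and filter operations mentioned above are not shown in the aggregation examples; a short sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value": [1, 3, 2, 4, 6],
})

# transform: broadcast each group's mean back onto the original rows
df["group_mean"] = df.groupby("category")["value"].transform("mean")

# filter: keep only the groups with more than two rows
big_groups = df.groupby("category").filter(lambda g: len(g) > 2)
```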
You can also apply a custom method to each group of Dataframe.
# Defining a custom function to apply to each group
def custom_function(group):
    # Example operation: range of the 'value' column within the group
    return group['value'].max() - group['value'].min()
# Grouping the DataFrame by the 'category' column and applying the custom function to each group
dataframe.groupby('category').apply(custom_function)
You can sort Dataframe with its pre-built methods.
dataframe.sort_values('column_name', ascending=True) # Sorting data by a single column
dataframe.sort_values(['column1', 'column2'], ascending=[True, False]) # Sorting data by multiple columns
dataframe.sort_index(ascending=False) # Sorting data by index
dataframe.sort_index(key=lambda x: x.str.lower()) # Sorting data if index is in string type
Merge, concat, and join with pandas can each be done with a single line of code. merge and join work similarly, except that join fixes several parameters to defaults: by default it performs a left join, matches on the index (or a specified column of the caller against the other frame's index), and fills in missing values with NaN. merge, available both as a DataFrame method and as the standalone pd.merge function, merges two DataFrames based on the values of one or more columns and provides more control over the merge process than join. concat allows you to concatenate two or more DataFrames (or Series) either along rows (axis=0) or columns (axis=1).
# Inner join
dataframe1.merge(dataframe2, on='column_name', how='inner')
dataframe1.join(dataframe2, how='inner', on='column_name') # 'on' matches dataframe1's column against dataframe2's index
# Left join
dataframe1.merge(dataframe2, on='column_name', how='left')
dataframe1.join(dataframe2, how='left', on='column_name')
# Right join
dataframe1.merge(dataframe2, on='column_name', how='right')
dataframe1.join(dataframe2, how='right', on='column_name')
# Full outer join
dataframe1.merge(dataframe2, on='column_name', how='outer')
dataframe1.join(dataframe2, how='outer', on='column_name')
# Combine two dataframes in either a horizontal or vertical direction
pd.concat([dataframe1, dataframe2], axis=0, join='outer', ignore_index=False)
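The two concat directions are easy to see side by side; a small sketch with invented frames:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})
df3 = pd.DataFrame({"b": [5, 6]})

# Vertical (axis=0): stack rows; ignore_index renumbers 0..n-1
tall = pd.concat([df1, df2], axis=0, ignore_index=True)

# Horizontal (axis=1): place columns side by side, aligned on the index
wide = pd.concat([df1, df3], axis=1)
```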
With pandas, you can perform many different string operations through the .str accessor, including lower, upper, strip, and replace. Here are a few examples:
dataframe['column_name'].str.lower() # Converting text to lowercase
dataframe['column_name'].str.upper() # Converting text to uppercase
dataframe['column_name'].str.strip() # Removing whitespace from text
dataframe['column_name'].str.replace('old_value', 'new_value') # Replacing text
dataframe['column_name'].str.split('delimiter', expand=True) # Splitting text into multiple columns
dataframe['column_name'].str.contains('matched_string', regex=False) # Finds match with matched string
dataframe['column_name'].str.extract(r'(\w+)', expand=False) # Extracts the first matching substring
dataframe['column_name'].str.len() # Length of the string
dataframe['column_name'].str.startswith('matched_string') # Checks for string which starts with matched string
dataframe['column_name'].str.endswith('matched_string') # Checks for string which end with matched string
dataframe['column_name1'].str.cat(dataframe['column_name2'], sep=' - ') # Concatenating two columns with a separator
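Several of these .str methods chain naturally; a quick runnable sketch (column name and values invented):

```python
import pandas as pd

df = pd.DataFrame({"name": ["  Alice  ", "BOB"]})

cleaned = df["name"].str.strip().str.lower()  # trim whitespace, then lowercase
lengths = cleaned.str.len()                   # length after cleaning
starts = cleaned.str.startswith("a")          # boolean mask per row
```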
Reshaping Data
Reshaping data in pandas means changing a DataFrame's shape through operations such as pivoting, melting, or stacking. This can help with organizing data, analyzing it, and preparing it for machine learning or visualization. The pivot method transforms a DataFrame from long to wide format by pivoting on one or more columns. The melt method does the reverse, unpivoting a DataFrame from wide to long format. The stack method stacks a DataFrame's columns together into a single column.
dataframe.pivot_table(values='value', index='index_column', columns='column_name') # Creating a pivot table
pd.melt(dataframe, id_vars='id_column', value_vars=['column1', 'column2']) # Melting data
dataframe.set_index(['column_name1', 'column_name2']).stack().reset_index(name='Value') # Stacking the columns into single column
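The long-to-wide and wide-to-long directions can be shown as a round trip; a sketch with invented data (dates, cities, and temperatures are placeholders):

```python
import pandas as pd

long_df = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "city": ["x", "y", "x", "y"],
    "temp": [10, 20, 11, 21],
})

# Long -> wide: one row per date, one column per city
wide = long_df.pivot_table(values="temp", index="date", columns="city")

# Wide -> long again with melt
back = wide.reset_index().melt(id_vars="date", var_name="city", value_name="temp")
```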
Handling Time Series Data
A time series is a collection of data recorded over time, widely used in fields like finance, economics, and weather forecasting. Pandas' built-in features make working with time series data straightforward. Several typical methods are:
dataframe['date_column'] = pd.to_datetime(dataframe['date_column']) # Converting a column to a datetime object
dataframe.set_index('date_column', inplace=True) # Setting the index to a datetime object
dataframe.resample('D').mean() # Resampling data by day and calculating the mean
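The three steps above chain together as follows; a minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "date_column": ["2023-05-01", "2023-05-01", "2023-05-02"],
    "value": [10, 20, 30],
})

df["date_column"] = pd.to_datetime(df["date_column"])  # strings -> datetime
df.set_index("date_column", inplace=True)              # datetime index
daily = df.resample("D").mean()                        # one row per day
```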
In conclusion, pandas is a very useful and versatile Python package for data processing and analysis. It includes effective capabilities for working with tabular data, such as reading and writing data in different formats, cleaning and transforming data, selecting and filtering data, aggregating and summarizing data, and visualizing data. Pandas has been a popular choice for data scientists, analysts, and researchers across many sectors because of its broad feature set, and it continues to expand and improve with each new release. If you work with data in Python, you should be familiar with the pandas library.