Pandas cheat sheet
Purpose |
Code |
Use pandas in a code file |
import pandas as pd |
Read CSV |
df = pd.read('my_data_file.csv') |
Write to CSV |
df = pd.tocsv('my_new_data_file.csv') |
Read Excel |
df = pd.read_excel('my_data_file.xlsx', 'sheet1') |
Write to Excel |
df = pd.to_excel('my_new_data_file', sheet_name='sheet1') |
Skip n rows on read |
df = pd.read('my_data_file.csv', skiprows=1) |
Number of rows/cols |
df.shape |
Top n rows |
df.head(5) |
Column labels |
df.columns |
Column labels and data types |
df.info(verbose=True) |
Drop a named column |
df.drop(['Source for Re-opening'], axis=1) |
Drop numbered columns |
df.drop(df.columns[[40, 41]], axis=1) |
Find missing values |
df.isnull() and df.isna() |
Drop null values |
df.dropna() |
Replace missing values with fillna |
df.fillna({'Column A': 0, 'Column B': 99, 'Column C': df['Column C'].mean()}) , df.fillna(method='ffill', axis=1) , df.fillna(value=5) |
Replace missing values with a particular value e.g. mean |
mean = df['col1'].mean() then df['col1'].fillna(mean) |
Filter rows |
Rows where the value in col1 is greater than 10df[df['col1']>10] |
Modify and replace the contents of the dataframe |
Use inplace=True e.g. df = |
Find unique values |
df['col1'].unique() |
Number of occurrences of a value |
df['col1'].value_counts() |
Identify duplicates |
df.duplicated() |
Drop duplicates |
df.drop_duplicates(inplace=True) , df.drop_duplicates(subset=['col2'], inplace=True) , df.drop_duplicates(subset=['col2'], keep='last', inplace=True) |
Remove blank spaces |
df['col1'].str.strip() |
Merge multiple data frames |
pd.merge(df1, df2, on='name_of_common_column', how='outer' (options for how= include left, right and inner)` |
Concatenate multiple data frames |
python pd.concat([df1, df2]) |
Other functions not covered in the 'How to' guides but that you might find useful to investigate:
Purpose |
Code |
Splitting data into groups; applying functions to groups; and combining results into a data structure |
df.groupby('column1').sum() , df2 = df.groupby(['col1','col2']).count() |