Pandas DataFrames#
pandas is a very popular Python library that allows data to be loaded as a DataFrame.
The library makes many operations on the data easy to perform, such as calculating the mean or the sum of a column, transposing the DataFrame, aggregating values, and so on. It is also widely used for statistical data analysis.
It is very well documented; more information and tutorials about the package can be found here.
For simplicity, the convention is to import pandas as pd, but in principle it can be given any alias, although it is always a good idea to stick to the accepted conventions to ensure code readability.
pandas allows data to be loaded from many different formats and has many functions implemented for this purpose.
Here, we will load a .csv file, but you can also load Excel, txt, and JSON data, as well as many other formats.
1. Loading and inspecting data#
import pandas as pd
# if you have problems loading the data, provide the full path
# add the path to where you saved your packed_column.csv here
# the path should look like this: C:/Users/my_user/Desktop/week02/theory/
#path =
#packed_column = pd.read_csv(f'{path}packed_column.csv', encoding='utf-8', sep=';')
# use read_csv to load data from a .csv file
# here you can see that we provide the encoding and a separator
packed_column = pd.read_csv('packed_column.csv', encoding='utf-8', sep=';')
packed_column
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
0 | small | 15 | 1 | 2 | 3 | 3 |
1 | small | 30 | 2 | 2 | 5 | 3 |
2 | small | 50 | 2 | 3 | 6 | 5 |
3 | small | 80 | 3 | 4 | 9 | 9 |
4 | small | 100 | 3 | 5 | 13 | 13 |
5 | big | 40 | 5 | 9 | 14 | 25 |
6 | big | 50 | 7 | 14 | 22 | 39 |
7 | big | 60 | 10 | 19 | 32 | 161 |
8 | big | 70 | 13 | 27 | 46 | 224 |
9 | big | 80 | 15 | 34 | 57 | 0 |
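The readers for the other formats mentioned in the introduction follow the same pattern as read_csv. Below is a minimal sketch, commented out because only packed_column.csv is actually provided with this notebook; the other file names are hypothetical.
# hypothetical files, shown only to illustrate the reader functions
#packed_column_xlsx = pd.read_excel('packed_column.xlsx')
#packed_column_json = pd.read_json('packed_column.json')
#packed_column_txt = pd.read_csv('packed_column.txt', sep='\t')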
Above we loaded our DataFrame from a csv file, but what actually is a DataFrame? It is basically just a matrix (or a table, if you prefer thinking of Excel sheets) with rows and columns.
Let’s now see some pandas operations.
# inspect the top 5 elements in the dataframe
packed_column.head()
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
0 | small | 15 | 1 | 2 | 3 | 3 |
1 | small | 30 | 2 | 2 | 5 | 3 |
2 | small | 50 | 2 | 3 | 6 | 5 |
3 | small | 80 | 3 | 4 | 9 | 9 |
4 | small | 100 | 3 | 5 | 13 | 13 |
# inspect the top 10 elements in the dataframe
packed_column.head(10)
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
0 | small | 15 | 1 | 2 | 3 | 3 |
1 | small | 30 | 2 | 2 | 5 | 3 |
2 | small | 50 | 2 | 3 | 6 | 5 |
3 | small | 80 | 3 | 4 | 9 | 9 |
4 | small | 100 | 3 | 5 | 13 | 13 |
5 | big | 40 | 5 | 9 | 14 | 25 |
6 | big | 50 | 7 | 14 | 22 | 39 |
7 | big | 60 | 10 | 19 | 32 | 161 |
8 | big | 70 | 13 | 27 | 46 | 224 |
9 | big | 80 | 15 | 34 | 57 | 0 |
# inspect the bottom 5 elements in the dataframe
packed_column.tail()
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
5 | big | 40 | 5 | 9 | 14 | 25 |
6 | big | 50 | 7 | 14 | 22 | 39 |
7 | big | 60 | 10 | 19 | 32 | 161 |
8 | big | 70 | 13 | 27 | 46 | 224 |
9 | big | 80 | 15 | 34 | 57 | 0 |
# print the columns
packed_column.columns
Index(['flow type', 'water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h',
'300 kg/h'],
dtype='object')
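Besides the column names, two other standard pandas attributes are handy for a quick inspection (they are not used elsewhere in this notebook): shape gives the dimensions and dtypes the type of each column. In a notebook, put each in its own cell to see its output.
# (number of rows, number of columns)
packed_column.shape
# data type of each column
packed_column.dtypes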
# calculate the mean of each column
packed_column[['water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h', '300 kg/h']].mean()
water flow / air flow 57.5
0 kg/h 6.1
100 kg/h 11.9
200 kg/h 20.7
300 kg/h 48.2
dtype: float64
packed_column[['water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h', '300 kg/h']]
| | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|
0 | 15 | 1 | 2 | 3 | 3 |
1 | 30 | 2 | 2 | 5 | 3 |
2 | 50 | 2 | 3 | 6 | 5 |
3 | 80 | 3 | 4 | 9 | 9 |
4 | 100 | 3 | 5 | 13 | 13 |
5 | 40 | 5 | 9 | 14 | 25 |
6 | 50 | 7 | 14 | 22 | 39 |
7 | 60 | 10 | 19 | 32 | 161 |
8 | 70 | 13 | 27 | 46 | 224 |
9 | 80 | 15 | 34 | 57 | 0 |
# calculate the sum of each column
packed_column.sum()
flow type smallsmallsmallsmallsmallbigbigbigbigbig
water flow / air flow 575
0 kg/h 61
100 kg/h 119
200 kg/h 207
300 kg/h 482
dtype: object
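Note that summing the 'flow type' column concatenates its strings, which is why the first line of the output above looks odd. If you only want the numeric columns, you can restrict the sum; a small sketch:
# sum only the numeric columns, leaving out 'flow type'
packed_column.sum(numeric_only=True)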
Note: describe is a very useful function to remember!
# 'describe' returns many useful statistics about the data
packed_column.describe()
| | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|
count | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 |
mean | 57.500000 | 6.100000 | 11.900000 | 20.700000 | 48.200000 |
std | 25.739075 | 4.976612 | 11.376877 | 18.571484 | 78.389058 |
min | 15.000000 | 1.000000 | 2.000000 | 3.000000 | 0.000000 |
25% | 42.500000 | 2.250000 | 3.250000 | 6.750000 | 3.500000 |
50% | 55.000000 | 4.000000 | 7.000000 | 13.500000 | 11.000000 |
75% | 77.500000 | 9.250000 | 17.750000 | 29.500000 | 35.500000 |
max | 100.000000 | 15.000000 | 34.000000 | 57.000000 | 224.000000 |
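By default, describe only summarises the numeric columns. If you also want counts and the most frequent value of the categorical 'flow type' column, you can pass include='all'; a small sketch:
# include the non-numeric columns in the summary as well
packed_column.describe(include='all')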
# transpose a dataframe
packed_column.T
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
flow type | small | small | small | small | small | big | big | big | big | big |
water flow / air flow | 15 | 30 | 50 | 80 | 100 | 40 | 50 | 60 | 70 | 80 |
0 kg/h | 1 | 2 | 2 | 3 | 3 | 5 | 7 | 10 | 13 | 15 |
100 kg/h | 2 | 2 | 3 | 4 | 5 | 9 | 14 | 19 | 27 | 34 |
200 kg/h | 3 | 5 | 6 | 9 | 13 | 14 | 22 | 32 | 46 | 57 |
300 kg/h | 3 | 3 | 5 | 9 | 13 | 25 | 39 | 161 | 224 | 0 |
# sum each column of the transposed dataframe (one column per original row), after dropping the categorical 'flow type' row
packed_column.T[1:].sum()
0 24
1 42
2 66
3 105
4 134
5 93
6 132
7 282
8 380
9 186
dtype: object
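Transposing first is one way to obtain per-row totals; the same result can be computed directly by summing along the columns. A small sketch using the axis and numeric_only arguments:
# sum across the columns (axis=1) for every row, skipping the categorical column
packed_column.sum(axis=1, numeric_only=True)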
2. Indexing#
It is also possible to index a DataFrame to update a value or to select a subset of rows and columns to work with.
We will use iloc for this purpose.
# replace the value of row 3 column 4 with None
# remember that in Python we start counting from 0!
packed_column.iloc[3, 4] = None
packed_column
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
0 | small | 15 | 1 | 2 | 3.0 | 3 |
1 | small | 30 | 2 | 2 | 5.0 | 3 |
2 | small | 50 | 2 | 3 | 6.0 | 5 |
3 | small | 80 | 3 | 4 | NaN | 9 |
4 | small | 100 | 3 | 5 | 13.0 | 13 |
5 | big | 40 | 5 | 9 | 14.0 | 25 |
6 | big | 50 | 7 | 14 | 22.0 | 39 |
7 | big | 60 | 10 | 19 | 32.0 | 161 |
8 | big | 70 | 13 | 27 | 46.0 | 224 |
9 | big | 80 | 15 | 34 | 57.0 | 0 |
# select the first row
packed_column.iloc[0]
flow type small
water flow / air flow 15
0 kg/h 1
100 kg/h 2
200 kg/h 3.0
300 kg/h 3
Name: 0, dtype: object
# select a specific column
packed_column['water flow / air flow']
0 15
1 30
2 50
3 80
4 100
5 40
6 50
7 60
8 70
9 80
Name: water flow / air flow, dtype: int64
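iloc also accepts slices, so you can select a block of rows and columns by position in one go, and loc does the same with labels and boolean conditions. A minimal sketch (the rows and columns chosen here are just an example):
# rows 0-2 and columns 1-2, selected by position
packed_column.iloc[0:3, 1:3]
# rows where the flow type is 'big', with two columns selected by label
packed_column.loc[packed_column['flow type'] == 'big', ['water flow / air flow', '0 kg/h']]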
3. Replacing missing values#
A very important data pre-processing task is imputing missing values. Depending on the data available and the task to perform, different strategies might be more suitable.
The most common strategies are:
  * replacing the missing values with a specific value (often 0, but it can be any value)
  * replacing the missing values with the mean of the column
  * dropping (discarding) the rows with missing values.
# fill missing values with 0
packed_column.fillna(0)
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
0 | small | 15 | 1 | 2 | 3.0 | 3 |
1 | small | 30 | 2 | 2 | 5.0 | 3 |
2 | small | 50 | 2 | 3 | 6.0 | 5 |
3 | small | 80 | 3 | 4 | 0.0 | 9 |
4 | small | 100 | 3 | 5 | 13.0 | 13 |
5 | big | 40 | 5 | 9 | 14.0 | 25 |
6 | big | 50 | 7 | 14 | 22.0 | 39 |
7 | big | 60 | 10 | 19 | 32.0 | 161 |
8 | big | 70 | 13 | 27 | 46.0 | 224 |
9 | big | 80 | 15 | 34 | 57.0 | 0 |
# dropping missing values
packed_column.dropna()
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
0 | small | 15 | 1 | 2 | 3.0 | 3 |
1 | small | 30 | 2 | 2 | 5.0 | 3 |
2 | small | 50 | 2 | 3 | 6.0 | 5 |
4 | small | 100 | 3 | 5 | 13.0 | 13 |
5 | big | 40 | 5 | 9 | 14.0 | 25 |
6 | big | 50 | 7 | 14 | 22.0 | 39 |
7 | big | 60 | 10 | 19 | 32.0 | 161 |
8 | big | 70 | 13 | 27 | 46.0 | 224 |
9 | big | 80 | 15 | 34 | 57.0 | 0 |
# fill missing values with the mean of a specific column
packed_column.fillna(packed_column['200 kg/h'].mean())
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
0 | small | 15 | 1 | 2 | 3.0 | 3 |
1 | small | 30 | 2 | 2 | 5.0 | 3 |
2 | small | 50 | 2 | 3 | 6.0 | 5 |
3 | small | 80 | 3 | 4 | 22.0 | 9 |
4 | small | 100 | 3 | 5 | 13.0 | 13 |
5 | big | 40 | 5 | 9 | 14.0 | 25 |
6 | big | 50 | 7 | 14 | 22.0 | 39 |
7 | big | 60 | 10 | 19 | 32.0 | 161 |
8 | big | 70 | 13 | 27 | 46.0 | 224 |
9 | big | 80 | 15 | 34 | 57.0 | 0 |
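The call above fills every missing value with the mean of the '200 kg/h' column. If you prefer each column to be filled with its own mean, you can pass a Series of per-column means instead. Also note that fillna and dropna return a new DataFrame; packed_column itself is unchanged unless you assign the result. A small sketch:
# fill each numeric column with that column's own mean and keep the result
filled = packed_column.fillna(packed_column.mean(numeric_only=True))
filled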
4. Creating DataFrames from lists and dictionaries#
DataFrames can be created from many different objects: they can be loaded from saved files, as we saw before, but they can also be created from lists and dictionaries, for example.
# create a df from lists
df = pd.DataFrame()
df['column 1'] = list(range(10))
df['column 2'] = list(range(2, 21, 2))
df.head()
| | column 1 | column 2 |
---|---|---|
0 | 0 | 2 |
1 | 1 | 4 |
2 | 2 | 6 |
3 | 3 | 8 |
4 | 4 | 10 |
# create a df from a list of lists
observations = [[3, 57, 9], [1, 56, 3], [3, 47, 0]]
observations_df = pd.DataFrame(observations)
observations_df
| | 0 | 1 | 2 |
---|---|---|---|
0 | 3 | 57 | 9 |
1 | 1 | 56 | 3 |
2 | 3 | 47 | 0 |
Note that you usually want your row and column names to be descriptive (not just the numeric index of the row/column); this will also help you index and perform other operations.
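For example, the same list of lists can be given descriptive column and row names when the DataFrame is created; the names below are made up purely for illustration:
# hypothetical column and row names, just to illustrate the idea
observations_df = pd.DataFrame(observations,
                               columns=['replicate', 'temperature', 'reading'],
                               index=['run 1', 'run 2', 'run 3'])
observations_df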
# create a df from a dictionary
dict_conditions = {
"temperature": 30,
"pH": 7.0,
"pressure": 1.0
}
df = pd.DataFrame.from_dict(dict_conditions, orient='index')
df
| | 0 |
---|---|
temperature | 30.0 |
pH | 7.0 |
pressure | 1.0 |
# you can reset the index to have your keys as a column instead of the index
df = df.reset_index()
df
| | index | 0 |
---|---|---|
0 | temperature | 30.0 |
1 | pH | 7.0 |
2 | pressure | 1.0 |
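The resulting columns are called 'index' and 0, which is not very descriptive; you can rename them, for instance like this (the new names are chosen here just as an example):
# rename the automatically generated columns to something more descriptive
df = df.rename(columns={'index': 'condition', 0: 'value'})
df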
5. Groupby#
A very useful operation in pandas is to group the dataframe by a certain value and combine the rows that share that value according to a specified logic. Read the official documentation here.
# group the dataframe by flow type and calculate the mean of each numeric column
packed_column.groupby(['flow type']).mean()
flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|
big | 60.0 | 10.0 | 20.6 | 34.20 | 89.8 |
small | 55.0 | 2.2 | 3.2 | 6.75 | 6.6 |
We can of course store the aggregated dataframe in a new variable.
aggr_packed_column = packed_column.groupby(['flow type']).mean()
aggr_packed_column
flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|
big | 60.0 | 10.0 | 20.6 | 34.20 | 89.8 |
small | 55.0 | 2.2 | 3.2 | 6.75 | 6.6 |
What if we want different metrics in different columns?
aggr_packed_column = packed_column.groupby(['flow type']).agg({'water flow / air flow': 'sum',
'0 kg/h': list,
'100 kg/h': 'mean',
'200 kg/h': 'max',
'300 kg/h': 'min'})
aggr_packed_column
flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|
big | 300 | [5, 7, 10, 13, 15] | 20.6 | 57.0 | 0 |
small | 275 | [1, 2, 2, 3, 3] | 3.2 | 13.0 | 3 |
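Grouping also gives quick access to other summaries, for example how many rows fall into each group. A small sketch:
# number of rows in each flow type group
packed_column.groupby('flow type').size()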
6. Handling duplicates#
A common step in data pre-processing is handling duplicates. Depending on the objective, there are multiple ways to handle duplicate rows: they can be aggregated, removed and so on.
Here we will look into how to check whether there are any duplicated rows or values in a column and how to drop these values.
# check if there are duplicates in the dataframe
packed_column.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
# check if there are duplicates in a specific column
packed_column.duplicated(subset=['flow type'])
0 False
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 True
dtype: bool
# drop any duplicates - here the whole row needs to be the same to be dropped
packed_column.drop_duplicates()
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
0 | small | 15 | 1 | 2 | 3.0 | 3 |
1 | small | 30 | 2 | 2 | 5.0 | 3 |
2 | small | 50 | 2 | 3 | 6.0 | 5 |
3 | small | 80 | 3 | 4 | NaN | 9 |
4 | small | 100 | 3 | 5 | 13.0 | 13 |
5 | big | 40 | 5 | 9 | 14.0 | 25 |
6 | big | 50 | 7 | 14 | 22.0 | 39 |
7 | big | 60 | 10 | 19 | 32.0 | 161 |
8 | big | 70 | 13 | 27 | 46.0 | 224 |
9 | big | 80 | 15 | 34 | 57.0 | 0 |
# drop any duplicates in a specified column and keep the last occurrence
packed_column.drop_duplicates(subset=['flow type'], keep='last')
| | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
---|---|---|---|---|---|---|
4 | small | 100 | 3 | 5 | 13.0 | 13 |
9 | big | 80 | 15 | 34 | 57.0 | 0 |
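If you first want to know how many duplicates there are before deciding how to handle them, you can combine duplicated with sum, or count the occurrences of each value directly. A small sketch:
# total number of duplicated rows in the 'flow type' column
packed_column.duplicated(subset=['flow type']).sum()
# how many rows each flow type has
packed_column['flow type'].value_counts()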