Pandas DataFrames#

pandas is a very popular Python library that allows data to be loaded as a DataFrame. The library makes many operations easy to perform on the data, such as calculating the mean or the sum of a column, transposing the DataFrame, aggregating values, and so on. Moreover, it is widely used for statistical data analysis.

It is very well documented; more information and tutorials about the package can be found here.

For simplicity, the convention is to import pandas as pd. In principle, the alias can be anything, but it is always a good idea to stick to the accepted conventions to ensure code readability.

pandas allows data to be loaded from many different formats and has a dedicated reader function for each of them. Here, we will load a .csv file, but you can load Excel, txt, and JSON data, as well as many other formats.
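The readers for these other formats follow the same pattern as read_csv, which we will use below; the file names in this sketch are only placeholders for illustration.

# once pandas is imported, other formats are read in the same way as .csv files
# (the file names below are placeholders - replace them with your own files)
# df_excel = pd.read_excel('my_data.xlsx')
# df_json = pd.read_json('my_data.json')
# df_txt = pd.read_csv('my_data.txt', sep='\t')  # plain text files are often tab- or comma-separated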

1. Loading and inspecting data#

import pandas as pd

# if you have problems loading the data, provide the full path
# add the path to where you saved your packed_column.csv here
# the path should look like this: C:/Users/my_user/Desktop/week02/theory/
#path = 
#packed_column = pd.read_csv(f'{path}packed_column.csv', encoding='utf-8', sep=';')

# use read_csv to load data from a .csv file
# here you can see that we provide the encoding and a separator
packed_column = pd.read_csv('packed_column.csv', encoding='utf-8', sep=';')
packed_column
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3 | 3 |
| 1 | small | 30 | 2 | 2 | 5 | 3 |
| 2 | small | 50 | 2 | 3 | 6 | 5 |
| 3 | small | 80 | 3 | 4 | 9 | 9 |
| 4 | small | 100 | 3 | 5 | 13 | 13 |
| 5 | big | 40 | 5 | 9 | 14 | 25 |
| 6 | big | 50 | 7 | 14 | 22 | 39 |
| 7 | big | 60 | 10 | 19 | 32 | 161 |
| 8 | big | 70 | 13 | 27 | 46 | 224 |
| 9 | big | 80 | 15 | 34 | 57 | 0 |

In the cell above we loaded our DataFrame from a csv file, but what actually is a DataFrame? It is basically just a matrix (or a table, if you prefer to think in terms of Excel sheets) with rows and columns.
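For example, you can quickly check the size and structure of the table we just loaded:

# a DataFrame is a table with rows and columns
print(packed_column.shape)   # (10, 6): 10 rows and 6 columns
print(packed_column.dtypes)  # the data type of each column
print(packed_column.index)   # the row labels (here simply 0 to 9)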

Let’s now look at some common pandas operations.

# inspect the top 5 elements in the dataframe
packed_column.head()
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3 | 3 |
| 1 | small | 30 | 2 | 2 | 5 | 3 |
| 2 | small | 50 | 2 | 3 | 6 | 5 |
| 3 | small | 80 | 3 | 4 | 9 | 9 |
| 4 | small | 100 | 3 | 5 | 13 | 13 |
# inspect the top 10 elements in the dataframe
packed_column.head(10)
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3 | 3 |
| 1 | small | 30 | 2 | 2 | 5 | 3 |
| 2 | small | 50 | 2 | 3 | 6 | 5 |
| 3 | small | 80 | 3 | 4 | 9 | 9 |
| 4 | small | 100 | 3 | 5 | 13 | 13 |
| 5 | big | 40 | 5 | 9 | 14 | 25 |
| 6 | big | 50 | 7 | 14 | 22 | 39 |
| 7 | big | 60 | 10 | 19 | 32 | 161 |
| 8 | big | 70 | 13 | 27 | 46 | 224 |
| 9 | big | 80 | 15 | 34 | 57 | 0 |
# inspect the bottom 5 elements in the dataframe
packed_column.tail()
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 5 | big | 40 | 5 | 9 | 14 | 25 |
| 6 | big | 50 | 7 | 14 | 22 | 39 |
| 7 | big | 60 | 10 | 19 | 32 | 161 |
| 8 | big | 70 | 13 | 27 | 46 | 224 |
| 9 | big | 80 | 15 | 34 | 57 | 0 |
# print the columns
packed_column.columns
Index(['flow type', 'water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h',
       '300 kg/h'],
      dtype='object')
# calculate the mean of each column
packed_column[['water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h', '300 kg/h']].mean()
water flow / air flow    57.5
0 kg/h                    6.1
100 kg/h                 11.9
200 kg/h                 20.7
300 kg/h                 48.2
dtype: float64
# select a subset of columns (this is what we passed to mean above)
packed_column[['water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h', '300 kg/h']]
|   | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------------------|--------|----------|----------|----------|
| 0 | 15 | 1 | 2 | 3 | 3 |
| 1 | 30 | 2 | 2 | 5 | 3 |
| 2 | 50 | 2 | 3 | 6 | 5 |
| 3 | 80 | 3 | 4 | 9 | 9 |
| 4 | 100 | 3 | 5 | 13 | 13 |
| 5 | 40 | 5 | 9 | 14 | 25 |
| 6 | 50 | 7 | 14 | 22 | 39 |
| 7 | 60 | 10 | 19 | 32 | 161 |
| 8 | 70 | 13 | 27 | 46 | 224 |
| 9 | 80 | 15 | 34 | 57 | 0 |
# calculate the sum of each column
packed_column.sum()
flow type                smallsmallsmallsmallsmallbigbigbigbigbig
water flow / air flow                                         575
0 kg/h                                                         61
100 kg/h                                                      119
200 kg/h                                                      207
300 kg/h                                                      482
dtype: object
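Note that sum also "adds up" the text column flow type by concatenating its strings. If you only want the numeric columns, you can restrict the operation:

# restrict the sum to the numeric columns only
packed_column.sum(numeric_only=True)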

Note: describe is a very useful function to remember!

# 'describe' returns many useful statistics about the data
packed_column.describe()
|       | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|-------|-----------------------|--------|----------|----------|----------|
| count | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 |
| mean | 57.500000 | 6.100000 | 11.900000 | 20.700000 | 48.200000 |
| std | 25.739075 | 4.976612 | 11.376877 | 18.571484 | 78.389058 |
| min | 15.000000 | 1.000000 | 2.000000 | 3.000000 | 0.000000 |
| 25% | 42.500000 | 2.250000 | 3.250000 | 6.750000 | 3.500000 |
| 50% | 55.000000 | 4.000000 | 7.000000 | 13.500000 | 11.000000 |
| 75% | 77.500000 | 9.250000 | 17.750000 | 29.500000 | 35.500000 |
| max | 100.000000 | 15.000000 | 34.000000 | 57.000000 | 224.000000 |
# transpose a dataframe
packed_column.T
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| flow type | small | small | small | small | small | big | big | big | big | big |
| water flow / air flow | 15 | 30 | 50 | 80 | 100 | 40 | 50 | 60 | 70 | 80 |
| 0 kg/h | 1 | 2 | 2 | 3 | 3 | 5 | 7 | 10 | 13 | 15 |
| 100 kg/h | 2 | 2 | 3 | 4 | 5 | 9 | 14 | 19 | 27 | 34 |
| 200 kg/h | 3 | 5 | 6 | 9 | 13 | 14 | 22 | 32 | 46 | 57 |
| 300 kg/h | 3 | 3 | 5 | 9 | 13 | 25 | 39 | 161 | 224 | 0 |
# sum each column of the transposed dataframe (after removing the categorical 'flow type' row),
# i.e. the total of each original row
packed_column.T[1:].sum()
0     24
1     42
2     66
3    105
4    134
5     93
6    132
7    282
8    380
9    186
dtype: object
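If you only need these per-row totals, you can also compute them directly from the original DataFrame by summing across the columns, without transposing first:

# sum the numeric columns of each row
packed_column.sum(axis=1, numeric_only=True)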

2. Indexing#

It is also possible to index a DataFrame to update a value or select a subset of rows and columns we want to work with.

We will use the iloc indexer for this purpose.

# replace the value of row 3 column 4 with None
# remember that in Python we start counting from 0!
packed_column.iloc[3, 4] = None
packed_column
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 3 | small | 80 | 3 | 4 | NaN | 9 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
# select the first row
packed_column.iloc[0]
flow type                small
water flow / air flow       15
0 kg/h                       1
100 kg/h                     2
200 kg/h                   3.0
300 kg/h                     3
Name: 0, dtype: object
# select a specific column
packed_column['water flow / air flow']
0     15
1     30
2     50
3     80
4    100
5     40
6     50
7     60
8     70
9     80
Name: water flow / air flow, dtype: int64
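iloc also accepts slices, so you can select a block of rows and columns at once; pandas additionally provides the loc indexer for label-based selection. A short sketch:

# select the first three rows and the last two columns by position
packed_column.iloc[0:3, 4:6]

# select the same rows by label with loc (note that label slices include the end point)
packed_column.loc[0:2, ['water flow / air flow', '300 kg/h']]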

3. Replacing missing values#

A very important data pre-processing task is imputing missing values. Depending on the data available and the task to perform, different strategies might be more suitable.

The most common strategies are:

  • replacing the missing values with a specific value (often 0, but it can be any value)

  • replacing the missing values with the mean of the columns

  • dropping (discarding) the rows with missing values.

# fill missing values with 0
packed_column.fillna(0)
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 3 | small | 80 | 3 | 4 | 0.0 | 9 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
# dropping missing values
packed_column.dropna()
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
# fill missing values with the mean of the '200 kg/h' column (the only column with missing values)
packed_column.fillna(packed_column['200 kg/h'].mean())
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 3 | small | 80 | 3 | 4 | 22.0 | 9 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
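The cell above fills every missing value in the DataFrame with the mean of the 200 kg/h column, which works here because that is the only column with a missing value. If several columns contained missing values, you would typically fill each column with its own mean, for example:

# fill the missing values of each numeric column with that column's own mean
packed_column.fillna(packed_column.mean(numeric_only=True))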

4. Creating DataFrames from lists and dictionaries#

DataFrames can be created from many different objects: they can be loaded from saved files, as we saw before, but they can also be created from lists and dictionaries, for example.

# create a df from lists
df = pd.DataFrame()
df['column 1'] = list(range(10))
df['column 2'] = list(range(2, 21, 2))
df.head()
|   | column 1 | column 2 |
|---|----------|----------|
| 0 | 0 | 2 |
| 1 | 1 | 4 |
| 2 | 2 | 6 |
| 3 | 3 | 8 |
| 4 | 4 | 10 |
# create a df from a list of lists
observations = [[3, 57, 9], [1, 56, 3], [3, 47, 0]]
observations_df = pd.DataFrame(observations)
observations_df
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 3 | 57 | 9 |
| 1 | 1 | 56 | 3 |
| 2 | 3 | 47 | 0 |

Note that you usually want your row and column names to be descriptive (rather than just the default integer positions); this will also make indexing and other operations easier.
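For example, you can pass descriptive labels directly when creating the DataFrame; the column and row names below are made up just for illustration:

# same list of lists, but with descriptive (made-up) column and row names
observations_named = pd.DataFrame(observations,
                                  columns=['sample', 'temperature', 'impurities'],
                                  index=['run 1', 'run 2', 'run 3'])
observations_named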

# create a df from a dictionary
dict_conditions = {
    "temperature": 30,
    "pH": 7.0,
    "pressure": 1.0
}

df = pd.DataFrame.from_dict(dict_conditions, orient='index')
df
|             | 0 |
|-------------|---|
| temperature | 30.0 |
| pH | 7.0 |
| pressure | 1.0 |
# you can reset the index to have your keys as a column instead of the index
df = df.reset_index()
df
|   | index | 0 |
|---|-------|---|
| 0 | temperature | 30.0 |
| 1 | pH | 7.0 |
| 2 | pressure | 1.0 |
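If you want more descriptive column names than index and 0, you can rename them; the names below are just an example:

# rename the columns to something more descriptive (example names)
df = df.rename(columns={'index': 'condition', 0: 'value'})
df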

5. Groupby#

A very useful operation in pandas is grouping by a certain value across the DataFrame and combining the rows that share that value according to a specified logic. Read the official documentation here.

# group the dataframe by flow type and calculate the mean of each numeric column
packed_column.groupby(['flow type']).mean()
| flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|-----------|-----------------------|--------|----------|----------|----------|
| big | 60.0 | 10.0 | 20.6 | 34.20 | 89.8 |
| small | 55.0 | 2.2 | 3.2 | 6.75 | 6.6 |

We can of course store the aggregated dataframe in a new variable.

aggr_packed_column = packed_column.groupby(['flow type']).mean()
aggr_packed_column
| flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|-----------|-----------------------|--------|----------|----------|----------|
| big | 60.0 | 10.0 | 20.6 | 34.20 | 89.8 |
| small | 55.0 | 2.2 | 3.2 | 6.75 | 6.6 |

What if we want a different metric for each column?

aggr_packed_column = packed_column.groupby(['flow type']).agg({
    'water flow / air flow': 'sum',
    '0 kg/h': list,
    '100 kg/h': 'mean',
    '200 kg/h': 'max',
    '300 kg/h': 'min'
})
aggr_packed_column
| flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|-----------|-----------------------|--------|----------|----------|----------|
| big | 300 | [5, 7, 10, 13, 15] | 20.6 | 57.0 | 0 |
| small | 275 | [1, 2, 2, 3, 3] | 3.2 | 13.0 | 3 |

6. Handling duplicates#

A common step in data pre-processing is handling duplicates. Depending on the objective, there are multiple ways to handle duplicate rows: they can be aggregated, removed and so on.

Here we will look into how to check whether there are any duplicated rows, or duplicated values in a specific column, and how to drop them.

# check if there are duplicates in the dataframe
packed_column.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool
# check if there are duplicates in a specific column
packed_column.duplicated(subset=['flow type'])
0    False
1     True
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9     True
dtype: bool
# drop duplicates - a row is only dropped if the entire row is identical to another row
packed_column.drop_duplicates()
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 3 | small | 80 | 3 | 4 | NaN | 9 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
# drop any duplicates in a specified column and keep the last occurrence
packed_column.drop_duplicates(subset=['flow type'], keep='last')
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |