Pandas DataFrames#

pandas is a very popular Python library that allows data to be loaded as a DataFrame. The library makes many operations easy to perform on the data, such as calculating the mean or the sum of a column, transposing the DataFrame, aggregating values, and so on. Moreover, it is widely used for statistical data analysis.

It is very well documented; more information and tutorials about the package can be found here.

For simplicity, the convention is to import pandas as pd. In principle, the alias can be anything, but it is always a good idea to stick to the accepted conventions to ensure code readability.

pandas allows data to be loaded from many different formats and has a dedicated reader function for each of them. Here, we will load a .csv file, but you can load Excel, txt, and JSON data, as well as many other formats.
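The readers for these other formats follow the same pattern as read_csv, which we will use below; the file names in this sketch are only placeholders for illustration.

# once pandas is imported, other formats are read in the same way as .csv files
# (the file names below are placeholders - replace them with your own files)
# df_excel = pd.read_excel('my_data.xlsx')
# df_json = pd.read_json('my_data.json')
# df_txt = pd.read_csv('my_data.txt', sep='\t')  # plain text files are often tab- or comma-separated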

1. Loading and inspecting data#

import pandas as pd

# if you have problems loading the data, provide the full path
# add the path to where you saved your packed_column.csv here
# the path should look like this: C:/Users/my_user/Desktop/week02/theory/
#path = 
#packed_column = pd.read_csv(f'{path}packed_column.csv', encoding='utf-8', sep=';')

# use read_csv to load data from a .csv file
# here you can see that we provide the encoding and a separator
packed_column = pd.read_csv('packed_column.csv', encoding='utf-8', sep=';')
packed_column
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3 | 3 |
| 1 | small | 30 | 2 | 2 | 5 | 3 |
| 2 | small | 50 | 2 | 3 | 6 | 5 |
| 3 | small | 80 | 3 | 4 | 9 | 9 |
| 4 | small | 100 | 3 | 5 | 13 | 13 |
| 5 | big | 40 | 5 | 9 | 14 | 25 |
| 6 | big | 50 | 7 | 14 | 22 | 39 |
| 7 | big | 60 | 10 | 19 | 32 | 161 |
| 8 | big | 70 | 13 | 27 | 46 | 224 |
| 9 | big | 80 | 15 | 34 | 57 | 0 |

In the cell above we loaded our DataFrame from a csv file, but what actually is a DataFrame? It is basically just a matrix (or a table, if you prefer to think in terms of Excel sheets) with rows and columns.
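For example, you can quickly check the size and structure of the table we just loaded:

# a DataFrame is a table with rows and columns
print(packed_column.shape)   # (10, 6): 10 rows and 6 columns
print(packed_column.dtypes)  # the data type of each column
print(packed_column.index)   # the row labels (here simply 0 to 9)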

Let’s now look at some common pandas operations.

# inspect the top 5 elements in the dataframe
packed_column.head()
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3 | 3 |
| 1 | small | 30 | 2 | 2 | 5 | 3 |
| 2 | small | 50 | 2 | 3 | 6 | 5 |
| 3 | small | 80 | 3 | 4 | 9 | 9 |
| 4 | small | 100 | 3 | 5 | 13 | 13 |
# inspect the top 10 elements in the dataframe
packed_column.head(10)
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3 | 3 |
| 1 | small | 30 | 2 | 2 | 5 | 3 |
| 2 | small | 50 | 2 | 3 | 6 | 5 |
| 3 | small | 80 | 3 | 4 | 9 | 9 |
| 4 | small | 100 | 3 | 5 | 13 | 13 |
| 5 | big | 40 | 5 | 9 | 14 | 25 |
| 6 | big | 50 | 7 | 14 | 22 | 39 |
| 7 | big | 60 | 10 | 19 | 32 | 161 |
| 8 | big | 70 | 13 | 27 | 46 | 224 |
| 9 | big | 80 | 15 | 34 | 57 | 0 |
# inspect the bottom 5 elements in the dataframe
packed_column.tail()
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 5 | big | 40 | 5 | 9 | 14 | 25 |
| 6 | big | 50 | 7 | 14 | 22 | 39 |
| 7 | big | 60 | 10 | 19 | 32 | 161 |
| 8 | big | 70 | 13 | 27 | 46 | 224 |
| 9 | big | 80 | 15 | 34 | 57 | 0 |
# print the columns
packed_column.columns
Index(['flow type', 'water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h',
       '300 kg/h'],
      dtype='object')
# calculate the mean of each column
packed_column[['water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h', '300 kg/h']].mean()
water flow / air flow    57.5
0 kg/h                    6.1
100 kg/h                 11.9
200 kg/h                 20.7
300 kg/h                 48.2
dtype: float64
# select a subset of columns (this is what we passed to mean above)
packed_column[['water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h', '300 kg/h']]
|   | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------------------|--------|----------|----------|----------|
| 0 | 15 | 1 | 2 | 3 | 3 |
| 1 | 30 | 2 | 2 | 5 | 3 |
| 2 | 50 | 2 | 3 | 6 | 5 |
| 3 | 80 | 3 | 4 | 9 | 9 |
| 4 | 100 | 3 | 5 | 13 | 13 |
| 5 | 40 | 5 | 9 | 14 | 25 |
| 6 | 50 | 7 | 14 | 22 | 39 |
| 7 | 60 | 10 | 19 | 32 | 161 |
| 8 | 70 | 13 | 27 | 46 | 224 |
| 9 | 80 | 15 | 34 | 57 | 0 |
# calculate the sum of each column
packed_column.sum()
flow type                smallsmallsmallsmallsmallbigbigbigbigbig
water flow / air flow                                         575
0 kg/h                                                         61
100 kg/h                                                      119
200 kg/h                                                      207
300 kg/h                                                      482
dtype: object
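Note that sum also "adds up" the text column flow type by concatenating its strings. If you only want the numeric columns, you can restrict the operation:

# restrict the sum to the numeric columns only
packed_column.sum(numeric_only=True)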

Note: describe is a very useful function to remember!

# 'describe' returns many useful statistics about the data
packed_column.describe()
|       | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|-------|-----------------------|--------|----------|----------|----------|
| count | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 |
| mean | 57.500000 | 6.100000 | 11.900000 | 20.700000 | 48.200000 |
| std | 25.739075 | 4.976612 | 11.376877 | 18.571484 | 78.389058 |
| min | 15.000000 | 1.000000 | 2.000000 | 3.000000 | 0.000000 |
| 25% | 42.500000 | 2.250000 | 3.250000 | 6.750000 | 3.500000 |
| 50% | 55.000000 | 4.000000 | 7.000000 | 13.500000 | 11.000000 |
| 75% | 77.500000 | 9.250000 | 17.750000 | 29.500000 | 35.500000 |
| max | 100.000000 | 15.000000 | 34.000000 | 57.000000 | 224.000000 |
# transpose a dataframe
packed_column.T
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| flow type | small | small | small | small | small | big | big | big | big | big |
| water flow / air flow | 15 | 30 | 50 | 80 | 100 | 40 | 50 | 60 | 70 | 80 |
| 0 kg/h | 1 | 2 | 2 | 3 | 3 | 5 | 7 | 10 | 13 | 15 |
| 100 kg/h | 2 | 2 | 3 | 4 | 5 | 9 | 14 | 19 | 27 | 34 |
| 200 kg/h | 3 | 5 | 6 | 9 | 13 | 14 | 22 | 32 | 46 | 57 |
| 300 kg/h | 3 | 3 | 5 | 9 | 13 | 25 | 39 | 161 | 224 | 0 |
# sum each column of the transposed dataframe (after removing the categorical 'flow type' row),
# i.e. the total of each original row
packed_column.T[1:].sum()
0     24
1     42
2     66
3    105
4    134
5     93
6    132
7    282
8    380
9    186
dtype: object
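If you only need these per-row totals, you can also compute them directly from the original DataFrame by summing across the columns, without transposing first:

# sum the numeric columns of each row
packed_column.sum(axis=1, numeric_only=True)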

2. Indexing#

It is also possible to index a DataFrame to update a value or select a subset of rows and columns we want to work with.

We will use the iloc indexer for this purpose.

# replace the value of row 3 column 4 with None
# remember that in Python we start counting from 0!
packed_column.iloc[3, 4] = None
packed_column
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 3 | small | 80 | 3 | 4 | NaN | 9 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
# select the first row
packed_column.iloc[0]
flow type                small
water flow / air flow       15
0 kg/h                       1
100 kg/h                     2
200 kg/h                   3.0
300 kg/h                     3
Name: 0, dtype: object
# select a specific column
packed_column['water flow / air flow']
0     15
1     30
2     50
3     80
4    100
5     40
6     50
7     60
8     70
9     80
Name: water flow / air flow, dtype: int64
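iloc also accepts slices, so you can select a block of rows and columns at once; pandas additionally provides the loc indexer for label-based selection. A short sketch:

# select the first three rows and the last two columns by position
packed_column.iloc[0:3, 4:6]

# select the same rows by label with loc (note that label slices include the end point)
packed_column.loc[0:2, ['water flow / air flow', '300 kg/h']]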

3. Replacing missing values#

A very important data pre-processing task is imputing missing values. Depending on the data available and the task to perform, different strategies might be more suitable.

The most common strategies are:

  • replacing the missing values with a specific value (often 0, but it can be any value)

  • replacing the missing values with the mean of the columns

  • dropping (discarding) the rows with missing values.

# fill missing values with 0
packed_column.fillna(0)
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 3 | small | 80 | 3 | 4 | 0.0 | 9 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
# dropping missing values
packed_column.dropna()
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
# fill missing values with the mean of the '200 kg/h' column (the only column with missing values)
packed_column.fillna(packed_column['200 kg/h'].mean())
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 3 | small | 80 | 3 | 4 | 22.0 | 9 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
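The cell above fills every missing value in the DataFrame with the mean of the 200 kg/h column, which works here because that is the only column with a missing value. If several columns contained missing values, you would typically fill each column with its own mean, for example:

# fill the missing values of each numeric column with that column's own mean
packed_column.fillna(packed_column.mean(numeric_only=True))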

4. Creating DataFrames from lists and dictionaries#

DataFrames can be created from many different objects: they can be loaded from saved files, as we saw before, but they can also be created from lists and dictionaries, for example.

# create a df from lists
df = pd.DataFrame()
df['column 1'] = list(range(10))
df['column 2'] = list(range(2, 21, 2))
df.head()
|   | column 1 | column 2 |
|---|----------|----------|
| 0 | 0 | 2 |
| 1 | 1 | 4 |
| 2 | 2 | 6 |
| 3 | 3 | 8 |
| 4 | 4 | 10 |
# create a df from a list of lists
observations = [[3, 57, 9], [1, 56, 3], [3, 47, 0]]
observations_df = pd.DataFrame(observations)
observations_df
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 3 | 57 | 9 |
| 1 | 1 | 56 | 3 |
| 2 | 3 | 47 | 0 |

Note that you usually want your row and column names to be descriptive (rather than just the default integer positions); this will also make indexing and other operations easier.
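For example, you can pass descriptive labels directly when creating the DataFrame; the column and row names below are made up just for illustration:

# same list of lists, but with descriptive (made-up) column and row names
observations_named = pd.DataFrame(observations,
                                  columns=['sample', 'temperature', 'impurities'],
                                  index=['run 1', 'run 2', 'run 3'])
observations_named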

# create a df from a dictionary
dict_conditions = {
    "temperature": 30,
    "pH": 7.0,
    "pressure": 1.0
}

df = pd.DataFrame.from_dict(dict_conditions, orient='index')
df
|             | 0 |
|-------------|---|
| temperature | 30.0 |
| pH | 7.0 |
| pressure | 1.0 |
# you can reset the index to have your keys as a column instead of the index
df = df.reset_index()
df
|   | index | 0 |
|---|-------|---|
| 0 | temperature | 30.0 |
| 1 | pH | 7.0 |
| 2 | pressure | 1.0 |
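If you want more descriptive column names than index and 0, you can rename them; the names below are just an example:

# rename the columns to something more descriptive (example names)
df = df.rename(columns={'index': 'condition', 0: 'value'})
df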

5. Groupby#

A very useful operation in pandas is grouping by a certain value across the DataFrame and combining the rows that share that value according to a specified logic. Read the official documentation here.

# group the dataframe by flow type and calculate the mean of each numeric column
packed_column.groupby(['flow type']).mean()
| flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|-----------|-----------------------|--------|----------|----------|----------|
| big | 60.0 | 10.0 | 20.6 | 34.20 | 89.8 |
| small | 55.0 | 2.2 | 3.2 | 6.75 | 6.6 |

We can of course store the aggregated dataframe in a new variable.

aggr_packed_column = packed_column.groupby(['flow type']).mean()
aggr_packed_column
| flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|-----------|-----------------------|--------|----------|----------|----------|
| big | 60.0 | 10.0 | 20.6 | 34.20 | 89.8 |
| small | 55.0 | 2.2 | 3.2 | 6.75 | 6.6 |

What if we want a different metric for each column?

aggr_packed_column = packed_column.groupby(['flow type']).agg({
    'water flow / air flow': 'sum',
    '0 kg/h': list,
    '100 kg/h': 'mean',
    '200 kg/h': 'max',
    '300 kg/h': 'min'
})
aggr_packed_column
| flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|-----------|-----------------------|--------|----------|----------|----------|
| big | 300 | [5, 7, 10, 13, 15] | 20.6 | 57.0 | 0 |
| small | 275 | [1, 2, 2, 3, 3] | 3.2 | 13.0 | 3 |

6. Handling duplicates#

A common step in data pre-processing is handling duplicates. Depending on the objective, there are multiple ways to handle duplicate rows: they can be aggregated, removed and so on.

Here we will look into how to check whether there are any duplicated rows, or duplicated values in a specific column, and how to drop them.

# check if there are duplicates in the dataframe
packed_column.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool
# check if there are duplicates in a specific column
packed_column.duplicated(subset=['flow type'])
0    False
1     True
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9     True
dtype: bool
# drop duplicates - a row is only dropped if the entire row is identical to another row
packed_column.drop_duplicates()
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 0 | small | 15 | 1 | 2 | 3.0 | 3 |
| 1 | small | 30 | 2 | 2 | 5.0 | 3 |
| 2 | small | 50 | 2 | 3 | 6.0 | 5 |
| 3 | small | 80 | 3 | 4 | NaN | 9 |
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 5 | big | 40 | 5 | 9 | 14.0 | 25 |
| 6 | big | 50 | 7 | 14 | 22.0 | 39 |
| 7 | big | 60 | 10 | 19 | 32.0 | 161 |
| 8 | big | 70 | 13 | 27 | 46.0 | 224 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |
# drop any duplicates in a specified column and keep the last occurrence
packed_column.drop_duplicates(subset=['flow type'], keep='last')
|   | flow type | water flow / air flow | 0 kg/h | 100 kg/h | 200 kg/h | 300 kg/h |
|---|-----------|-----------------------|--------|----------|----------|----------|
| 4 | small | 100 | 3 | 5 | 13.0 | 13 |
| 9 | big | 80 | 15 | 34 | 57.0 | 0 |