{ "cells": [ { "cell_type": "markdown", "id": "26f64782-0175-4b83-b337-2e3eb95f010c", "metadata": {}, "source": [ "# Pandas DataFrames" ] }, { "cell_type": "markdown", "id": "2a84422f-8838-49d1-b24a-096171ff8760", "metadata": {}, "source": [ "`pandas` is a very popular Python library, which allows data to be loaded as a dataframe. \n", "The library allows many operations to be easily performed on the data, such as calculating the mean or the sum of a column, transpose the dataframe, aggregate values and so on. Moreover, it is widely used for statistical data analysis.\n", "\n", "It is very well documented, more information and tutorials about the package can be found [here](https://pandas.pydata.org/docs/index.html).\n", "\n", "For simplicity, the convention is to import `pandas` as `pd`, but in principle, it can be called anything, although it is always a good idea to stick to the accepted conventions to ensure code readability. " ] }, { "cell_type": "markdown", "id": "4e5f6325-24b3-474e-954e-a446786e0dca", "metadata": {}, "source": [ "`pandas` allow data to be loaded from many different formats and has many functions implemented for this purpose.\n", "Here, we will load a .csv file, but you can load Excel, txt, JSON data as well as many other formats." ] }, { "cell_type": "markdown", "id": "bb313b60-7ad4-466f-aeb2-324afdb23c3c", "metadata": {}, "source": [ "## 1. Loading and inspecting data" ] }, { "cell_type": "code", "execution_count": 1, "id": "37e911d8-d9ba-4b69-82b0-6e67ff021e9a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
0small151233
1small302253
2small502365
3small803499
4small100351313
5big40591425
6big507142239
7big60101932161
8big70132746224
9big801534570
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "0 small 15 1 2 3 3\n", "1 small 30 2 2 5 3\n", "2 small 50 2 3 6 5\n", "3 small 80 3 4 9 9\n", "4 small 100 3 5 13 13\n", "5 big 40 5 9 14 25\n", "6 big 50 7 14 22 39\n", "7 big 60 10 19 32 161\n", "8 big 70 13 27 46 224\n", "9 big 80 15 34 57 0" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# if you have problems loading the data, provide the full path\n", "# add the path to where you saved your packed_column.csv here\n", "# the path should look like this: C:/Users/my_user/Desktop/week02/theory/\n", "#path = \n", "#packed_column = pd.read_csv(f'{path}packed_column.csv', encoding='utf-8', sep=';')\n", "\n", "# use read_csv to lead data from .csv file\n", "# here you can see that we provide the encoding and a separator\n", "packed_column = pd.read_csv('packed_column.csv', encoding='utf-8', sep=';')\n", "packed_column" ] }, { "cell_type": "markdown", "id": "98e4c935-990b-4653-9492-98d2ec515b69", "metadata": {}, "source": [ "In the cell above we loaded our DataFrame from a csv file, but what is actually a *DataFrame*?\n", "It is basically just a matrix (or table, if you prefer thinking about Excel tables) with rows and columns." ] }, { "cell_type": "markdown", "id": "d4a9f427-778d-4ba6-b684-d5af1466ebab", "metadata": {}, "source": [ "Let's now see some `pandas` operations" ] }, { "cell_type": "code", "execution_count": 2, "id": "6cb324a4-2f51-440c-a8e5-c5512d550e8c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
0small151233
1small302253
2small502365
3small803499
4small100351313
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "0 small 15 1 2 3 3\n", "1 small 30 2 2 5 3\n", "2 small 50 2 3 6 5\n", "3 small 80 3 4 9 9\n", "4 small 100 3 5 13 13" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect the top 5 elements in the dataframe\n", "packed_column.head()" ] }, { "cell_type": "code", "execution_count": 3, "id": "0867f753-ef63-4e43-829d-0f016b47541c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
0small151233
1small302253
2small502365
3small803499
4small100351313
5big40591425
6big507142239
7big60101932161
8big70132746224
9big801534570
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "0 small 15 1 2 3 3\n", "1 small 30 2 2 5 3\n", "2 small 50 2 3 6 5\n", "3 small 80 3 4 9 9\n", "4 small 100 3 5 13 13\n", "5 big 40 5 9 14 25\n", "6 big 50 7 14 22 39\n", "7 big 60 10 19 32 161\n", "8 big 70 13 27 46 224\n", "9 big 80 15 34 57 0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect the top 10 elements in the dataframe\n", "packed_column.head(10)" ] }, { "cell_type": "code", "execution_count": 4, "id": "8340f18f-f232-4795-8f0a-52d3c8bf39ab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
5big40591425
6big507142239
7big60101932161
8big70132746224
9big801534570
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "5 big 40 5 9 14 25\n", "6 big 50 7 14 22 39\n", "7 big 60 10 19 32 161\n", "8 big 70 13 27 46 224\n", "9 big 80 15 34 57 0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect the bottom 5 elements in the dataframe\n", "packed_column.tail()" ] }, { "cell_type": "code", "execution_count": 5, "id": "e5edc6ab-aaff-4a28-be4e-de18bbec35a3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['flow type', 'water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h',\n", " '300 kg/h'],\n", " dtype='object')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# print the columns\n", "packed_column.columns" ] }, { "cell_type": "code", "execution_count": 6, "id": "0584a863-6ff9-41a3-970c-6d609741ec89", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "water flow / air flow 57.5\n", "0 kg/h 6.1\n", "100 kg/h 11.9\n", "200 kg/h 20.7\n", "300 kg/h 48.2\n", "dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the mean of each column\n", "packed_column[['water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h', '300 kg/h']].mean()" ] }, { "cell_type": "code", "execution_count": 7, "id": "1cf4d185-1567-49aa-a12a-b22f56db2638", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
water flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
0151233
1302253
2502365
3803499
4100351313
540591425
6507142239
760101932161
870132746224
9801534570
\n", "
" ], "text/plain": [ " water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "0 15 1 2 3 3\n", "1 30 2 2 5 3\n", "2 50 2 3 6 5\n", "3 80 3 4 9 9\n", "4 100 3 5 13 13\n", "5 40 5 9 14 25\n", "6 50 7 14 22 39\n", "7 60 10 19 32 161\n", "8 70 13 27 46 224\n", "9 80 15 34 57 0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "packed_column[['water flow / air flow', '0 kg/h', '100 kg/h', '200 kg/h', '300 kg/h']]" ] }, { "cell_type": "code", "execution_count": 7, "id": "b2792d82-644d-4163-bf5a-4b5bef7aa6e4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "flow type smallsmallsmallsmallsmallbigbigbigbigbig\n", "water flow / air flow 575\n", "0 kg/h 61\n", "100 kg/h 119\n", "200 kg/h 207\n", "300 kg/h 482\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the sum of each column\n", "packed_column.sum()" ] }, { "cell_type": "markdown", "id": "d6fe0f72-85e1-4183-b3c3-13702083f05e", "metadata": {}, "source": [ "**Note: `describe` is a very useful function to remember!**" ] }, { "cell_type": "code", "execution_count": 8, "id": "30107fb4-2df5-4e82-8e69-d1117342149c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
water flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
count10.00000010.00000010.00000010.00000010.000000
mean57.5000006.10000011.90000020.70000048.200000
std25.7390754.97661211.37687718.57148478.389058
min15.0000001.0000002.0000003.0000000.000000
25%42.5000002.2500003.2500006.7500003.500000
50%55.0000004.0000007.00000013.50000011.000000
75%77.5000009.25000017.75000029.50000035.500000
max100.00000015.00000034.00000057.000000224.000000
\n", "
" ], "text/plain": [ " water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "count 10.000000 10.000000 10.000000 10.000000 10.000000\n", "mean 57.500000 6.100000 11.900000 20.700000 48.200000\n", "std 25.739075 4.976612 11.376877 18.571484 78.389058\n", "min 15.000000 1.000000 2.000000 3.000000 0.000000\n", "25% 42.500000 2.250000 3.250000 6.750000 3.500000\n", "50% 55.000000 4.000000 7.000000 13.500000 11.000000\n", "75% 77.500000 9.250000 17.750000 29.500000 35.500000\n", "max 100.000000 15.000000 34.000000 57.000000 224.000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'describe' returns many useful statistics about the data\n", "packed_column.describe()" ] }, { "cell_type": "code", "execution_count": 9, "id": "829e2fc5-9a58-48a5-8b2e-aa9c0abd461a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789
flow typesmallsmallsmallsmallsmallbigbigbigbigbig
water flow / air flow153050801004050607080
0 kg/h1223357101315
100 kg/h22345914192734
200 kg/h3569131422324657
300 kg/h33591325391612240
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 \\\n", "flow type small small small small small big big big big \n", "water flow / air flow 15 30 50 80 100 40 50 60 70 \n", "0 kg/h 1 2 2 3 3 5 7 10 13 \n", "100 kg/h 2 2 3 4 5 9 14 19 27 \n", "200 kg/h 3 5 6 9 13 14 22 32 46 \n", "300 kg/h 3 3 5 9 13 25 39 161 224 \n", "\n", " 9 \n", "flow type big \n", "water flow / air flow 80 \n", "0 kg/h 15 \n", "100 kg/h 34 \n", "200 kg/h 57 \n", "300 kg/h 0 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# transpose a dataframe\n", "packed_column.T" ] }, { "cell_type": "code", "execution_count": 10, "id": "0db8c8a8-e09e-4532-a962-ce08225a9e9b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 24\n", "1 42\n", "2 66\n", "3 105\n", "4 134\n", "5 93\n", "6 132\n", "7 282\n", "8 380\n", "9 186\n", "dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# summing the rows (removing the categorical row) of the transposed dataframe\n", "packed_column.T[1:].sum()" ] }, { "cell_type": "markdown", "id": "14181bf6-3adb-44f4-8a9c-ae24ebc0e70f", "metadata": {}, "source": [ "## 2. Indexing\n", "\n", "It is also possible to *index* a DataFrame to update a value or select a subset of rows and columns we want to work with.\n", "\n", "We will use the `iloc` function for this purpose." ] }, { "cell_type": "code", "execution_count": 11, "id": "c12c712a-e09d-4311-b399-0eda84db1fc5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
0small15123.03
1small30225.03
2small50236.05
3small8034NaN9
4small1003513.013
5big405914.025
6big5071422.039
7big60101932.0161
8big70132746.0224
9big80153457.00
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "0 small 15 1 2 3.0 3\n", "1 small 30 2 2 5.0 3\n", "2 small 50 2 3 6.0 5\n", "3 small 80 3 4 NaN 9\n", "4 small 100 3 5 13.0 13\n", "5 big 40 5 9 14.0 25\n", "6 big 50 7 14 22.0 39\n", "7 big 60 10 19 32.0 161\n", "8 big 70 13 27 46.0 224\n", "9 big 80 15 34 57.0 0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replace the value of row 3 column 4 with None\n", "# remember that in Python we start counting from 0!\n", "packed_column.iloc[3, 4] = None\n", "packed_column" ] }, { "cell_type": "code", "execution_count": 12, "id": "19502a8c-de0b-4b0c-bed0-600894ecbb22", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "flow type small\n", "water flow / air flow 15\n", "0 kg/h 1\n", "100 kg/h 2\n", "200 kg/h 3.0\n", "300 kg/h 3\n", "Name: 0, dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select the first row\n", "packed_column.iloc[0]" ] }, { "cell_type": "code", "execution_count": 13, "id": "527c8e1e-042c-4caa-b3f7-42bd059ef474", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 15\n", "1 30\n", "2 50\n", "3 80\n", "4 100\n", "5 40\n", "6 50\n", "7 60\n", "8 70\n", "9 80\n", "Name: water flow / air flow, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select a specific column\n", "packed_column['water flow / air flow']" ] }, { "cell_type": "markdown", "id": "c36a7bcf-5496-470a-81fb-7e6340007a58", "metadata": {}, "source": [ "## 3. Replacing missing values\n", "\n", "A very important data pre-processing task is imputing missing values.\n", "Depending on the data available and the task to perform, different strategies might be more suitable.\n", "\n", "The most common strategies are:\n", "- replacing the missing values with a specific value (often 0, but it can be any value)\n", "- replacing the missing values with the mean of the columns\n", "- dropping (discarding) the rows with missing values." ] }, { "cell_type": "code", "execution_count": 14, "id": "320f0c53-8e54-4e33-ab2e-28c092380506", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
0small15123.03
1small30225.03
2small50236.05
3small80340.09
4small1003513.013
5big405914.025
6big5071422.039
7big60101932.0161
8big70132746.0224
9big80153457.00
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "0 small 15 1 2 3.0 3\n", "1 small 30 2 2 5.0 3\n", "2 small 50 2 3 6.0 5\n", "3 small 80 3 4 0.0 9\n", "4 small 100 3 5 13.0 13\n", "5 big 40 5 9 14.0 25\n", "6 big 50 7 14 22.0 39\n", "7 big 60 10 19 32.0 161\n", "8 big 70 13 27 46.0 224\n", "9 big 80 15 34 57.0 0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fill missing values with 0\n", "packed_column.fillna(0)" ] }, { "cell_type": "code", "execution_count": 15, "id": "c3225c8f-11ce-4e42-aada-0995516a43f8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
0small15123.03
1small30225.03
2small50236.05
4small1003513.013
5big405914.025
6big5071422.039
7big60101932.0161
8big70132746.0224
9big80153457.00
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "0 small 15 1 2 3.0 3\n", "1 small 30 2 2 5.0 3\n", "2 small 50 2 3 6.0 5\n", "4 small 100 3 5 13.0 13\n", "5 big 40 5 9 14.0 25\n", "6 big 50 7 14 22.0 39\n", "7 big 60 10 19 32.0 161\n", "8 big 70 13 27 46.0 224\n", "9 big 80 15 34 57.0 0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# dropping missing values\n", "packed_column.dropna()" ] }, { "cell_type": "code", "execution_count": 16, "id": "a7b4a53b-2858-460c-9240-3940fc1770fe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
0small15123.03
1small30225.03
2small50236.05
3small803422.09
4small1003513.013
5big405914.025
6big5071422.039
7big60101932.0161
8big70132746.0224
9big80153457.00
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "0 small 15 1 2 3.0 3\n", "1 small 30 2 2 5.0 3\n", "2 small 50 2 3 6.0 5\n", "3 small 80 3 4 22.0 9\n", "4 small 100 3 5 13.0 13\n", "5 big 40 5 9 14.0 25\n", "6 big 50 7 14 22.0 39\n", "7 big 60 10 19 32.0 161\n", "8 big 70 13 27 46.0 224\n", "9 big 80 15 34 57.0 0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fill missing values with the mean of a specific column\n", "packed_column.fillna(packed_column['200 kg/h'].mean())" ] }, { "cell_type": "markdown", "id": "32b9b2a8-f3f8-4896-92dc-e167bb6f77a0", "metadata": {}, "source": [ "## 4. Creating DataFrames from lists and dictionaries\n", "\n", "DataFrames can be created from many different objects: they can be loaded from saved files, as we saw before, but they can also be created from lists and dictionaries, for example." ] }, { "cell_type": "code", "execution_count": 17, "id": "1ac47737-5cab-4c46-8a14-56ca05d9c660", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2
002
114
226
338
4410
\n", "
" ], "text/plain": [ " column 1 column 2\n", "0 0 2\n", "1 1 4\n", "2 2 6\n", "3 3 8\n", "4 4 10" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a df from lists\n", "df = pd.DataFrame()\n", "df['column 1'] = list(range(10))\n", "df['column 2'] = list(range(2, 21, 2))\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 18, "id": "54c70227-709f-4cdd-b479-2e5763581630", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
03579
11563
23470
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 3 57 9\n", "1 1 56 3\n", "2 3 47 0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a df from a list of lists\n", "observations = [[3, 57, 9], [1, 56, 3], [3, 47, 0]]\n", "observations_df = pd.DataFrame(observations)\n", "observations_df" ] }, { "cell_type": "markdown", "id": "bc4549e7-4fcc-46ee-8012-888832d0990c", "metadata": {}, "source": [ "Note that you usually want your rows and column names to be descriptive (so not just using the index of the row/column), this will also help you index and perform other operations." ] }, { "cell_type": "code", "execution_count": 19, "id": "2a33c91e-fe73-42da-b43e-83e2ddf38c5a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
temperature30.0
pH7.0
pressure1.0
\n", "
" ], "text/plain": [ " 0\n", "temperature 30.0\n", "pH 7.0\n", "pressure 1.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a df from a dictionary\n", "dict_conditions = {\n", " \"temperature\": 30,\n", " \"pH\": 7.0,\n", " \"pressure\": 1.0\n", "}\n", "\n", "df = pd.DataFrame.from_dict(dict_conditions, orient='index')\n", "df" ] }, { "cell_type": "code", "execution_count": 20, "id": "b5a57f9d-ec85-47bc-ae65-dfa38e1904f5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
index0
0temperature30.0
1pH7.0
2pressure1.0
\n", "
" ], "text/plain": [ " index 0\n", "0 temperature 30.0\n", "1 pH 7.0\n", "2 pressure 1.0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# you can reset the index to have your keys as a column instead of the index\n", "df = df.reset_index()\n", "df" ] }, { "cell_type": "markdown", "id": "41daf6e1-d53b-49f1-821c-9c36765f579b", "metadata": {}, "source": [ "## 5. Groupby" ] }, { "cell_type": "markdown", "id": "2c6c7dd0-2327-4452-b29b-d32b95393ff1", "metadata": {}, "source": [ "A very useful that you can do in pandas is to aggregate by a certain value across the dataframe and add rows that present the same value according to a specified logic.\n", "Read the official documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)" ] }, { "cell_type": "code", "execution_count": 21, "id": "eb3bea3f-6132-4c16-8fc9-cf422d1a7fe1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
water flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
flow type
big60.010.020.634.2089.8
small55.02.23.26.756.6
\n", "
" ], "text/plain": [ " water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "flow type \n", "big 60.0 10.0 20.6 34.20 89.8\n", "small 55.0 2.2 3.2 6.75 6.6" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# group dataframe by the flow type and calculate the average water flow\n", "packed_column.groupby(['flow type']).mean()" ] }, { "cell_type": "markdown", "id": "42524b5f-0807-49ea-b8d1-8941578abffc", "metadata": {}, "source": [ "We can of course save the new aggregated dataframe." ] }, { "cell_type": "code", "execution_count": 22, "id": "bdc54384-c1f4-4b8b-be7f-868ad022c69d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
water flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
flow type
big60.010.020.634.2089.8
small55.02.23.26.756.6
\n", "
" ], "text/plain": [ " water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "flow type \n", "big 60.0 10.0 20.6 34.20 89.8\n", "small 55.0 2.2 3.2 6.75 6.6" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aggr_packed_column = packed_column.groupby(['flow type']).mean()\n", "aggr_packed_column" ] }, { "cell_type": "markdown", "id": "30f60eb1-8f40-4512-8f54-e61336e7d56b", "metadata": {}, "source": [ "What if we want different metrics in different columns?" ] }, { "cell_type": "code", "execution_count": 23, "id": "3a78cffb-9609-4157-a430-194e0ec89d07", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
water flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
flow type
big300[5, 7, 10, 13, 15]20.657.00
small275[1, 2, 2, 3, 3]3.213.03
\n", "
" ], "text/plain": [ " water flow / air flow 0 kg/h 100 kg/h 200 kg/h \\\n", "flow type \n", "big 300 [5, 7, 10, 13, 15] 20.6 57.0 \n", "small 275 [1, 2, 2, 3, 3] 3.2 13.0 \n", "\n", " 300 kg/h \n", "flow type \n", "big 0 \n", "small 3 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aggr_packed_column = packed_column.groupby(['flow type']).agg({'water flow / air flow': 'sum',\n", " '0 kg/h': list,\n", " '100 kg/h': 'mean',\n", " '200 kg/h': 'max',\n", " '300 kg/h': 'min'})\n", "aggr_packed_column" ] }, { "cell_type": "markdown", "id": "bcc8e797-d9df-4d3b-bb2c-26713112d4ce", "metadata": {}, "source": [ "## 6. Handle duplicates" ] }, { "cell_type": "markdown", "id": "9422601b-94da-4b3f-9862-3cbe047c0ab4", "metadata": {}, "source": [ "A common step in data pre-processing is handling duplicates.\n", "Depending on the objective, there are multiple ways to handle duplicate rows: they can be aggregated, removed and so on.\n", "\n", "Here we will look into how to check whether there are any duplicated rows or values in a column and how to drop these values." ] }, { "cell_type": "code", "execution_count": 24, "id": "bd1ab91c-eeca-4d71-ad6d-7582a8ea3fc4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "dtype: bool" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check if there are duplicates in the dataframe\n", "packed_column.duplicated()" ] }, { "cell_type": "code", "execution_count": 25, "id": "ada9a4d0-e4ce-4de0-a7e5-d87eef2c40b4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 True\n", "3 True\n", "4 True\n", "5 False\n", "6 True\n", "7 True\n", "8 True\n", "9 True\n", "dtype: bool" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check if there are duplicates in a specific column\n", "packed_column.duplicated(subset=['flow type'])" ] }, { "cell_type": "code", "execution_count": 26, "id": "a1a0d8f3-5938-4d57-898c-0c8a1455d290", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
0small15123.03
1small30225.03
2small50236.05
3small8034NaN9
4small1003513.013
5big405914.025
6big5071422.039
7big60101932.0161
8big70132746.0224
9big80153457.00
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "0 small 15 1 2 3.0 3\n", "1 small 30 2 2 5.0 3\n", "2 small 50 2 3 6.0 5\n", "3 small 80 3 4 NaN 9\n", "4 small 100 3 5 13.0 13\n", "5 big 40 5 9 14.0 25\n", "6 big 50 7 14 22.0 39\n", "7 big 60 10 19 32.0 161\n", "8 big 70 13 27 46.0 224\n", "9 big 80 15 34 57.0 0" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop any duplicates - here the whole row needs to be the same to be dropped\n", "packed_column.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 27, "id": "cc824fb7-3cef-45ef-bfd9-b3cb6a13d76a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flow typewater flow / air flow0 kg/h100 kg/h200 kg/h300 kg/h
4small1003513.013
9big80153457.00
\n", "
" ], "text/plain": [ " flow type water flow / air flow 0 kg/h 100 kg/h 200 kg/h 300 kg/h\n", "4 small 100 3 5 13.0 13\n", "9 big 80 15 34 57.0 0" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop any duplicates in a specified column and keep the last occurrence\n", "packed_column.drop_duplicates(subset=['flow type'], keep='last')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }