{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "2KGt0M0jmhmT"
},
"source": [
"# 11. Real-world problems and additional topics\n",
"\n",
"During this course we have either used simple demo examples designed to illustrate specific aspects of Pandas, or real-world data where we ignored some of the potentials problems. In this notebook, we address a few of the common problems that you might encounter. We also address some other common situations that you might encounter such as dealing with specific formats like dates."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "ZOGIWe8fm-lT",
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nO7Di3iDm7Y1"
},
"source": [
"## Absent values\n",
"\n",
"Very often, data that you need are imperfect: they migth have missing values or some values might be \"incorrect\", e.g. in a column supposed to contain numbers you suddently encounter text. Pandas offers many ways to fix these problems.\n",
"\n",
"Let's once again import our penguin data:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "5m76IE2cnadj",
"tags": []
},
"outputs": [],
"source": [
"penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 1565,
"status": "ok",
"timestamp": 1614505363769,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "vnTRg-CyoEHg",
"outputId": "d35e11a0-880c-4758-d9c6-736899f80bb1",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(344, 8)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.shape"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"executionInfo": {
"elapsed": 2066,
"status": "ok",
"timestamp": 1614505364499,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "DCZsRpzbnkoW",
"outputId": "cdaa2e60-3d24-4f20-f47d-0caade5317ef",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.1 | \n",
" 18.7 | \n",
" 181.0 | \n",
" 3750.0 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.5 | \n",
" 17.4 | \n",
" 186.0 | \n",
" 3800.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 2 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.3 | \n",
" 18.0 | \n",
" 195.0 | \n",
" 3250.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 3 | \n",
" Adelie | \n",
" Torgersen | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2007 | \n",
"
\n",
" \n",
" 4 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.7 | \n",
" 19.3 | \n",
" 193.0 | \n",
" 3450.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"3 Adelie Torgersen NaN NaN NaN \n",
"4 Adelie Torgersen 36.7 19.3 193.0 \n",
"\n",
" body_mass_g sex year \n",
"0 3750.0 male 2007 \n",
"1 3800.0 female 2007 \n",
"2 3250.0 female 2007 \n",
"3 NaN NaN 2007 \n",
"4 3450.0 female 2007 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HCWF9lrYnncQ"
},
"source": [
"Here we see that some values are missing and filled with ```NaN```. As we have seen previously, Pandas is capable of handling ```NaN``` values e.g. when calculating statistics. But sometimes we want to remove those values. We can for example use ```.dropna()``` for this:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "wXxAP6Uvn_IN",
"tags": []
},
"outputs": [],
"source": [
"penguins_clean = penguins.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"executionInfo": {
"elapsed": 1431,
"status": "ok",
"timestamp": 1614505364963,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "LKYIaY3ooBmV",
"outputId": "38c24c5e-9242-4683-9cf2-d743bae43186",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.1 | \n",
" 18.7 | \n",
" 181.0 | \n",
" 3750.0 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.5 | \n",
" 17.4 | \n",
" 186.0 | \n",
" 3800.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 2 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.3 | \n",
" 18.0 | \n",
" 195.0 | \n",
" 3250.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 4 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.7 | \n",
" 19.3 | \n",
" 193.0 | \n",
" 3450.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 5 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.3 | \n",
" 20.6 | \n",
" 190.0 | \n",
" 3650.0 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"4 Adelie Torgersen 36.7 19.3 193.0 \n",
"5 Adelie Torgersen 39.3 20.6 190.0 \n",
"\n",
" body_mass_g sex year \n",
"0 3750.0 male 2007 \n",
"1 3800.0 female 2007 \n",
"2 3250.0 female 2007 \n",
"4 3450.0 female 2007 \n",
"5 3650.0 male 2007 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins_clean.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KciDiGOJoXGn"
},
"source": [
"We see that for example the index 3 has disappeared. ```.dropna()``` drops each line that has an ```NaN``` in any column.\n",
"\n",
"You can also be more precise and check for ```NaN``` in a given column:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 1224,
"status": "ok",
"timestamp": 1614505365771,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "ZXT99zLAaGqh",
"outputId": "fd3dcee8-4af2-4165-ad09-271aeca75889",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 True\n",
"4 False\n",
" ... \n",
"339 False\n",
"340 False\n",
"341 False\n",
"342 False\n",
"343 False\n",
"Name: bill_length_mm, Length: 344, dtype: bool"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.bill_length_mm.isna()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e8ONUqmWaMhA"
},
"source": [
"Using indexing, we can now remove **only lines** which have ```NA``` in that column:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"executionInfo": {
"elapsed": 1220,
"status": "ok",
"timestamp": 1614505366575,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "obBjaiLnaU-7",
"outputId": "378550b2-dbae-4dfa-9f3f-8504fb5402c5",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.1 | \n",
" 18.7 | \n",
" 181.0 | \n",
" 3750.0 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.5 | \n",
" 17.4 | \n",
" 186.0 | \n",
" 3800.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 2 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.3 | \n",
" 18.0 | \n",
" 195.0 | \n",
" 3250.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 4 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.7 | \n",
" 19.3 | \n",
" 193.0 | \n",
" 3450.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 5 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.3 | \n",
" 20.6 | \n",
" 190.0 | \n",
" 3650.0 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"4 Adelie Torgersen 36.7 19.3 193.0 \n",
"5 Adelie Torgersen 39.3 20.6 190.0 \n",
"\n",
" body_mass_g sex year \n",
"0 3750.0 male 2007 \n",
"1 3800.0 female 2007 \n",
"2 3250.0 female 2007 \n",
"4 3450.0 female 2007 \n",
"5 3650.0 male 2007 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins_clean2 = penguins[~penguins.bill_length_mm.isna()]\n",
"penguins_clean2.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zSV-nsx0admX"
},
"source": [
"## Replacing *bad* values\n",
"\n",
"When importing a table, Pandas is capable of recognizing specific values as being ```NaN```. For example the text NaN or missing values are turned in \"offficial\" ```NaN```. But often there are just mistakes in the table. For example:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "3Ns1W95HdLzx",
"tags": []
},
"outputs": [],
"source": [
"composer_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'\n",
"composers = pd.read_excel(composer_url, sheet_name=1)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"executionInfo": {
"elapsed": 1698,
"status": "ok",
"timestamp": 1614505368437,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "-jx4v6LD8nhn",
"outputId": "0a6f5976-6f64-4053-a2f5-9dbd69ac5ed4",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" composer | \n",
" birth | \n",
" death | \n",
" city | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Mahler | \n",
" 1860.0 | \n",
" 1911 | \n",
" Kaliste | \n",
"
\n",
" \n",
" 1 | \n",
" Beethoven | \n",
" 1770.0 | \n",
" 1827 | \n",
" Bonn | \n",
"
\n",
" \n",
" 2 | \n",
" Puccini | \n",
" 1858.0 | \n",
" 1924 | \n",
" Lucques | \n",
"
\n",
" \n",
" 3 | \n",
" Shostakovich | \n",
" 1906.0 | \n",
" 1975 | \n",
" Saint-Petersburg | \n",
"
\n",
" \n",
" 4 | \n",
" Sibelius | \n",
" 10.0 | \n",
" unknown | \n",
" unknown | \n",
"
\n",
" \n",
" 5 | \n",
" Haydn | \n",
" NaN | \n",
" NaN | \n",
" Röhrau | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" composer birth death city\n",
"0 Mahler 1860.0 1911 Kaliste\n",
"1 Beethoven 1770.0 1827 Bonn\n",
"2 Puccini 1858.0 1924 Lucques\n",
"3 Shostakovich 1906.0 1975 Saint-Petersburg\n",
"4 Sibelius 10.0 unknown unknown\n",
"5 Haydn NaN NaN Röhrau"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"composers"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r3qwJzlodZZY"
},
"source": [
"Here we see that we have both ```NaN``` values and other \"bad\" values in the form of ```uknown``` text. This is a problem because now the columns containing that text are not \"number-columns\" anymore and for example we can't do operations:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 368
},
"executionInfo": {
"elapsed": 805,
"status": "error",
"timestamp": 1614505368437,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "e8THU0978j_V",
"outputId": "a14cd42f-ba74-494f-ffad-0e0e240f7a45",
"tags": []
},
"outputs": [
{
"ename": "TypeError",
"evalue": "loop of ufunc does not support argument 0 of type int which has no callable log method",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;31mAttributeError\u001b[0m: 'int' object has no attribute 'log'",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[11], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlog\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcomposers\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mdeath\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/generic.py:2113\u001b[0m, in \u001b[0;36mNDFrame.__array_ufunc__\u001b[0;34m(self, ufunc, method, *inputs, **kwargs)\u001b[0m\n\u001b[1;32m 2109\u001b[0m \u001b[38;5;129m@final\u001b[39m\n\u001b[1;32m 2110\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m__array_ufunc__\u001b[39m(\n\u001b[1;32m 2111\u001b[0m \u001b[38;5;28mself\u001b[39m, ufunc: np\u001b[38;5;241m.\u001b[39mufunc, method: \u001b[38;5;28mstr\u001b[39m, \u001b[38;5;241m*\u001b[39minputs: Any, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs: Any\n\u001b[1;32m 2112\u001b[0m ):\n\u001b[0;32m-> 2113\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marraylike\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43marray_ufunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mufunc\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/arraylike.py:402\u001b[0m, in \u001b[0;36marray_ufunc\u001b[0;34m(self, ufunc, method, *inputs, **kwargs)\u001b[0m\n\u001b[1;32m 399\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mndim \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[1;32m 400\u001b[0m \u001b[38;5;66;03m# ufunc(series, ...)\u001b[39;00m\n\u001b[1;32m 401\u001b[0m inputs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mtuple\u001b[39m(extract_array(x, extract_numpy\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m) \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m inputs)\n\u001b[0;32m--> 402\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mgetattr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mufunc\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m)\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 403\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 404\u001b[0m \u001b[38;5;66;03m# ufunc(dataframe)\u001b[39;00m\n\u001b[1;32m 405\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m method \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m__call__\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m kwargs:\n\u001b[1;32m 406\u001b[0m \u001b[38;5;66;03m# for np.(..) calls\u001b[39;00m\n\u001b[1;32m 407\u001b[0m \u001b[38;5;66;03m# kwargs cannot necessarily be handled block-by-block, so only\u001b[39;00m\n\u001b[1;32m 408\u001b[0m \u001b[38;5;66;03m# take this path if there are no kwargs\u001b[39;00m\n",
"\u001b[0;31mTypeError\u001b[0m: loop of ufunc does not support argument 0 of type int which has no callable log method"
]
}
],
"source": [
"np.log(composers['death'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "B_egq9LI8t3b"
},
"source": [
"We therefore have first to clean-up the table. One solution is to replace all bade values. For that we can use ```.replace```:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"executionInfo": {
"elapsed": 681,
"status": "ok",
"timestamp": 1614505369387,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "H7ohQK6U86Rv",
"outputId": "03352ec2-57bd-415b-ce32-be54b75e2d54",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" composer | \n",
" birth | \n",
" death | \n",
" city | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Mahler | \n",
" 1860.0 | \n",
" 1911.0 | \n",
" Kaliste | \n",
"
\n",
" \n",
" 1 | \n",
" Beethoven | \n",
" 1770.0 | \n",
" 1827.0 | \n",
" Bonn | \n",
"
\n",
" \n",
" 2 | \n",
" Puccini | \n",
" 1858.0 | \n",
" 1924.0 | \n",
" Lucques | \n",
"
\n",
" \n",
" 3 | \n",
" Shostakovich | \n",
" 1906.0 | \n",
" 1975.0 | \n",
" Saint-Petersburg | \n",
"
\n",
" \n",
" 4 | \n",
" Sibelius | \n",
" 10.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" Haydn | \n",
" NaN | \n",
" NaN | \n",
" Röhrau | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" composer birth death city\n",
"0 Mahler 1860.0 1911.0 Kaliste\n",
"1 Beethoven 1770.0 1827.0 Bonn\n",
"2 Puccini 1858.0 1924.0 Lucques\n",
"3 Shostakovich 1906.0 1975.0 Saint-Petersburg\n",
"4 Sibelius 10.0 NaN NaN\n",
"5 Haydn NaN NaN Röhrau"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"composers_clean = composers.replace('unknown', np.nan)\n",
"composers_clean"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 447,
"status": "ok",
"timestamp": 1614505370487,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "6pkqrtZk9ACD",
"outputId": "cafe8ae3-4523-4a4a-ccdc-4944ad33768e",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 7.555382\n",
"1 7.510431\n",
"2 7.562162\n",
"3 7.588324\n",
"4 NaN\n",
"5 NaN\n",
"Name: death, dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.log(composers_clean['death'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MIhMecR79Jo0"
},
"source": [
"Alternatively we can fix this problem at import time as well because we might have a lsit of all \"bad entries\". For that we can use the option"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"executionInfo": {
"elapsed": 653,
"status": "ok",
"timestamp": 1614505372062,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "oYfmjghQ9VRh",
"outputId": "f7a66442-7a7f-40a9-fa04-a5acfee8264d",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" composer | \n",
" birth | \n",
" death | \n",
" city | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Mahler | \n",
" 1860.0 | \n",
" 1911.0 | \n",
" Kaliste | \n",
"
\n",
" \n",
" 1 | \n",
" Beethoven | \n",
" 1770.0 | \n",
" 1827.0 | \n",
" Bonn | \n",
"
\n",
" \n",
" 2 | \n",
" Puccini | \n",
" 1858.0 | \n",
" 1924.0 | \n",
" Lucques | \n",
"
\n",
" \n",
" 3 | \n",
" Shostakovich | \n",
" 1906.0 | \n",
" 1975.0 | \n",
" Saint-Petersburg | \n",
"
\n",
" \n",
" 4 | \n",
" Sibelius | \n",
" 10.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" Haydn | \n",
" NaN | \n",
" NaN | \n",
" Röhrau | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" composer birth death city\n",
"0 Mahler 1860.0 1911.0 Kaliste\n",
"1 Beethoven 1770.0 1827.0 Bonn\n",
"2 Puccini 1858.0 1924.0 Lucques\n",
"3 Shostakovich 1906.0 1975.0 Saint-Petersburg\n",
"4 Sibelius 10.0 NaN NaN\n",
"5 Haydn NaN NaN Röhrau"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_excel(composer_url, na_values=['unknown'], sheet_name=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0Hf1hMjx9qk7"
},
"source": [
"## Dealing with time\n",
"\n",
"Pandas offers many possibilities to help dealing with data containing time, e.g. to calculate how many days have passed between two dates."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "yfidLUDaJ5jO",
"tags": []
},
"outputs": [],
"source": [
"composers = pd.read_excel(composer_url)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"executionInfo": {
"elapsed": 686,
"status": "ok",
"timestamp": 1614505634547,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "97miVvL5J9VD",
"outputId": "073c4d19-9937-4ab6-d576-68a11f6186bb",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" composer | \n",
" birth | \n",
" death | \n",
" city | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Mahler | \n",
" 1860 | \n",
" 1911 | \n",
" Kaliste | \n",
"
\n",
" \n",
" 1 | \n",
" Beethoven | \n",
" 1770 | \n",
" 1827 | \n",
" Bonn | \n",
"
\n",
" \n",
" 2 | \n",
" Puccini | \n",
" 1858 | \n",
" 1924 | \n",
" Lucques | \n",
"
\n",
" \n",
" 3 | \n",
" Shostakovich | \n",
" 1906 | \n",
" 1975 | \n",
" Saint-Petersburg | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" composer birth death city\n",
"0 Mahler 1860 1911 Kaliste\n",
"1 Beethoven 1770 1827 Bonn\n",
"2 Puccini 1858 1924 Lucques\n",
"3 Shostakovich 1906 1975 Saint-Petersburg"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"composers"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5e0RLSNkBn5X"
},
"source": [
"We transform now the ```birth``` and ```death``` columns into ```datetime``` objects using ```pd.to_datetime``` and by specifying a format, here just a year ```%Y```:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"executionInfo": {
"elapsed": 476,
"status": "ok",
"timestamp": 1614505634913,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "ambmnTlRI8KR",
"outputId": "2f574dcd-5160-498e-d8b3-61b89078da6c",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" composer | \n",
" birth | \n",
" death | \n",
" city | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Mahler | \n",
" 1860-01-01 | \n",
" 1911-01-01 | \n",
" Kaliste | \n",
"
\n",
" \n",
" 1 | \n",
" Beethoven | \n",
" 1770-01-01 | \n",
" 1827-01-01 | \n",
" Bonn | \n",
"
\n",
" \n",
" 2 | \n",
" Puccini | \n",
" 1858-01-01 | \n",
" 1924-01-01 | \n",
" Lucques | \n",
"
\n",
" \n",
" 3 | \n",
" Shostakovich | \n",
" 1906-01-01 | \n",
" 1975-01-01 | \n",
" Saint-Petersburg | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" composer birth death city\n",
"0 Mahler 1860-01-01 1911-01-01 Kaliste\n",
"1 Beethoven 1770-01-01 1827-01-01 Bonn\n",
"2 Puccini 1858-01-01 1924-01-01 Lucques\n",
"3 Shostakovich 1906-01-01 1975-01-01 Saint-Petersburg"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"composers['birth'] = pd.to_datetime(composers['birth'], format = '%Y')\n",
"composers['death'] = pd.to_datetime(composers['death'], format = '%Y')\n",
"composers"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5XyQR8OmB9EJ"
},
"source": [
"You can see that the year has been turned into an actual date (using 1st January as default). We create now an age column:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"executionInfo": {
"elapsed": 785,
"status": "ok",
"timestamp": 1614505666064,
"user": {
"displayName": "Guillaume Witz",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64",
"userId": "16033870147214403532"
},
"user_tz": -60
},
"id": "xwgvIrdIJDH3",
"outputId": "d6259546-67fa-4e95-c7bb-13f707957518",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" composer | \n",
" birth | \n",
" death | \n",
" city | \n",
" age | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Mahler | \n",
" 1860-01-01 | \n",
" 1911-01-01 | \n",
" Kaliste | \n",
" 18627 days | \n",
"
\n",
" \n",
" 1 | \n",
" Beethoven | \n",
" 1770-01-01 | \n",
" 1827-01-01 | \n",
" Bonn | \n",
" 20818 days | \n",
"
\n",
" \n",
" 2 | \n",
" Puccini | \n",
" 1858-01-01 | \n",
" 1924-01-01 | \n",
" Lucques | \n",
" 24105 days | \n",
"
\n",
" \n",
" 3 | \n",
" Shostakovich | \n",
" 1906-01-01 | \n",
" 1975-01-01 | \n",
" Saint-Petersburg | \n",
" 25202 days | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" composer birth death city age\n",
"0 Mahler 1860-01-01 1911-01-01 Kaliste 18627 days\n",
"1 Beethoven 1770-01-01 1827-01-01 Bonn 20818 days\n",
"2 Puccini 1858-01-01 1924-01-01 Lucques 24105 days\n",
"3 Shostakovich 1906-01-01 1975-01-01 Saint-Petersburg 25202 days"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"composers['age'] = (composers.death - composers.birth)\n",
"composers"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XC4LJx9lJy0R"
},
"source": [
"Since the columns are datetime objects, Pandas know that it should actually compute difference in dates and by defaults gives days. Here the computation is trivial, but if you deal with formats such as ```HH:MM:SS``` this is very useful."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4JvfNZ5wCumv"
},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyODbHn5Qz1JFAEv8MvYp96r",
"collapsed_sections": [],
"name": "10-Real_world.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}