{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "2KGt0M0jmhmT" }, "source": [ "# 11. Real-world problems and additional topics\n", "\n", "During this course we have either used simple demo examples designed to illustrate specific aspects of Pandas, or real-world data where we ignored some of the potentials problems. In this notebook, we address a few of the common problems that you might encounter. We also address some other common situations that you might encounter such as dealing with specific formats like dates." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "ZOGIWe8fm-lT", "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "id": "nO7Di3iDm7Y1" }, "source": [ "## Absent values\n", "\n", "Very often, data that you need are imperfect: they migth have missing values or some values might be \"incorrect\", e.g. in a column supposed to contain numbers you suddently encounter text. Pandas offers many ways to fix these problems.\n", "\n", "Let's once again import our penguin data:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "5m76IE2cnadj", "tags": [] }, "outputs": [], "source": [ "penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1565, "status": "ok", "timestamp": 1614505363769, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "vnTRg-CyoEHg", "outputId": "d35e11a0-880c-4758-d9c6-736899f80bb1", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "(344, 8)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 2066, "status": "ok", "timestamp": 1614505364499, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "DCZsRpzbnkoW", "outputId": "cdaa2e60-3d24-4f20-f47d-0caade5317ef", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
0AdelieTorgersen39.118.7181.03750.0male2007
1AdelieTorgersen39.517.4186.03800.0female2007
2AdelieTorgersen40.318.0195.03250.0female2007
3AdelieTorgersenNaNNaNNaNNaNNaN2007
4AdelieTorgersen36.719.3193.03450.0female2007
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "3 Adelie Torgersen NaN NaN NaN \n", "4 Adelie Torgersen 36.7 19.3 193.0 \n", "\n", " body_mass_g sex year \n", "0 3750.0 male 2007 \n", "1 3800.0 female 2007 \n", "2 3250.0 female 2007 \n", "3 NaN NaN 2007 \n", "4 3450.0 female 2007 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.head(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "HCWF9lrYnncQ" }, "source": [ "Here we see that some values are missing and filled with ```NaN```. As we have seen previously, Pandas is capable of handling ```NaN``` values e.g. when calculating statistics. But sometimes we want to remove those values. We can for example use ```.dropna()``` for this:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "wXxAP6Uvn_IN", "tags": [] }, "outputs": [], "source": [ "penguins_clean = penguins.dropna()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 1431, "status": "ok", "timestamp": 1614505364963, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "LKYIaY3ooBmV", "outputId": "38c24c5e-9242-4683-9cf2-d743bae43186", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
0AdelieTorgersen39.118.7181.03750.0male2007
1AdelieTorgersen39.517.4186.03800.0female2007
2AdelieTorgersen40.318.0195.03250.0female2007
4AdelieTorgersen36.719.3193.03450.0female2007
5AdelieTorgersen39.320.6190.03650.0male2007
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "4 Adelie Torgersen 36.7 19.3 193.0 \n", "5 Adelie Torgersen 39.3 20.6 190.0 \n", "\n", " body_mass_g sex year \n", "0 3750.0 male 2007 \n", "1 3800.0 female 2007 \n", "2 3250.0 female 2007 \n", "4 3450.0 female 2007 \n", "5 3650.0 male 2007 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins_clean.head(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "KciDiGOJoXGn" }, "source": [ "We see that for example the index 3 has disappeared. ```.dropna()``` drops each line that has an ```NaN``` in any column.\n", "\n", "You can also be more precise and check for ```NaN``` in a given column:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1224, "status": "ok", "timestamp": 1614505365771, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "ZXT99zLAaGqh", "outputId": "fd3dcee8-4af2-4165-ad09-271aeca75889", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 True\n", "4 False\n", " ... \n", "339 False\n", "340 False\n", "341 False\n", "342 False\n", "343 False\n", "Name: bill_length_mm, Length: 344, dtype: bool" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.bill_length_mm.isna()" ] }, { "cell_type": "markdown", "metadata": { "id": "e8ONUqmWaMhA" }, "source": [ "Using indexing, we can now remove **only lines** which have ```NA``` in that column:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 1220, "status": "ok", "timestamp": 1614505366575, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "obBjaiLnaU-7", "outputId": "378550b2-dbae-4dfa-9f3f-8504fb5402c5", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
0AdelieTorgersen39.118.7181.03750.0male2007
1AdelieTorgersen39.517.4186.03800.0female2007
2AdelieTorgersen40.318.0195.03250.0female2007
4AdelieTorgersen36.719.3193.03450.0female2007
5AdelieTorgersen39.320.6190.03650.0male2007
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "4 Adelie Torgersen 36.7 19.3 193.0 \n", "5 Adelie Torgersen 39.3 20.6 190.0 \n", "\n", " body_mass_g sex year \n", "0 3750.0 male 2007 \n", "1 3800.0 female 2007 \n", "2 3250.0 female 2007 \n", "4 3450.0 female 2007 \n", "5 3650.0 male 2007 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins_clean2 = penguins[~penguins.bill_length_mm.isna()]\n", "penguins_clean2.head(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "zSV-nsx0admX" }, "source": [ "## Replacing *bad* values\n", "\n", "When importing a table, Pandas is capable of recognizing specific values as being ```NaN```. For example the text NaN or missing values are turned in \"offficial\" ```NaN```. But often there are just mistakes in the table. For example:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "3Ns1W95HdLzx", "tags": [] }, "outputs": [], "source": [ "composer_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'\n", "composers = pd.read_excel(composer_url, sheet_name=1)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "executionInfo": { "elapsed": 1698, "status": "ok", "timestamp": 1614505368437, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "-jx4v6LD8nhn", "outputId": "0a6f5976-6f64-4053-a2f5-9dbd69ac5ed4", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcity
0Mahler1860.01911Kaliste
1Beethoven1770.01827Bonn
2Puccini1858.01924Lucques
3Shostakovich1906.01975Saint-Petersburg
4Sibelius10.0unknownunknown
5HaydnNaNNaNRöhrau
\n", "
" ], "text/plain": [ " composer birth death city\n", "0 Mahler 1860.0 1911 Kaliste\n", "1 Beethoven 1770.0 1827 Bonn\n", "2 Puccini 1858.0 1924 Lucques\n", "3 Shostakovich 1906.0 1975 Saint-Petersburg\n", "4 Sibelius 10.0 unknown unknown\n", "5 Haydn NaN NaN Röhrau" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers" ] }, { "cell_type": "markdown", "metadata": { "id": "r3qwJzlodZZY" }, "source": [ "Here we see that we have both ```NaN``` values and other \"bad\" values in the form of ```uknown``` text. This is a problem because now the columns containing that text are not \"number-columns\" anymore and for example we can't do operations:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 368 }, "executionInfo": { "elapsed": 805, "status": "error", "timestamp": 1614505368437, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "e8THU0978j_V", "outputId": "a14cd42f-ba74-494f-ffad-0e0e240f7a45", "tags": [] }, "outputs": [ { "ename": "TypeError", "evalue": "loop of ufunc does not support argument 0 of type int which has no callable log method", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;31mAttributeError\u001b[0m: 'int' object has no attribute 'log'", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[11], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlog\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcomposers\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mdeath\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/generic.py:2113\u001b[0m, in \u001b[0;36mNDFrame.__array_ufunc__\u001b[0;34m(self, ufunc, method, *inputs, **kwargs)\u001b[0m\n\u001b[1;32m 2109\u001b[0m \u001b[38;5;129m@final\u001b[39m\n\u001b[1;32m 2110\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m__array_ufunc__\u001b[39m(\n\u001b[1;32m 2111\u001b[0m \u001b[38;5;28mself\u001b[39m, ufunc: np\u001b[38;5;241m.\u001b[39mufunc, method: \u001b[38;5;28mstr\u001b[39m, \u001b[38;5;241m*\u001b[39minputs: Any, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs: Any\n\u001b[1;32m 2112\u001b[0m ):\n\u001b[0;32m-> 2113\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marraylike\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43marray_ufunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mufunc\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/arraylike.py:402\u001b[0m, in \u001b[0;36marray_ufunc\u001b[0;34m(self, ufunc, method, *inputs, **kwargs)\u001b[0m\n\u001b[1;32m 399\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mndim \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[1;32m 400\u001b[0m \u001b[38;5;66;03m# ufunc(series, ...)\u001b[39;00m\n\u001b[1;32m 401\u001b[0m inputs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mtuple\u001b[39m(extract_array(x, extract_numpy\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m) \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m inputs)\n\u001b[0;32m--> 402\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mgetattr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mufunc\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m)\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 403\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 404\u001b[0m \u001b[38;5;66;03m# ufunc(dataframe)\u001b[39;00m\n\u001b[1;32m 405\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m method \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m__call__\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m kwargs:\n\u001b[1;32m 406\u001b[0m \u001b[38;5;66;03m# for np.(..) calls\u001b[39;00m\n\u001b[1;32m 407\u001b[0m \u001b[38;5;66;03m# kwargs cannot necessarily be handled block-by-block, so only\u001b[39;00m\n\u001b[1;32m 408\u001b[0m \u001b[38;5;66;03m# take this path if there are no kwargs\u001b[39;00m\n", "\u001b[0;31mTypeError\u001b[0m: loop of ufunc does not support argument 0 of type int which has no callable log method" ] } ], "source": [ "np.log(composers['death'])" ] }, { "cell_type": "markdown", "metadata": { "id": "B_egq9LI8t3b" }, "source": [ "We therefore have first to clean-up the table. One solution is to replace all bade values. For that we can use ```.replace```:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "executionInfo": { "elapsed": 681, "status": "ok", "timestamp": 1614505369387, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "H7ohQK6U86Rv", "outputId": "03352ec2-57bd-415b-ce32-be54b75e2d54", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcity
0Mahler1860.01911.0Kaliste
1Beethoven1770.01827.0Bonn
2Puccini1858.01924.0Lucques
3Shostakovich1906.01975.0Saint-Petersburg
4Sibelius10.0NaNNaN
5HaydnNaNNaNRöhrau
\n", "
" ], "text/plain": [ " composer birth death city\n", "0 Mahler 1860.0 1911.0 Kaliste\n", "1 Beethoven 1770.0 1827.0 Bonn\n", "2 Puccini 1858.0 1924.0 Lucques\n", "3 Shostakovich 1906.0 1975.0 Saint-Petersburg\n", "4 Sibelius 10.0 NaN NaN\n", "5 Haydn NaN NaN Röhrau" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers_clean = composers.replace('unknown', np.nan)\n", "composers_clean" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 447, "status": "ok", "timestamp": 1614505370487, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "6pkqrtZk9ACD", "outputId": "cafe8ae3-4523-4a4a-ccdc-4944ad33768e", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 7.555382\n", "1 7.510431\n", "2 7.562162\n", "3 7.588324\n", "4 NaN\n", "5 NaN\n", "Name: death, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.log(composers_clean['death'])" ] }, { "cell_type": "markdown", "metadata": { "id": "MIhMecR79Jo0" }, "source": [ "Alternatively we can fix this problem at import time as well because we might have a lsit of all \"bad entries\". For that we can use the option" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "executionInfo": { "elapsed": 653, "status": "ok", "timestamp": 1614505372062, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "oYfmjghQ9VRh", "outputId": "f7a66442-7a7f-40a9-fa04-a5acfee8264d", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcity
0Mahler1860.01911.0Kaliste
1Beethoven1770.01827.0Bonn
2Puccini1858.01924.0Lucques
3Shostakovich1906.01975.0Saint-Petersburg
4Sibelius10.0NaNNaN
5HaydnNaNNaNRöhrau
\n", "
" ], "text/plain": [ " composer birth death city\n", "0 Mahler 1860.0 1911.0 Kaliste\n", "1 Beethoven 1770.0 1827.0 Bonn\n", "2 Puccini 1858.0 1924.0 Lucques\n", "3 Shostakovich 1906.0 1975.0 Saint-Petersburg\n", "4 Sibelius 10.0 NaN NaN\n", "5 Haydn NaN NaN Röhrau" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_excel(composer_url, na_values=['unknown'], sheet_name=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "0Hf1hMjx9qk7" }, "source": [ "## Dealing with time\n", "\n", "Pandas offers many possibilities to help dealing with data containing time, e.g. to calculate how many days have passed between two dates." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "yfidLUDaJ5jO", "tags": [] }, "outputs": [], "source": [ "composers = pd.read_excel(composer_url)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 686, "status": "ok", "timestamp": 1614505634547, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "97miVvL5J9VD", "outputId": "073c4d19-9937-4ab6-d576-68a11f6186bb", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcity
0Mahler18601911Kaliste
1Beethoven17701827Bonn
2Puccini18581924Lucques
3Shostakovich19061975Saint-Petersburg
\n", "
" ], "text/plain": [ " composer birth death city\n", "0 Mahler 1860 1911 Kaliste\n", "1 Beethoven 1770 1827 Bonn\n", "2 Puccini 1858 1924 Lucques\n", "3 Shostakovich 1906 1975 Saint-Petersburg" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers" ] }, { "cell_type": "markdown", "metadata": { "id": "5e0RLSNkBn5X" }, "source": [ "We transform now the ```birth``` and ```death``` columns into ```datetime``` objects using ```pd.to_datetime``` and by specifying a format, here just a year ```%Y```:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 476, "status": "ok", "timestamp": 1614505634913, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "ambmnTlRI8KR", "outputId": "2f574dcd-5160-498e-d8b3-61b89078da6c", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcity
0Mahler1860-01-011911-01-01Kaliste
1Beethoven1770-01-011827-01-01Bonn
2Puccini1858-01-011924-01-01Lucques
3Shostakovich1906-01-011975-01-01Saint-Petersburg
\n", "
" ], "text/plain": [ " composer birth death city\n", "0 Mahler 1860-01-01 1911-01-01 Kaliste\n", "1 Beethoven 1770-01-01 1827-01-01 Bonn\n", "2 Puccini 1858-01-01 1924-01-01 Lucques\n", "3 Shostakovich 1906-01-01 1975-01-01 Saint-Petersburg" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers['birth'] = pd.to_datetime(composers['birth'], format = '%Y')\n", "composers['death'] = pd.to_datetime(composers['death'], format = '%Y')\n", "composers" ] }, { "cell_type": "markdown", "metadata": { "id": "5XyQR8OmB9EJ" }, "source": [ "You can see that the year has been turned into an actual date (using 1st January as default). We create now an age column:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 785, "status": "ok", "timestamp": 1614505666064, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "xwgvIrdIJDH3", "outputId": "d6259546-67fa-4e95-c7bb-13f707957518", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcityage
0Mahler1860-01-011911-01-01Kaliste18627 days
1Beethoven1770-01-011827-01-01Bonn20818 days
2Puccini1858-01-011924-01-01Lucques24105 days
3Shostakovich1906-01-011975-01-01Saint-Petersburg25202 days
\n", "
" ], "text/plain": [ " composer birth death city age\n", "0 Mahler 1860-01-01 1911-01-01 Kaliste 18627 days\n", "1 Beethoven 1770-01-01 1827-01-01 Bonn 20818 days\n", "2 Puccini 1858-01-01 1924-01-01 Lucques 24105 days\n", "3 Shostakovich 1906-01-01 1975-01-01 Saint-Petersburg 25202 days" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers['age'] = (composers.death - composers.birth)\n", "composers" ] }, { "cell_type": "markdown", "metadata": { "id": "XC4LJx9lJy0R" }, "source": [ "Since the columns are datetime objects, Pandas know that it should actually compute difference in dates and by defaults gives days. Here the computation is trivial, but if you deal with formats such as ```HH:MM:SS``` this is very useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4JvfNZ5wCumv" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyODbHn5Qz1JFAEv8MvYp96r", "collapsed_sections": [], "name": "10-Real_world.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 4 }