{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "TD06UcxH2Aap" }, "source": [ "# 8. Operating on DataFrames\n", "\n", "We have seen in the very first chapter that we could easily import CSV or Excel sheets as DataFrames in Python. We have also seen that those dataframes are essentially two-dimensional tables where each element can be located via an index and a column name. We have also seen that each column is in fact a Numpy array and have learned how to handle and process those arrays.\n", "\n", "Here we will show how most of the concepts we learned for Numpy are directly translateable into Pandas." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "executionInfo": { "elapsed": 722, "status": "ok", "timestamp": 1614349113268, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "PppCId1k23Bb", "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "id": "ohXAnizD235F" }, "source": [ "We here again use our composer data:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "executionInfo": { "elapsed": 2000, "status": "ok", "timestamp": 1614349123619, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "IE2lYZGo3ywH", "tags": [] }, "outputs": [], "source": [ "composers = pd.read_excel('https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 1604, "status": "ok", "timestamp": 1614349123620, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "M-T0Ua8J31xZ", "outputId": "1f850085-1c6f-4c7c-a42b-bf1d1cfbbefe", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcity
0Mahler18601911Kaliste
1Beethoven17701827Bonn
2Puccini18581924Lucques
3Shostakovich19061975Saint-Petersburg
\n", "
" ], "text/plain": [ " composer birth death city\n", "0 Mahler 1860 1911 Kaliste\n", "1 Beethoven 1770 1827 Bonn\n", "2 Puccini 1858 1924 Lucques\n", "3 Shostakovich 1906 1975 Saint-Petersburg" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers" ] }, { "cell_type": "markdown", "metadata": { "id": "sHx_oGrs37bT" }, "source": [ "Just like with Numpy arrays, we can apply various mathematical operations on DataFrames columns or entire DataFrames. " ] }, { "cell_type": "markdown", "metadata": { "id": "HbHCGJNEtO3W" }, "source": [ "## Arithmetics single columns\n", "Just like we could do an operation like ```my_array = my_array + 1```, we can do the same with Pandas columns. Doing such operations, we directly reaplace an existing column or create new one like here:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 466, "status": "ok", "timestamp": 1614349212760, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "sMZmTQ894M4M", "outputId": "1748c398-4385-4aed-970f-e17a2825bc48", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcitydeath_100
0Mahler18601911Kaliste2011
1Beethoven17701827Bonn1927
2Puccini18581924Lucques2024
3Shostakovich19061975Saint-Petersburg2075
\n", "
" ], "text/plain": [ " composer birth death city death_100\n", "0 Mahler 1860 1911 Kaliste 2011\n", "1 Beethoven 1770 1827 Bonn 1927\n", "2 Puccini 1858 1924 Lucques 2024\n", "3 Shostakovich 1906 1975 Saint-Petersburg 2075" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers['death_100'] = composers['death'] + 100\n", "composers" ] }, { "cell_type": "markdown", "metadata": { "id": "0Ra395dx4Qb0" }, "source": [ "### Arithmetics multiple columns\n", "\n", "We can also do maths with multiple columns as we have previously done with arrays. Just like with arrays, **operations are performed element-wise**. We can for example compute the composers' age at death:\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 544, "status": "ok", "timestamp": 1614349226184, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "ijOU4ik34qoB", "outputId": "b32ef230-f677-4b13-9257-2cd98e2882d0", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 51\n", "1 57\n", "2 66\n", "3 69\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers['death']-composers['birth']" ] }, { "cell_type": "markdown", "metadata": { "id": "WfEsNtkCZVZd" }, "source": [ "Note that the output here is a series with the correct length, so we can simply add it to our original table directly!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 531, "status": "ok", "timestamp": 1614349231334, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "Ie2iIPfNZfCf", "outputId": "d6b800ed-b07f-4a45-8c7b-13ae768f453d", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcitydeath_100age
0Mahler18601911Kaliste201151
1Beethoven17701827Bonn192757
2Puccini18581924Lucques202466
3Shostakovich19061975Saint-Petersburg207569
\n", "
" ], "text/plain": [ " composer birth death city death_100 age\n", "0 Mahler 1860 1911 Kaliste 2011 51\n", "1 Beethoven 1770 1827 Bonn 1927 57\n", "2 Puccini 1858 1924 Lucques 2024 66\n", "3 Shostakovich 1906 1975 Saint-Petersburg 2075 69" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers['age'] = composers['death']-composers['birth']\n", "composers" ] }, { "cell_type": "markdown", "metadata": { "id": "ZmvIVmKj4tjy" }, "source": [ "## Applying functions\n", "\n", "Again, just like we could previously apply a function to an array and recover an array of the same size where the function had been applied **element-wise** we can do that same with a DataFrame column and recover a column. And we can simply use Numpy functions!" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 497, "status": "ok", "timestamp": 1614349243384, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "3SPc0aEJ58QH", "outputId": "8d71ab22-75b7-4196-c4e6-72ac9fe345be", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 3.269513\n", "1 3.247973\n", "2 3.269046\n", "3 3.280123\n", "Name: birth, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.log10(composers['birth'])" ] }, { "cell_type": "markdown", "metadata": { "id": "GtTVjZyRts_w" }, "source": [ "## Applying array wide\n", "\n", "Not that you should be careful if you want to apply a function to more than one column. For example if we want to apply the above operation to the entire array:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 249 }, "executionInfo": { "elapsed": 543, "status": "error", "timestamp": 1614349303728, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "zyvq9zDzt2Jb", "outputId": "79d527ec-466d-4741-87d6-4f1f3e326c4e", "tags": [] }, "outputs": [ { "ename": "TypeError", "evalue": "loop of ufunc does not support argument 0 of type str which has no callable log method", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;31mAttributeError\u001b[0m: 'str' object has no attribute 'log'", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[8], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlog\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcomposers\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/generic.py:2113\u001b[0m, in \u001b[0;36mNDFrame.__array_ufunc__\u001b[0;34m(self, ufunc, method, *inputs, **kwargs)\u001b[0m\n\u001b[1;32m 2109\u001b[0m \u001b[38;5;129m@final\u001b[39m\n\u001b[1;32m 2110\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m__array_ufunc__\u001b[39m(\n\u001b[1;32m 2111\u001b[0m \u001b[38;5;28mself\u001b[39m, ufunc: np\u001b[38;5;241m.\u001b[39mufunc, method: \u001b[38;5;28mstr\u001b[39m, \u001b[38;5;241m*\u001b[39minputs: Any, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs: Any\n\u001b[1;32m 2112\u001b[0m ):\n\u001b[0;32m-> 2113\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marraylike\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43marray_ufunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mufunc\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/arraylike.py:410\u001b[0m, in \u001b[0;36marray_ufunc\u001b[0;34m(self, ufunc, method, *inputs, **kwargs)\u001b[0m\n\u001b[1;32m 405\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m method \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m__call__\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m kwargs:\n\u001b[1;32m 406\u001b[0m \u001b[38;5;66;03m# for np.(..) calls\u001b[39;00m\n\u001b[1;32m 407\u001b[0m \u001b[38;5;66;03m# kwargs cannot necessarily be handled block-by-block, so only\u001b[39;00m\n\u001b[1;32m 408\u001b[0m \u001b[38;5;66;03m# take this path if there are no kwargs\u001b[39;00m\n\u001b[1;32m 409\u001b[0m mgr \u001b[38;5;241m=\u001b[39m inputs[\u001b[38;5;241m0\u001b[39m]\u001b[38;5;241m.\u001b[39m_mgr\n\u001b[0;32m--> 410\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[43mmgr\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mapply\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mgetattr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mufunc\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 411\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 412\u001b[0m \u001b[38;5;66;03m# otherwise specific ufunc methods (eg np..accumulate(..))\u001b[39;00m\n\u001b[1;32m 413\u001b[0m \u001b[38;5;66;03m# Those can have an axis keyword and thus can't be called block-by-block\u001b[39;00m\n\u001b[1;32m 414\u001b[0m result \u001b[38;5;241m=\u001b[39m default_array_ufunc(inputs[\u001b[38;5;241m0\u001b[39m], ufunc, method, \u001b[38;5;241m*\u001b[39minputs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n", "File \u001b[0;32m~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/internals/managers.py:350\u001b[0m, in \u001b[0;36mBaseBlockManager.apply\u001b[0;34m(self, f, align_keys, ignore_failures, **kwargs)\u001b[0m\n\u001b[1;32m 348\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 349\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m callable(f):\n\u001b[0;32m--> 350\u001b[0m applied \u001b[38;5;241m=\u001b[39m \u001b[43mb\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mapply\u001b[49m\u001b[43m(\u001b[49m\u001b[43mf\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 351\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 352\u001b[0m applied \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mgetattr\u001b[39m(b, f)(\u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n", "File \u001b[0;32m~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/internals/blocks.py:351\u001b[0m, in \u001b[0;36mBlock.apply\u001b[0;34m(self, func, **kwargs)\u001b[0m\n\u001b[1;32m 345\u001b[0m \u001b[38;5;129m@final\u001b[39m\n\u001b[1;32m 346\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mapply\u001b[39m(\u001b[38;5;28mself\u001b[39m, func, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mlist\u001b[39m[Block]:\n\u001b[1;32m 347\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 348\u001b[0m \u001b[38;5;124;03m apply the function to my values; return a block if we are not\u001b[39;00m\n\u001b[1;32m 349\u001b[0m \u001b[38;5;124;03m one\u001b[39;00m\n\u001b[1;32m 350\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 351\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalues\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 353\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_split_op_result(result)\n", "\u001b[0;31mTypeError\u001b[0m: loop of ufunc does not support argument 0 of type str which has no callable log method" ] } ], "source": [ "np.log(composers)" ] }, { "cell_type": "markdown", "metadata": { "id": "83cz99D6t4Ox" }, "source": [ "We get an error because some columns are **not numerical** and therefore Pandas doesn't know what to do. If we only operate on numerical columns this however works:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 583, "status": "ok", "timestamp": 1614349363686, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "W0XgXednuByx", "outputId": "6547dd41-e2f8-4a73-80a8-121e04f29761", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeath
018601911
117701827
218581924
319061975
\n", "
" ], "text/plain": [ " birth death\n", "0 1860 1911\n", "1 1770 1827\n", "2 1858 1924\n", "3 1906 1975" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sub_df = composers[['birth', 'death']]\n", "sub_df" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 534, "status": "ok", "timestamp": 1614349372379, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "NAwWKMq5uGwa", "outputId": "f288b080-56ab-4864-e4f7-06ee325fd46e", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeath
03.2695133.281261
13.2479733.261739
23.2690463.284205
33.2801233.295567
\n", "
" ], "text/plain": [ " birth death\n", "0 3.269513 3.281261\n", "1 3.247973 3.261739\n", "2 3.269046 3.284205\n", "3 3.280123 3.295567" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.log10(sub_df)" ] }, { "cell_type": "markdown", "metadata": { "id": "cmW_C6DtZvqA" }, "source": [ "## The ```apply``` function\n", "\n", "There is also a more general way of applying **custom functions** using the ```apply()``` function. There are many ways of using this function: on the full DataFrame, on single columns, with a single output value, a series etc. Here we look at the simplest case, where we apply a given function to each element of a given column, thus generating a new *modified* column.\n", "\n", "First we need to define a custom function. Here we want a function to tell us whether a composer is born before or after 1900 by looking at the ```birth``` column. So we define a standard function that takes a year as input and returns True or False:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "executionInfo": { "elapsed": 630, "status": "ok", "timestamp": 1614349600530, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "UKxF4zklbSch", "tags": [] }, "outputs": [], "source": [ "def age_fun(age):\n", " if age < 1900:\n", " return False\n", " else:\n", " return True" ] }, { "cell_type": "markdown", "metadata": { "id": "MAK29LitbbT-" }, "source": [ "Let's test it:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 632, "status": "ok", "timestamp": 1614349602992, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "7_Dmazwobdvr", "outputId": "8193a749-42f6-4533-f083-06231bd4ba7e", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age_fun(1700)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 648, "status": "ok", "timestamp": 1614349603293, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "tPFgSrwBbf3E", "outputId": "6b824e2e-d1e7-47bc-d5c9-9be7898beeab", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age_fun(1950)" ] }, { "cell_type": "markdown", "metadata": { "id": "B3iXvKscbhmi" }, "source": [ "Now we simply pass this function to the ```apply``` method on a specific column:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 593, "status": "ok", "timestamp": 1614349605341, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "59R2Gyjubsu6", "outputId": "35b78608-71be-42ac-8602-f20ac12d28e5", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 True\n", "Name: birth, dtype: bool" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers['birth'].apply(age_fun)" ] }, { "cell_type": "markdown", "metadata": { "id": "tRcncZvmbxcX" }, "source": [ "Again we can directly create a new columns with this output:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 637, "status": "ok", "timestamp": 1614349608516, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "5bp_L05sb39A", "outputId": "952aefc4-56fe-444f-b6f2-c01c112bfc12", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcitydeath_100age20th century
0Mahler18601911Kaliste201151False
1Beethoven17701827Bonn192757False
2Puccini18581924Lucques202466False
3Shostakovich19061975Saint-Petersburg207569True
\n", "
" ], "text/plain": [ " composer birth death city death_100 age 20th century\n", "0 Mahler 1860 1911 Kaliste 2011 51 False\n", "1 Beethoven 1770 1827 Bonn 1927 57 False\n", "2 Puccini 1858 1924 Lucques 2024 66 False\n", "3 Shostakovich 1906 1975 Saint-Petersburg 2075 69 True" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers['20th century'] = composers['birth'].apply(age_fun)\n", "composers" ] }, { "cell_type": "markdown", "metadata": { "id": "1QIL-5Ff6Ljs" }, "source": [ "## Summarizing functions\n", "Finally, we can summarized columns by using either Numpy function or the methods attached to a Dataframe or a column. For example we can take the mean:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 479, "status": "ok", "timestamp": 1614349618875, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "fBLc13Vv6fXm", "outputId": "c34bf20c-b038-4b23-c980-ae1ee137e736", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1848.5" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(composers['birth'])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 829, "status": "ok", "timestamp": 1614349619785, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "56rd355o6hwx", "outputId": "537a25e1-c699-4da9-b2f3-1c7750e0f85f", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1848.5" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers['birth'].mean()" ] }, { "cell_type": "markdown", "metadata": { "id": "611XOtz_6kJp" }, "source": [ "## Exercise\n", "\n", "1. Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv\n", "2. Create a new columns called ```body_mass_kg``` by transforming the ```body_mass_g``` column into kilograms.\n", "3. Create a new column containiing the division the ```bill_length_mm``` by the ```bill_depth_mm``` column\n", "4. Calculate the median value of the ```body_mass_g``` column both by using a Numpy function and by using the method attached to the column. What do you observe? Can you infer what the problem is?" ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyOW6mzLgtxOfld0ibM0SUYj", "collapsed_sections": [], "name": "07-Operate_on_DataFrames.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 4 }