{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "kjwMVdVbkr0P" }, "source": [ "# 10. Combining and grouping DataFrames" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "2mM-peDakr0T", "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "id": "jAUEMdaLkr0U" }, "source": [ "Two of the most important operations one can do with DataFrames is 1) combine multiple sources of information and 2) efficiently group and summarize information. Both of these topics are vast and here again, we ony present a few important approaches." ] }, { "cell_type": "markdown", "metadata": { "id": "co9Kh2lakr0U" }, "source": [ "## Concatenation\n", "\n", "The simplest case where we need to combine multiple sources of information is if those sources are of the same \"type\", i.e. they have the same columns but different entries. Imagine for example lists of participants to an event broken in muliple files with names starting with A-K, L-S etc. In that case we can just \"glue\" datasets together or more formally **concatenate** them. Here we load to Excel sheets with composer information:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 1548, "status": "ok", "timestamp": 1614355561199, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "wnMcwQe7kr0U", "outputId": "5e3b58ad-7265-4b5f-b5e1-fe1391f55763", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Mahler 1860 1911 Kaliste\n", "Beethoven 1770 1827 Bonn\n", "Puccini 1858 1924 Lucques\n", "Shostakovich 1906 1975 Saint-Petersburg" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'\n", "composers1 = pd.read_excel(file_url, index_col='composer',sheet_name='Sheet1')\n", "composers1" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "executionInfo": { "elapsed": 1571, "status": "ok", "timestamp": 1614355561687, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "NOUxMkx4kr0V", "outputId": "8b06afc5-6ce7-4903-e393-56a3dd8a17de", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Verdi18131901Roncole
Dvorak18411904Nelahozeves
Schumann18101856Zwickau
Stravinsky18821971Oranienbaum
Mahler18601911Kaliste
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Verdi 1813 1901 Roncole\n", "Dvorak 1841 1904 Nelahozeves\n", "Schumann 1810 1856 Zwickau\n", "Stravinsky 1882 1971 Oranienbaum\n", "Mahler 1860 1911 Kaliste" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers2 = pd.read_excel('https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true', index_col='composer',sheet_name='Sheet3')\n", "composers2" ] }, { "cell_type": "markdown", "metadata": { "id": "zag-3SYukr0W" }, "source": [ "We see that we have the same information in both tables, but for different composers. We can just concetenate them using the ```pd.concat``` function and providing the two DataFrames in a list:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "kDoMbe-ekr0W", "tags": [] }, "outputs": [], "source": [ "all_composers = pd.concat([composers1,composers2])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 359 }, "executionInfo": { "elapsed": 2141, "status": "ok", "timestamp": 1614355623399, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "dhr9Mnqkkr0W", "outputId": "b3563973-9c41-4755-acbc-fa9c19e4b421", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
Verdi18131901Roncole
Dvorak18411904Nelahozeves
Schumann18101856Zwickau
Stravinsky18821971Oranienbaum
Mahler18601911Kaliste
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Mahler 1860 1911 Kaliste\n", "Beethoven 1770 1827 Bonn\n", "Puccini 1858 1924 Lucques\n", "Shostakovich 1906 1975 Saint-Petersburg\n", "Verdi 1813 1901 Roncole\n", "Dvorak 1841 1904 Nelahozeves\n", "Schumann 1810 1856 Zwickau\n", "Stravinsky 1882 1971 Oranienbaum\n", "Mahler 1860 1911 Kaliste" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_composers" ] }, { "cell_type": "markdown", "metadata": { "id": "sa-zZEs2kr0W" }, "source": [ "One potential problem is that two tables contain duplicated information, like here the Mahler entry. Pandas offers a lot of useful functions to clean-up data, for example ```drop_duplicates()```:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 328 }, "executionInfo": { "elapsed": 929, "status": "ok", "timestamp": 1614355632075, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "LuHWdzRouqv0", "outputId": "67f247f2-0bdd-435d-9cf1-2f2275182c52", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
Verdi18131901Roncole
Dvorak18411904Nelahozeves
Schumann18101856Zwickau
Stravinsky18821971Oranienbaum
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Mahler 1860 1911 Kaliste\n", "Beethoven 1770 1827 Bonn\n", "Puccini 1858 1924 Lucques\n", "Shostakovich 1906 1975 Saint-Petersburg\n", "Verdi 1813 1901 Roncole\n", "Dvorak 1841 1904 Nelahozeves\n", "Schumann 1810 1856 Zwickau\n", "Stravinsky 1882 1971 Oranienbaum" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_composers.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": { "id": "RCNQd-DTkr0Y" }, "source": [ "## Joining two tables" ] }, { "cell_type": "markdown", "metadata": { "id": "qHbJD2FWkr0Y" }, "source": [ "A slightly more complex case is if we have two datasets with (almost) the same items, in our case composers, but with different information that we want to combine. For example:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 535, "status": "ok", "timestamp": 1614355658781, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "lKODXCAOkr0Y", "outputId": "a46530b4-d3ff-409e-c06d-5bd3821ef354", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Mahler 1860 1911 Kaliste\n", "Beethoven 1770 1827 Bonn\n", "Puccini 1858 1924 Lucques\n", "Shostakovich 1906 1975 Saint-Petersburg" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1 = pd.read_excel(file_url, index_col='composer',sheet_name='Sheet1')\n", "composers1" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 529, "status": "ok", "timestamp": 1614355660084, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "vnbi4f3Pkr0Y", "outputId": "58ed7870-59e9-4db1-9960-b694d9aa5560", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first name
composer
MahlerGustav
BeethovenLudwig van
PucciniGiacomo
BrahmsJohannes
\n", "
" ], "text/plain": [ " first name\n", "composer \n", "Mahler Gustav\n", "Beethoven Ludwig van\n", "Puccini Giacomo\n", "Brahms Johannes" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers2 = pd.read_excel(file_url, index_col='composer',sheet_name='Sheet4')\n", "composers2" ] }, { "cell_type": "markdown", "metadata": { "id": "RcxqfLPwkr0Y" }, "source": [ "We cannot concatenate the tables as they have different columns. However we can go through both tables and **merge for each item the information of both tables**. Note that each table has **one** element that doesn't appear in the other table (Shostakovitch and Brahms).\n", "\n", "We can do this merging operation with the ```join``` function, which one uses like this ```left_table.join(right_table)```: \n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 484, "status": "ok", "timestamp": 1614355699842, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "I-IsWWFxkr0Z", "outputId": "25d77a1b-0025-42ee-eed6-767213672c9e", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcityfirst name
composer
Mahler18601911KalisteGustav
Beethoven17701827BonnLudwig van
Puccini18581924LucquesGiacomo
Shostakovich19061975Saint-PetersburgNaN
\n", "
" ], "text/plain": [ " birth death city first name\n", "composer \n", "Mahler 1860 1911 Kaliste Gustav\n", "Beethoven 1770 1827 Bonn Ludwig van\n", "Puccini 1858 1924 Lucques Giacomo\n", "Shostakovich 1906 1975 Saint-Petersburg NaN" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1.join(composers2)" ] }, { "cell_type": "markdown", "metadata": { "id": "fxt8LU4Kkr0a" }, "source": [ "We see that:\n", "1. Pandas used the index to merge all elements\n", "2. Brahms which was missing in ```composers1``` is absent\n", "3. The first name of Shostakovitch is missing since that composer was absent in ```composers2```.\n", "\n", "In other terms, ```join``` used all the elements of the ```left``` table and only those matching from the ```right``` table. This is just a default that can be overriden using the ```how``` option.\n", "\n", "In total we have four possibilities as illustrated here:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 516 }, "executionInfo": { "elapsed": 581, "status": "ok", "timestamp": 1614531167787, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "zDe-n9GXjiG3", "outputId": "98f43869-5ad1-4409-e824-39d8c73b3d89", "tags": [] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import Image\n", "Image(url='https://github.com/guiwitz/ISDAwPython_day2/raw/master/images/left_right.jpeg',width=700)" ] }, { "cell_type": "markdown", "metadata": { "id": "txardHSHjkJO" }, "source": [ "1. 
keep all elements of the ```rigth``` table and complete with the matching ones from the ```left``` table:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 547, "status": "ok", "timestamp": 1614355732809, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "MprjZbzZkr0b", "outputId": "8f173252-67ce-4fbf-bd9f-772f63d7712f", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcityfirst name
composer
Mahler1860.01911.0KalisteGustav
Beethoven1770.01827.0BonnLudwig van
Puccini1858.01924.0LucquesGiacomo
BrahmsNaNNaNNaNJohannes
\n", "
" ], "text/plain": [ " birth death city first name\n", "composer \n", "Mahler 1860.0 1911.0 Kaliste Gustav\n", "Beethoven 1770.0 1827.0 Bonn Ludwig van\n", "Puccini 1858.0 1924.0 Lucques Giacomo\n", "Brahms NaN NaN NaN Johannes" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1.join(composers2, how = 'right')" ] }, { "cell_type": "markdown", "metadata": { "id": "lwrHxbp3xuPh" }, "source": [ "2. Keep all elements of both tables:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "executionInfo": { "elapsed": 491, "status": "ok", "timestamp": 1614355736661, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "_9kWG9SWkr0c", "outputId": "be2a2725-24dc-409a-9a8d-69d122058bfa", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcityfirst name
composer
Beethoven1770.01827.0BonnLudwig van
BrahmsNaNNaNNaNJohannes
Mahler1860.01911.0KalisteGustav
Puccini1858.01924.0LucquesGiacomo
Shostakovich1906.01975.0Saint-PetersburgNaN
\n", "
" ], "text/plain": [ " birth death city first name\n", "composer \n", "Beethoven 1770.0 1827.0 Bonn Ludwig van\n", "Brahms NaN NaN NaN Johannes\n", "Mahler 1860.0 1911.0 Kaliste Gustav\n", "Puccini 1858.0 1924.0 Lucques Giacomo\n", "Shostakovich 1906.0 1975.0 Saint-Petersburg NaN" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1.join(composers2, how = 'outer')" ] }, { "cell_type": "markdown", "metadata": { "id": "ArqBxRgSyHrl" }, "source": [ "3. Keep **only** elements present in **both tables**:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 543, "status": "ok", "timestamp": 1614355739788, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "-E5_z4Idkr0c", "outputId": "daf26cf9-a6b3-4ecd-84b8-a96fa4bd1427", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcityfirst name
composer
Mahler18601911KalisteGustav
Beethoven17701827BonnLudwig van
Puccini18581924LucquesGiacomo
\n", "
" ], "text/plain": [ " birth death city first name\n", "composer \n", "Mahler 1860 1911 Kaliste Gustav\n", "Beethoven 1770 1827 Bonn Ludwig van\n", "Puccini 1858 1924 Lucques Giacomo" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1.join(composers2, how = 'inner')" ] }, { "cell_type": "markdown", "metadata": { "id": "4gcWC82rkr0c" }, "source": [ "## More general approach\n", "\n", "Here we used the index to know which items to combined together, but more generally one can use **any** column for merging. The ```pd.merge``` function offers more flexibility in this regard. We whos only one example here:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "v_1PE2ANkr0d", "tags": [] }, "outputs": [], "source": [ "composers1 = pd.read_excel(file_url, sheet_name='Sheet1')\n", "composers2 = pd.read_excel(file_url, sheet_name='Sheet6')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 783, "status": "ok", "timestamp": 1614355747801, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "lN-GpM95kr0d", "outputId": "9489f158-966d-45d7-da80-1847c67d97e4", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcity
0Mahler18601911Kaliste
1Beethoven17701827Bonn
2Puccini18581924Lucques
3Shostakovich19061975Saint-Petersburg
\n", "
" ], "text/plain": [ " composer birth death city\n", "0 Mahler 1860 1911 Kaliste\n", "1 Beethoven 1770 1827 Bonn\n", "2 Puccini 1858 1924 Lucques\n", "3 Shostakovich 1906 1975 Saint-Petersburg" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 473, "status": "ok", "timestamp": 1614355749559, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "03-VQTJmkr0d", "outputId": "3a926031-a238-4042-b903-b0552987b30f", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
last namefirst name
0PucciniGiacomo
1BeethovenLudwig van
2BrahmsJohannes
3MahlerGustav
\n", "
" ], "text/plain": [ " last name first name\n", "0 Puccini Giacomo\n", "1 Beethoven Ludwig van\n", "2 Brahms Johannes\n", "3 Mahler Gustav" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers2" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 481, "status": "ok", "timestamp": 1614355762515, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "2-_TN1rGkr0d", "outputId": "df150c99-b0b3-49c7-e356-8bd076513c8e", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcitylast namefirst name
0Mahler1860.01911.0KalisteMahlerGustav
1Beethoven1770.01827.0BonnBeethovenLudwig van
2Puccini1858.01924.0LucquesPucciniGiacomo
3Shostakovich1906.01975.0Saint-PetersburgNaNNaN
4NaNNaNNaNNaNBrahmsJohannes
\n", "
" ], "text/plain": [ " composer birth death city last name first name\n", "0 Mahler 1860.0 1911.0 Kaliste Mahler Gustav\n", "1 Beethoven 1770.0 1827.0 Bonn Beethoven Ludwig van\n", "2 Puccini 1858.0 1924.0 Lucques Puccini Giacomo\n", "3 Shostakovich 1906.0 1975.0 Saint-Petersburg NaN NaN\n", "4 NaN NaN NaN NaN Brahms Johannes" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(composers1, composers2, left_on='composer', right_on='last name', how='outer')" ] }, { "cell_type": "markdown", "metadata": { "id": "paXowyLGzBeE" }, "source": [ "The concepts are the same as in the ```join()``` function. The first table (```composers1```) is the ```left``` table, the second one (```composers2```) is the ```right``` table. And then we specify in ```left_on``` and ```right_on``` which columns to use for the merging. Finally, the ```how```option is the same." ] }, { "cell_type": "markdown", "metadata": { "id": "8Sld_Uwk_2Pc" }, "source": [ "## Grouping\n", "\n", "Very often when you want to calculate some statistics on a dataset you need to group some of the data. Let's look at our penguin dataset:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "s7qclqKVAibe", "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "executionInfo": { "elapsed": 511, "status": "ok", "timestamp": 1614454936929, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "360sD8XQAwK7", "outputId": "4faf0e50-5894-41ab-dc43-2d37910f6911", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
0AdelieTorgersen39.118.7181.03750.0male2007
1AdelieTorgersen39.517.4186.03800.0female2007
2AdelieTorgersen40.318.0195.03250.0female2007
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "\n", " body_mass_g sex year \n", "0 3750.0 male 2007 \n", "1 3800.0 female 2007 \n", "2 3250.0 female 2007 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.head(3)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 462, "status": "ok", "timestamp": 1614454947104, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "-sdCTkOqA1vE", "outputId": "d3942ef9-78ce-47ab-ed61-1b8a970b8e01", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.species.unique()" ] }, { "cell_type": "markdown", "metadata": { "id": "-IL1_PTCA4Hc" }, "source": [ "We see that we have data for three different species of Penguins. What if we want to have the average weight of penguins by species ? 
One solution would be to make a for loop e.g.:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "LmVe67dSBF9h", "tags": [] }, "outputs": [], "source": [ "weights = []\n", "for species in penguins.species.unique():\n", " mean_w = penguins.loc[penguins['species'] == species,'body_mass_g'].mean()\n", " weights.append(mean_w)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 549, "status": "ok", "timestamp": 1614455133162, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "BH2vkmsLBkgc", "outputId": "02d44557-e42a-4c66-9b12-ca00fbc18212", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[3700.662251655629, 5076.016260162602, 3733.0882352941176]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weights" ] }, { "cell_type": "markdown", "metadata": { "id": "QkTqIgE0Bldd" }, "source": [ "Pandas offers a much simpler solution for this kind of operation with the ```.groupby``` function. 
Here we only specifiy which columns should be used to form groups and it does the work for us:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "id": "nL4XrwfHB3N2", "tags": [] }, "outputs": [], "source": [ "grouped_penguins = penguins.groupby('species')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 470, "status": "ok", "timestamp": 1614455225637, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "H2Zc_7LiB64Q", "outputId": "faa62d51-2830-45ab-f46c-2350a09269f1", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_penguins" ] }, { "cell_type": "markdown", "metadata": { "id": "3WVHs0qmB79x" }, "source": [ "The output is not directly visible. It's a collection of tables grouped according to species. Many functions allow us to get one of these subgroups, e.g.:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 419 }, "executionInfo": { "elapsed": 542, "status": "ok", "timestamp": 1614455280029, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "dpOuhFV5CF8F", "outputId": "99d914c1-009b-4dd2-86e2-7bdc6d398646", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
0AdelieTorgersen39.118.7181.03750.0male2007
1AdelieTorgersen39.517.4186.03800.0female2007
2AdelieTorgersen40.318.0195.03250.0female2007
3AdelieTorgersenNaNNaNNaNNaNNaN2007
4AdelieTorgersen36.719.3193.03450.0female2007
...........................
147AdelieDream36.618.4184.03475.0female2009
148AdelieDream36.017.8195.03450.0female2009
149AdelieDream37.818.1193.03750.0male2009
150AdelieDream36.017.1187.03700.0female2009
151AdelieDream41.518.5201.04000.0male2009
\n", "

152 rows × 8 columns

\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "3 Adelie Torgersen NaN NaN NaN \n", "4 Adelie Torgersen 36.7 19.3 193.0 \n", ".. ... ... ... ... ... \n", "147 Adelie Dream 36.6 18.4 184.0 \n", "148 Adelie Dream 36.0 17.8 195.0 \n", "149 Adelie Dream 37.8 18.1 193.0 \n", "150 Adelie Dream 36.0 17.1 187.0 \n", "151 Adelie Dream 41.5 18.5 201.0 \n", "\n", " body_mass_g sex year \n", "0 3750.0 male 2007 \n", "1 3800.0 female 2007 \n", "2 3250.0 female 2007 \n", "3 NaN NaN 2007 \n", "4 3450.0 female 2007 \n", ".. ... ... ... \n", "147 3475.0 female 2009 \n", "148 3450.0 female 2009 \n", "149 3750.0 male 2009 \n", "150 3700.0 female 2009 \n", "151 4000.0 male 2009 \n", "\n", "[152 rows x 8 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_penguins.get_group('Adelie')" ] }, { "cell_type": "markdown", "metadata": { "id": "Ao4wCb6eCJut" }, "source": [ "However we wont go here through the details of the group data structure. What we can simply do is magically apply functions directly on the grouped object. 
For example to get the mean:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 567, "status": "ok", "timestamp": 1614455346981, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "2DYR5mgwCXkv", "outputId": "2647df41-d8f2-4793-9f7c-4bd32cf37bc5", "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/mk/632_7fgs4v374qc935pvf9v00000gn/T/ipykernel_95390/491459327.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n", " grouped_penguins.mean()\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bill_length_mmbill_depth_mmflipper_length_mmbody_mass_gyear
species
Adelie38.79139118.346358189.9536423700.6622522008.013158
Chinstrap48.83382418.420588195.8235293733.0882352007.970588
Gentoo47.50487814.982114217.1869925076.0162602008.080645
\n", "
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n", "species \n", "Adelie 38.791391 18.346358 189.953642 3700.662252 \n", "Chinstrap 48.833824 18.420588 195.823529 3733.088235 \n", "Gentoo 47.504878 14.982114 217.186992 5076.016260 \n", "\n", " year \n", "species \n", "Adelie 2008.013158 \n", "Chinstrap 2007.970588 \n", "Gentoo 2008.080645 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_penguins.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a good example of a warning for how we are supposed to use a function. Here the behavior is changing and the developers are warning us of a future change that might break our code. We can try to fix it:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bill_length_mmbill_depth_mmflipper_length_mmbody_mass_gyear
species
Adelie38.79139118.346358189.9536423700.6622522008.013158
Chinstrap48.83382418.420588195.8235293733.0882352007.970588
Gentoo47.50487814.982114217.1869925076.0162602008.080645
\n", "
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n", "species \n", "Adelie 38.791391 18.346358 189.953642 3700.662252 \n", "Chinstrap 48.833824 18.420588 195.823529 3733.088235 \n", "Gentoo 47.504878 14.982114 217.186992 5076.016260 \n", "\n", " year \n", "species \n", "Adelie 2008.013158 \n", "Chinstrap 2007.970588 \n", "Gentoo 2008.080645 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_penguins.mean(numeric_only=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "QU-8oVg-CZpG" }, "source": [ "As you can see Pandas automatically computes the mean for each category and each column. The output then is a DataFrame where each line corresponds to a given category. One can push this further by using e.g. multiple columns for grouping, but this goes beyond the present course." ] }, { "cell_type": "markdown", "metadata": { "id": "H2ZRaOyyzYII" }, "source": [ "## Exercises\n", "\n", "1. Load a first DataFrame that contains information about covid19 vaccination locations: https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/locations.csv \n", "2. Load a second DataFrame which contains daily information about vaccination for every country: https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/vaccinations.csv\n", "3. Create a sub-dataframe of the table in (1) that only contains the fields ```location``` and ```vaccines```.\n", "4. Merge the two tables using the ```location``` field as key for merging. Use once ```left``` merge and once ```right``` marge.\n", "5. Do you see a difference in the result?\n", "5b. Bonus question: knowing that you can use the ```.unique()``` method on columns, and that there's a function called ```np.isin``` that compares two lists, can you figure out which country is missing ?\n", "6. 
Using the table from (2), group the data by ```location``` and calculate the average number of ```daily_vaccinations```." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "II0smv8RPQTc" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "09-Pandas_combine.ipynb", "provenance": [ { "file_id": "https://github.com/guiwitz/NumpyPandas_course/blob/colab/10-DA_Pandas_combine.ipynb", "timestamp": 1614296569460 } ], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 4 }