{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "kjwMVdVbkr0P" }, "source": [ "# 10. Combining and grouping DataFrames" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "2mM-peDakr0T", "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "id": "jAUEMdaLkr0U" }, "source": [ "Two of the most important operations one can do with DataFrames is 1) combine multiple sources of information and 2) efficiently group and summarize information. Both of these topics are vast and here again, we ony present a few important approaches." ] }, { "cell_type": "markdown", "metadata": { "id": "co9Kh2lakr0U" }, "source": [ "## Concatenation\n", "\n", "The simplest case where we need to combine multiple sources of information is if those sources are of the same \"type\", i.e. they have the same columns but different entries. Imagine for example lists of participants to an event broken in muliple files with names starting with A-K, L-S etc. In that case we can just \"glue\" datasets together or more formally **concatenate** them. Here we load to Excel sheets with composer information:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 1548, "status": "ok", "timestamp": 1614355561199, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "wnMcwQe7kr0U", "outputId": "5e3b58ad-7265-4b5f-b5e1-fe1391f55763", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Mahler 1860 1911 Kaliste\n", "Beethoven 1770 1827 Bonn\n", "Puccini 1858 1924 Lucques\n", "Shostakovich 1906 1975 Saint-Petersburg" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'\n", "composers1 = pd.read_excel(file_url, index_col='composer',sheet_name='Sheet1')\n", "composers1" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "executionInfo": { "elapsed": 1571, "status": "ok", "timestamp": 1614355561687, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "NOUxMkx4kr0V", "outputId": "8b06afc5-6ce7-4903-e393-56a3dd8a17de", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Verdi18131901Roncole
Dvorak18411904Nelahozeves
Schumann18101856Zwickau
Stravinsky18821971Oranienbaum
Mahler18601911Kaliste
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Verdi 1813 1901 Roncole\n", "Dvorak 1841 1904 Nelahozeves\n", "Schumann 1810 1856 Zwickau\n", "Stravinsky 1882 1971 Oranienbaum\n", "Mahler 1860 1911 Kaliste" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers2 = pd.read_excel('https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true', index_col='composer',sheet_name='Sheet3')\n", "composers2" ] }, { "cell_type": "markdown", "metadata": { "id": "zag-3SYukr0W" }, "source": [ "We see that we have the same information in both tables, but for different composers. We can just concetenate them using the ```pd.concat``` function and providing the two DataFrames in a list:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "kDoMbe-ekr0W", "tags": [] }, "outputs": [], "source": [ "all_composers = pd.concat([composers1,composers2])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 359 }, "executionInfo": { "elapsed": 2141, "status": "ok", "timestamp": 1614355623399, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "dhr9Mnqkkr0W", "outputId": "b3563973-9c41-4755-acbc-fa9c19e4b421", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
Verdi18131901Roncole
Dvorak18411904Nelahozeves
Schumann18101856Zwickau
Stravinsky18821971Oranienbaum
Mahler18601911Kaliste
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Mahler 1860 1911 Kaliste\n", "Beethoven 1770 1827 Bonn\n", "Puccini 1858 1924 Lucques\n", "Shostakovich 1906 1975 Saint-Petersburg\n", "Verdi 1813 1901 Roncole\n", "Dvorak 1841 1904 Nelahozeves\n", "Schumann 1810 1856 Zwickau\n", "Stravinsky 1882 1971 Oranienbaum\n", "Mahler 1860 1911 Kaliste" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_composers" ] }, { "cell_type": "markdown", "metadata": { "id": "sa-zZEs2kr0W" }, "source": [ "One potential problem is that two tables contain duplicated information, like here the Mahler entry. Pandas offers a lot of useful functions to clean-up data, for example ```drop_duplicates()```:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 328 }, "executionInfo": { "elapsed": 929, "status": "ok", "timestamp": 1614355632075, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "LuHWdzRouqv0", "outputId": "67f247f2-0bdd-435d-9cf1-2f2275182c52", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
Verdi18131901Roncole
Dvorak18411904Nelahozeves
Schumann18101856Zwickau
Stravinsky18821971Oranienbaum
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Mahler 1860 1911 Kaliste\n", "Beethoven 1770 1827 Bonn\n", "Puccini 1858 1924 Lucques\n", "Shostakovich 1906 1975 Saint-Petersburg\n", "Verdi 1813 1901 Roncole\n", "Dvorak 1841 1904 Nelahozeves\n", "Schumann 1810 1856 Zwickau\n", "Stravinsky 1882 1971 Oranienbaum" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_composers.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": { "id": "RCNQd-DTkr0Y" }, "source": [ "## Joining two tables" ] }, { "cell_type": "markdown", "metadata": { "id": "qHbJD2FWkr0Y" }, "source": [ "A slightly more complex case is if we have two datasets with (almost) the same items, in our case composers, but with different information that we want to combine. For example:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 535, "status": "ok", "timestamp": 1614355658781, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "lKODXCAOkr0Y", "outputId": "a46530b4-d3ff-409e-c06d-5bd3821ef354", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcity
composer
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
\n", "
" ], "text/plain": [ " birth death city\n", "composer \n", "Mahler 1860 1911 Kaliste\n", "Beethoven 1770 1827 Bonn\n", "Puccini 1858 1924 Lucques\n", "Shostakovich 1906 1975 Saint-Petersburg" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1 = pd.read_excel(file_url, index_col='composer',sheet_name='Sheet1')\n", "composers1" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 529, "status": "ok", "timestamp": 1614355660084, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "vnbi4f3Pkr0Y", "outputId": "58ed7870-59e9-4db1-9960-b694d9aa5560", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first name
composer
MahlerGustav
BeethovenLudwig van
PucciniGiacomo
BrahmsJohannes
\n", "
" ], "text/plain": [ " first name\n", "composer \n", "Mahler Gustav\n", "Beethoven Ludwig van\n", "Puccini Giacomo\n", "Brahms Johannes" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers2 = pd.read_excel(file_url, index_col='composer',sheet_name='Sheet4')\n", "composers2" ] }, { "cell_type": "markdown", "metadata": { "id": "RcxqfLPwkr0Y" }, "source": [ "We cannot concatenate the tables as they have different columns. However we can go through both tables and **merge for each item the information of both tables**. Note that each table has **one** element that doesn't appear in the other table (Shostakovitch and Brahms).\n", "\n", "We can do this merging operation with the ```join``` function, which one uses like this ```left_table.join(right_table)```: \n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 484, "status": "ok", "timestamp": 1614355699842, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "I-IsWWFxkr0Z", "outputId": "25d77a1b-0025-42ee-eed6-767213672c9e", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcityfirst name
composer
Mahler18601911KalisteGustav
Beethoven17701827BonnLudwig van
Puccini18581924LucquesGiacomo
Shostakovich19061975Saint-PetersburgNaN
\n", "
" ], "text/plain": [ " birth death city first name\n", "composer \n", "Mahler 1860 1911 Kaliste Gustav\n", "Beethoven 1770 1827 Bonn Ludwig van\n", "Puccini 1858 1924 Lucques Giacomo\n", "Shostakovich 1906 1975 Saint-Petersburg NaN" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1.join(composers2)" ] }, { "cell_type": "markdown", "metadata": { "id": "fxt8LU4Kkr0a" }, "source": [ "We see that:\n", "1. Pandas used the index to merge all elements\n", "2. Brahms which was missing in ```composers1``` is absent\n", "3. The first name of Shostakovitch is missing since that composer was absent in ```composers2```.\n", "\n", "In other terms, ```join``` used all the elements of the ```left``` table and only those matching from the ```right``` table. This is just a default that can be overriden using the ```how``` option.\n", "\n", "In total we have four possibilities as illustrated here:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 516 }, "executionInfo": { "elapsed": 581, "status": "ok", "timestamp": 1614531167787, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "zDe-n9GXjiG3", "outputId": "98f43869-5ad1-4409-e824-39d8c73b3d89", "tags": [] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import Image\n", "Image(url='https://github.com/guiwitz/ISDAwPython_day2/raw/master/images/left_right.jpeg',width=700)" ] }, { "cell_type": "markdown", "metadata": { "id": "txardHSHjkJO" }, "source": [ "1. 
keep all elements of the ```rigth``` table and complete with the matching ones from the ```left``` table:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 547, "status": "ok", "timestamp": 1614355732809, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "MprjZbzZkr0b", "outputId": "8f173252-67ce-4fbf-bd9f-772f63d7712f", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcityfirst name
composer
Mahler1860.01911.0KalisteGustav
Beethoven1770.01827.0BonnLudwig van
Puccini1858.01924.0LucquesGiacomo
BrahmsNaNNaNNaNJohannes
\n", "
" ], "text/plain": [ " birth death city first name\n", "composer \n", "Mahler 1860.0 1911.0 Kaliste Gustav\n", "Beethoven 1770.0 1827.0 Bonn Ludwig van\n", "Puccini 1858.0 1924.0 Lucques Giacomo\n", "Brahms NaN NaN NaN Johannes" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1.join(composers2, how = 'right')" ] }, { "cell_type": "markdown", "metadata": { "id": "lwrHxbp3xuPh" }, "source": [ "2. Keep all elements of both tables:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "executionInfo": { "elapsed": 491, "status": "ok", "timestamp": 1614355736661, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "_9kWG9SWkr0c", "outputId": "be2a2725-24dc-409a-9a8d-69d122058bfa", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcityfirst name
composer
Beethoven1770.01827.0BonnLudwig van
BrahmsNaNNaNNaNJohannes
Mahler1860.01911.0KalisteGustav
Puccini1858.01924.0LucquesGiacomo
Shostakovich1906.01975.0Saint-PetersburgNaN
\n", "
" ], "text/plain": [ " birth death city first name\n", "composer \n", "Beethoven 1770.0 1827.0 Bonn Ludwig van\n", "Brahms NaN NaN NaN Johannes\n", "Mahler 1860.0 1911.0 Kaliste Gustav\n", "Puccini 1858.0 1924.0 Lucques Giacomo\n", "Shostakovich 1906.0 1975.0 Saint-Petersburg NaN" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1.join(composers2, how = 'outer')" ] }, { "cell_type": "markdown", "metadata": { "id": "ArqBxRgSyHrl" }, "source": [ "3. Keep **only** elements present in **both tables**:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 543, "status": "ok", "timestamp": 1614355739788, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "-E5_z4Idkr0c", "outputId": "daf26cf9-a6b3-4ecd-84b8-a96fa4bd1427", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthdeathcityfirst name
composer
Mahler18601911KalisteGustav
Beethoven17701827BonnLudwig van
Puccini18581924LucquesGiacomo
\n", "
" ], "text/plain": [ " birth death city first name\n", "composer \n", "Mahler 1860 1911 Kaliste Gustav\n", "Beethoven 1770 1827 Bonn Ludwig van\n", "Puccini 1858 1924 Lucques Giacomo" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1.join(composers2, how = 'inner')" ] }, { "cell_type": "markdown", "metadata": { "id": "4gcWC82rkr0c" }, "source": [ "## More general approach\n", "\n", "Here we used the index to know which items to combined together, but more generally one can use **any** column for merging. The ```pd.merge``` function offers more flexibility in this regard. We whos only one example here:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "v_1PE2ANkr0d", "tags": [] }, "outputs": [], "source": [ "composers1 = pd.read_excel(file_url, sheet_name='Sheet1')\n", "composers2 = pd.read_excel(file_url, sheet_name='Sheet6')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 783, "status": "ok", "timestamp": 1614355747801, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "lN-GpM95kr0d", "outputId": "9489f158-966d-45d7-da80-1847c67d97e4", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcity
0Mahler18601911Kaliste
1Beethoven17701827Bonn
2Puccini18581924Lucques
3Shostakovich19061975Saint-Petersburg
\n", "
" ], "text/plain": [ " composer birth death city\n", "0 Mahler 1860 1911 Kaliste\n", "1 Beethoven 1770 1827 Bonn\n", "2 Puccini 1858 1924 Lucques\n", "3 Shostakovich 1906 1975 Saint-Petersburg" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers1" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 473, "status": "ok", "timestamp": 1614355749559, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "03-VQTJmkr0d", "outputId": "3a926031-a238-4042-b903-b0552987b30f", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
last namefirst name
0PucciniGiacomo
1BeethovenLudwig van
2BrahmsJohannes
3MahlerGustav
\n", "
" ], "text/plain": [ " last name first name\n", "0 Puccini Giacomo\n", "1 Beethoven Ludwig van\n", "2 Brahms Johannes\n", "3 Mahler Gustav" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composers2" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 481, "status": "ok", "timestamp": 1614355762515, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "2-_TN1rGkr0d", "outputId": "df150c99-b0b3-49c7-e356-8bd076513c8e", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
composerbirthdeathcitylast namefirst name
0Mahler1860.01911.0KalisteMahlerGustav
1Beethoven1770.01827.0BonnBeethovenLudwig van
2Puccini1858.01924.0LucquesPucciniGiacomo
3Shostakovich1906.01975.0Saint-PetersburgNaNNaN
4NaNNaNNaNNaNBrahmsJohannes
\n", "
" ], "text/plain": [ " composer birth death city last name first name\n", "0 Mahler 1860.0 1911.0 Kaliste Mahler Gustav\n", "1 Beethoven 1770.0 1827.0 Bonn Beethoven Ludwig van\n", "2 Puccini 1858.0 1924.0 Lucques Puccini Giacomo\n", "3 Shostakovich 1906.0 1975.0 Saint-Petersburg NaN NaN\n", "4 NaN NaN NaN NaN Brahms Johannes" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(composers1, composers2, left_on='composer', right_on='last name', how='outer')" ] }, { "cell_type": "markdown", "metadata": { "id": "paXowyLGzBeE" }, "source": [ "The concepts are the same as in the ```join()``` function. The first table (```composers1```) is the ```left``` table, the second one (```composers2```) is the ```right``` table. And then we specify in ```left_on``` and ```right_on``` which columns to use for the merging. Finally, the ```how```option is the same." ] }, { "cell_type": "markdown", "metadata": { "id": "8Sld_Uwk_2Pc" }, "source": [ "## Grouping\n", "\n", "Very often when you want to calculate some statistics on a dataset you need to group some of the data. Let's look at our penguin dataset:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "s7qclqKVAibe", "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "executionInfo": { "elapsed": 511, "status": "ok", "timestamp": 1614454936929, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "360sD8XQAwK7", "outputId": "4faf0e50-5894-41ab-dc43-2d37910f6911", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
0AdelieTorgersen39.118.7181.03750.0male2007
1AdelieTorgersen39.517.4186.03800.0female2007
2AdelieTorgersen40.318.0195.03250.0female2007
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "\n", " body_mass_g sex year \n", "0 3750.0 male 2007 \n", "1 3800.0 female 2007 \n", "2 3250.0 female 2007 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.head(3)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 462, "status": "ok", "timestamp": 1614454947104, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "-sdCTkOqA1vE", "outputId": "d3942ef9-78ce-47ab-ed61-1b8a970b8e01", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.species.unique()" ] }, { "cell_type": "markdown", "metadata": { "id": "-IL1_PTCA4Hc" }, "source": [ "We see that we have data for three different species of Penguins. What if we want to have the average weight of penguins by species ? 
One solution would be to make a for loop e.g.:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "LmVe67dSBF9h", "tags": [] }, "outputs": [], "source": [ "weights = []\n", "for species in penguins.species.unique():\n", " mean_w = penguins.loc[penguins['species'] == species,'body_mass_g'].mean()\n", " weights.append(mean_w)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 549, "status": "ok", "timestamp": 1614455133162, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "BH2vkmsLBkgc", "outputId": "02d44557-e42a-4c66-9b12-ca00fbc18212", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[3700.662251655629, 5076.016260162602, 3733.0882352941176]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weights" ] }, { "cell_type": "markdown", "metadata": { "id": "QkTqIgE0Bldd" }, "source": [ "Pandas offers a much simpler solution for this kind of operation with the ```.groupby``` function. 
Here we only specifiy which columns should be used to form groups and it does the work for us:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "id": "nL4XrwfHB3N2", "tags": [] }, "outputs": [], "source": [ "grouped_penguins = penguins.groupby('species')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 470, "status": "ok", "timestamp": 1614455225637, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "H2Zc_7LiB64Q", "outputId": "faa62d51-2830-45ab-f46c-2350a09269f1", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_penguins" ] }, { "cell_type": "markdown", "metadata": { "id": "3WVHs0qmB79x" }, "source": [ "The output is not directly visible. It's a collection of tables grouped according to species. Many functions allow us to get one of these subgroups, e.g.:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 419 }, "executionInfo": { "elapsed": 542, "status": "ok", "timestamp": 1614455280029, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "dpOuhFV5CF8F", "outputId": "99d914c1-009b-4dd2-86e2-7bdc6d398646", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
0AdelieTorgersen39.118.7181.03750.0male2007
1AdelieTorgersen39.517.4186.03800.0female2007
2AdelieTorgersen40.318.0195.03250.0female2007
3AdelieTorgersenNaNNaNNaNNaNNaN2007
4AdelieTorgersen36.719.3193.03450.0female2007
...........................
147AdelieDream36.618.4184.03475.0female2009
148AdelieDream36.017.8195.03450.0female2009
149AdelieDream37.818.1193.03750.0male2009
150AdelieDream36.017.1187.03700.0female2009
151AdelieDream41.518.5201.04000.0male2009
\n", "

152 rows × 8 columns

\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "3 Adelie Torgersen NaN NaN NaN \n", "4 Adelie Torgersen 36.7 19.3 193.0 \n", ".. ... ... ... ... ... \n", "147 Adelie Dream 36.6 18.4 184.0 \n", "148 Adelie Dream 36.0 17.8 195.0 \n", "149 Adelie Dream 37.8 18.1 193.0 \n", "150 Adelie Dream 36.0 17.1 187.0 \n", "151 Adelie Dream 41.5 18.5 201.0 \n", "\n", " body_mass_g sex year \n", "0 3750.0 male 2007 \n", "1 3800.0 female 2007 \n", "2 3250.0 female 2007 \n", "3 NaN NaN 2007 \n", "4 3450.0 female 2007 \n", ".. ... ... ... \n", "147 3475.0 female 2009 \n", "148 3450.0 female 2009 \n", "149 3750.0 male 2009 \n", "150 3700.0 female 2009 \n", "151 4000.0 male 2009 \n", "\n", "[152 rows x 8 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_penguins.get_group('Adelie')" ] }, { "cell_type": "markdown", "metadata": { "id": "Ao4wCb6eCJut" }, "source": [ "However we wont go here through the details of the group data structure. What we can simply do is magically apply functions directly on the grouped object. 
For example to get the mean:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "executionInfo": { "elapsed": 567, "status": "ok", "timestamp": 1614455346981, "user": { "displayName": "Guillaume Witz", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GgT0K2JVYzEsjzsS5nhkUVjUrSIJ5jHzXnBoYrmVf8=s64", "userId": "16033870147214403532" }, "user_tz": -60 }, "id": "2DYR5mgwCXkv", "outputId": "2647df41-d8f2-4793-9f7c-4bd32cf37bc5", "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/mk/632_7fgs4v374qc935pvf9v00000gn/T/ipykernel_95390/491459327.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n", " grouped_penguins.mean()\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bill_length_mmbill_depth_mmflipper_length_mmbody_mass_gyear
species
Adelie38.79139118.346358189.9536423700.6622522008.013158
Chinstrap48.83382418.420588195.8235293733.0882352007.970588
Gentoo47.50487814.982114217.1869925076.0162602008.080645
\n", "
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n", "species \n", "Adelie 38.791391 18.346358 189.953642 3700.662252 \n", "Chinstrap 48.833824 18.420588 195.823529 3733.088235 \n", "Gentoo 47.504878 14.982114 217.186992 5076.016260 \n", "\n", " year \n", "species \n", "Adelie 2008.013158 \n", "Chinstrap 2007.970588 \n", "Gentoo 2008.080645 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_penguins.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a good example of a warning for how we are supposed to use a function. Here the behavior is changing and the developers are warning us of a future change that might break our code. We can try to fix it:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bill_length_mmbill_depth_mmflipper_length_mmbody_mass_gyear
species
Adelie38.79139118.346358189.9536423700.6622522008.013158
Chinstrap48.83382418.420588195.8235293733.0882352007.970588
Gentoo47.50487814.982114217.1869925076.0162602008.080645
\n", "
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n", "species \n", "Adelie 38.791391 18.346358 189.953642 3700.662252 \n", "Chinstrap 48.833824 18.420588 195.823529 3733.088235 \n", "Gentoo 47.504878 14.982114 217.186992 5076.016260 \n", "\n", " year \n", "species \n", "Adelie 2008.013158 \n", "Chinstrap 2007.970588 \n", "Gentoo 2008.080645 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_penguins.mean(numeric_only=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "QU-8oVg-CZpG" }, "source": [ "As you can see Pandas automatically computes the mean for each category and each column. The output then is a DataFrame where each line corresponds to a given category. One can push this further by using e.g. multiple columns for grouping, but this goes beyond the present course." ] }, { "cell_type": "markdown", "metadata": { "id": "H2ZRaOyyzYII" }, "source": [ "## Exercises\n", "\n", "1. Load a first DataFrame that contains information about covid19 vaccination locations: https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/locations.csv \n", "2. Load a second DataFrame which contains daily information about vaccination for every country: https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/vaccinations.csv\n", "3. Create a sub-dataframe of the table in (1) that only contains the fields ```location``` and ```vaccines```.\n", "4. Merge the two tables using the ```location``` field as key for merging. Use once ```left``` merge and once ```right``` marge.\n", "5. Do you see a difference in the result?\n", "5b. Bonus question: knowing that you can use the ```.unique()``` method on columns, and that there's a function called ```np.isin``` that compares two lists, can you figure out which country is missing ?\n", "6. 
Using the table from (2), group the data by ```location``` and calculate the average number of ```daily_vaccinations```." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "II0smv8RPQTc" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "09-Pandas_combine.ipynb", "provenance": [ { "file_id": "https://github.com/guiwitz/NumpyPandas_course/blob/colab/10-DA_Pandas_combine.ipynb", "timestamp": 1614296569460 } ], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 4 }