10. Combining and grouping DataFrames#

import numpy as np
import pandas as pd

Two of the most important operations one can perform with DataFrames are 1) combining multiple sources of information and 2) efficiently grouping and summarizing information. Both of these topics are vast, and here again we only present a few important approaches.

Concatenation#

The simplest case where we need to combine multiple sources of information is when those sources are of the same “type”, i.e. they have the same columns but different entries. Imagine for example lists of participants to an event broken into multiple files, with names starting with A-K, L-S, etc. In that case we can just “glue” the datasets together, or more formally concatenate them. Here we load two Excel sheets with composer information:

file_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'
composers1 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet1')
composers1
birth death city
composer
Mahler 1860 1911 Kaliste
Beethoven 1770 1827 Bonn
Puccini 1858 1924 Lucques
Shostakovich 1906 1975 Saint-Petersburg
composers2 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet3')
composers2
birth death city
composer
Verdi 1813 1901 Roncole
Dvorak 1841 1904 Nelahozeves
Schumann 1810 1856 Zwickau
Stravinsky 1882 1971 Oranienbaum
Mahler 1860 1911 Kaliste

We see that we have the same information in both tables, but for different composers. We can just concatenate them using the pd.concat function, providing the two DataFrames in a list:

all_composers = pd.concat([composers1,composers2])
all_composers
birth death city
composer
Mahler 1860 1911 Kaliste
Beethoven 1770 1827 Bonn
Puccini 1858 1924 Lucques
Shostakovich 1906 1975 Saint-Petersburg
Verdi 1813 1901 Roncole
Dvorak 1841 1904 Nelahozeves
Schumann 1810 1856 Zwickau
Stravinsky 1882 1971 Oranienbaum
Mahler 1860 1911 Kaliste
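
If we also want to keep track of which table each row came from, pd.concat accepts a keys argument that adds an extra index level; a small sketch (the labels 'sheet1' and 'sheet3' are our own choice):

pd.concat([composers1, composers2], keys=['sheet1', 'sheet3'])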

One potential problem is that the two tables contain duplicated information, like the Mahler entry here. Pandas offers many useful functions to clean up data, for example drop_duplicates():

all_composers.drop_duplicates()
birth death city
composer
Mahler 1860 1911 Kaliste
Beethoven 1770 1827 Bonn
Puccini 1858 1924 Lucques
Shostakovich 1906 1975 Saint-Petersburg
Verdi 1813 1901 Roncole
Dvorak 1841 1904 Nelahozeves
Schumann 1810 1856 Zwickau
Stravinsky 1882 1971 Oranienbaum
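
By default, drop_duplicates keeps the first occurrence of each duplicated row. As a small sketch of two related, standard options:

# keep the last occurrence of each duplicated row instead of the first
all_composers.drop_duplicates(keep='last')

# flag duplicated rows as a boolean Series instead of removing them
all_composers.duplicated()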

Joining two tables#

A slightly more complex case is when we have two datasets with (almost) the same items, in our case composers, but with different information that we want to combine. For example:

composers1 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet1')
composers1
birth death city
composer
Mahler 1860 1911 Kaliste
Beethoven 1770 1827 Bonn
Puccini 1858 1924 Lucques
Shostakovich 1906 1975 Saint-Petersburg
composers2 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet4')
composers2
first name
composer
Mahler Gustav
Beethoven Ludwig van
Puccini Giacomo
Brahms Johannes

We cannot concatenate these tables as they have different columns. However, we can go through both tables and, for each item, merge the information from the two sources. Note that each table has one element that doesn’t appear in the other (Shostakovich and Brahms).

We can do this merging operation with the join function, which is used as left_table.join(right_table):

composers1.join(composers2)
birth death city first name
composer
Mahler 1860 1911 Kaliste Gustav
Beethoven 1770 1827 Bonn Ludwig van
Puccini 1858 1924 Lucques Giacomo
Shostakovich 1906 1975 Saint-Petersburg NaN

We see that:

  1. Pandas used the index to match and merge the elements.

  2. Brahms, which is missing from composers1, is absent.

  3. The first name of Shostakovich is missing since that composer is absent from composers2.

In other terms, join used all the elements of the left table and only the matching ones from the right table. This is just a default behavior that can be overridden using the how option.

In total we have four possibilities, as illustrated here (the first one, keeping all elements of the left table, is the default we just saw):

from IPython.display import Image
Image(url='https://github.com/guiwitz/ISDAwPython_day2/raw/master/images/left_right.jpeg',width=700)
  1. Keep all elements of the right table and complete with the matching ones from the left table:

composers1.join(composers2, how='right')
birth death city first name
composer
Mahler 1860.0 1911.0 Kaliste Gustav
Beethoven 1770.0 1827.0 Bonn Ludwig van
Puccini 1858.0 1924.0 Lucques Giacomo
Brahms NaN NaN NaN Johannes
  2. Keep all elements of both tables:

composers1.join(composers2, how='outer')
birth death city first name
composer
Beethoven 1770.0 1827.0 Bonn Ludwig van
Brahms NaN NaN NaN Johannes
Mahler 1860.0 1911.0 Kaliste Gustav
Puccini 1858.0 1924.0 Lucques Giacomo
Shostakovich 1906.0 1975.0 Saint-Petersburg NaN
  3. Keep only elements present in both tables:

composers1.join(composers2, how='inner')
birth death city first name
composer
Mahler 1860 1911 Kaliste Gustav
Beethoven 1770 1827 Bonn Ludwig van
Puccini 1858 1924 Lucques Giacomo
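
For completeness, the default left join can also be requested explicitly with how='left'; this reproduces the very first join result above:

composers1.join(composers2, how='left')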

More general approach#

Here we used the index to decide which items to combine, but more generally one can use any column for merging. The pd.merge function offers more flexibility in this regard. We show only one example here:

composers1 = pd.read_excel(file_url, sheet_name='Sheet1')
composers2 = pd.read_excel(file_url, sheet_name='Sheet6')
composers1
composer birth death city
0 Mahler 1860 1911 Kaliste
1 Beethoven 1770 1827 Bonn
2 Puccini 1858 1924 Lucques
3 Shostakovich 1906 1975 Saint-Petersburg
composers2
last name first name
0 Puccini Giacomo
1 Beethoven Ludwig van
2 Brahms Johannes
3 Mahler Gustav
pd.merge(composers1, composers2, left_on='composer', right_on='last name', how='outer')
composer birth death city last name first name
0 Mahler 1860.0 1911.0 Kaliste Mahler Gustav
1 Beethoven 1770.0 1827.0 Bonn Beethoven Ludwig van
2 Puccini 1858.0 1924.0 Lucques Puccini Giacomo
3 Shostakovich 1906.0 1975.0 Saint-Petersburg NaN NaN
4 NaN NaN NaN NaN Brahms Johannes

The concepts are the same as for the join() function. The first table (composers1) is the left table, the second one (composers2) is the right table. We then specify with left_on and right_on which columns to use for the merging. Finally, the how option works the same way.
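
If it is unclear where each row of the merged table comes from, pd.merge also accepts an indicator option that adds a _merge column marking each row as coming from the left table only, the right table only, or both; a small sketch:

pd.merge(composers1, composers2, left_on='composer', right_on='last name',
         how='outer', indicator=True)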

Grouping#

Very often, when you want to calculate statistics on a dataset, you first need to group some of the data. Let’s look at our penguin dataset:

import pandas as pd
penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
penguins.head(3)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
penguins.species.unique()
array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)
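
The related value_counts method would additionally tell us how many rows we have per species; a quick sketch:

penguins['species'].value_counts()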

We see that we have data for three different species of penguins. What if we want the average weight of the penguins by species? One solution would be to write a for loop, e.g.:

weights = []
for species in penguins.species.unique():
    mean_w = penguins.loc[penguins['species'] == species, 'body_mass_g'].mean()
    weights.append(mean_w)
weights
[3700.662251655629, 5076.016260162602, 3733.0882352941176]

Pandas offers a much simpler solution for this kind of operation with the .groupby function. Here we only specify which column should be used to form the groups, and it does the work for us:

grouped_penguins = penguins.groupby('species')
grouped_penguins
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12f02b700>

The output is not directly visible: it’s a collection of tables grouped according to species. Several methods allow us to access these subgroups, e.g.:

grouped_penguins.get_group('Adelie')
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
... ... ... ... ... ... ... ... ...
147 Adelie Dream 36.6 18.4 184.0 3475.0 female 2009
148 Adelie Dream 36.0 17.8 195.0 3450.0 female 2009
149 Adelie Dream 37.8 18.1 193.0 3750.0 male 2009
150 Adelie Dream 36.0 17.1 187.0 3700.0 female 2009
151 Adelie Dream 41.5 18.5 201.0 4000.0 male 2009

152 rows × 8 columns
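
The grouped object is also iterable: looping over it yields each group name together with the corresponding sub-table. A minimal sketch:

# print the name and size of each group
for name, group in grouped_penguins:
    print(name, len(group))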

However, we won’t go through the details of the grouped data structure here. What we can simply do is apply functions directly on the grouped object, for example to get the mean:

grouped_penguins.mean()
/var/folders/mk/632_7fgs4v374qc935pvf9v00000gn/T/ipykernel_95390/491459327.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  grouped_penguins.mean()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
species
Adelie 38.791391 18.346358 189.953642 3700.662252 2008.013158
Chinstrap 48.833824 18.420588 195.823529 3733.088235 2007.970588
Gentoo 47.504878 14.982114 217.186992 5076.016260 2008.080645

This is a good example of a warning about how we are supposed to use a function: the behavior is changing, and the developers warn us of a future change that might break our code. We can fix it by following their suggestion:

grouped_penguins.mean(numeric_only=True)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
species
Adelie 38.791391 18.346358 189.953642 3700.662252 2008.013158
Chinstrap 48.833824 18.420588 195.823529 3733.088235 2007.970588
Gentoo 47.504878 14.982114 217.186992 5076.016260 2008.080645

As you can see, Pandas automatically computes the mean for each category and each column. The output is a DataFrame where each line corresponds to a given category. One can push this further, e.g. by using multiple columns for grouping, as in the sketch below, although the details go beyond the present course.
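
A minimal sketch of such a grouping by two columns, computing two statistics at once (the choice of the species and sex columns and of the mean and count aggregations is ours):

penguins.groupby(['species', 'sex'])['body_mass_g'].agg(['mean', 'count'])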

Exercises#

  1. Load a first DataFrame that contains information about COVID-19 vaccination locations: owid/covid-19-data

  2. Load a second DataFrame which contains daily information about vaccination for every country: owid/covid-19-data

  3. Create a sub-dataframe of the table in (1) that only contains the fields location and vaccines.

  4. Merge the two tables using the location field as key for merging. Use once a left merge and once a right merge.

  5. Do you see a difference in the result? 5b. Bonus question: knowing that you can use the .unique() method on columns, and that there’s a function called np.isin that compares two lists, can you figure out which country is missing?

  6. Using the table from (2), group the data by location and calculate the average number of daily_vaccinations.