10. Combining and grouping DataFrames#
import numpy as np
import pandas as pd
Two of the most important operations one can do with DataFrames are 1) combining multiple sources of information and 2) efficiently grouping and summarizing information. Both of these topics are vast, and here again we only present a few important approaches.
Concatenation#
The simplest case where we need to combine multiple sources of information is when those sources are of the same “type”, i.e. they have the same columns but different entries. Imagine for example lists of participants to an event split across multiple files covering names starting with A-K, L-S, etc. In that case we can just “glue” the datasets together, or more formally, concatenate them. Here we load two Excel sheets with composer information:
file_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'
composers1 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet1')
composers1
| composer | birth | death | city |
| --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |
composers2 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet3')
composers2
| composer | birth | death | city |
| --- | --- | --- | --- |
| Verdi | 1813 | 1901 | Roncole |
| Dvorak | 1841 | 1904 | Nelahozeves |
| Schumann | 1810 | 1856 | Zwickau |
| Stravinsky | 1882 | 1971 | Oranienbaum |
| Mahler | 1860 | 1911 | Kaliste |
We see that we have the same information in both tables, but for different composers. We can just concatenate them using the `pd.concat` function, providing the two DataFrames in a list:
all_composers = pd.concat([composers1, composers2])
all_composers
| composer | birth | death | city |
| --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |
| Verdi | 1813 | 1901 | Roncole |
| Dvorak | 1841 | 1904 | Nelahozeves |
| Schumann | 1810 | 1856 | Zwickau |
| Stravinsky | 1882 | 1971 | Oranienbaum |
| Mahler | 1860 | 1911 | Kaliste |
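As an aside, if you later need to know which table each row came from, `pd.concat` also accepts a `keys` argument that adds an extra index level. A minimal sketch (the labels are arbitrary):

# label each source table; the result gets a two-level index (label, composer)
labeled = pd.concat([composers1, composers2], keys=['first_file', 'second_file'])
# recover the rows that came from the second table via the added index level
labeled.loc['second_file']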
One potential problem is that the two tables contain duplicated information, like the Mahler entry here. Pandas offers many useful functions to clean up data, for example `drop_duplicates()`:
all_composers.drop_duplicates()
| composer | birth | death | city |
| --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |
| Verdi | 1813 | 1901 | Roncole |
| Dvorak | 1841 | 1904 | Nelahozeves |
| Schumann | 1810 | 1856 | Zwickau |
| Stravinsky | 1882 | 1971 | Oranienbaum |
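Note that by default `drop_duplicates()` compares entire rows. If duplicates should be detected from certain columns only, the `subset` and `keep` arguments can be used. A small sketch (the column choice here is just illustrative):

# treat rows as duplicates when birth and death years match, keeping the
# first occurrence (keep='first' is the default)
all_composers.drop_duplicates(subset=['birth', 'death'], keep='first')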
Joining two tables#
A slightly more complex case is when we have two datasets with (almost) the same items, in our case composers, but with different information that we want to combine. For example:
composers1 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet1')
composers1
| composer | birth | death | city |
| --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |
composers2 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet4')
composers2
| composer | first name |
| --- | --- |
| Mahler | Gustav |
| Beethoven | Ludwig van |
| Puccini | Giacomo |
| Brahms | Johannes |
We cannot simply concatenate the tables as they have different columns. Instead, we want to go through both tables and merge, for each item, the information from the two tables. Note that each table has one element that doesn't appear in the other (Shostakovich and Brahms).
We can do this merging operation with the `join` method, which is used like this: `left_table.join(right_table)`:
composers1.join(composers2)
| composer | birth | death | city | first name |
| --- | --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste | Gustav |
| Beethoven | 1770 | 1827 | Bonn | Ludwig van |
| Puccini | 1858 | 1924 | Lucques | Giacomo |
| Shostakovich | 1906 | 1975 | Saint-Petersburg | NaN |
We see that:

* Pandas used the index to merge all elements
* Brahms, which was missing in `composers1`, is absent
* The first name of Shostakovich is missing, since that composer was absent in `composers2`
In other words, `join` used all the elements of the `left` table and only those matching from the `right` table. This is just a default behavior that can be overridden using the `how` option.
In total we have four possibilities as illustrated here:
from IPython.display import Image
Image(url='https://github.com/guiwitz/ISDAwPython_day2/raw/master/images/left_right.jpeg',width=700)
Keep all elements of the `right` table and complete with the matching ones from the `left` table:
composers1.join(composers2, how='right')
| composer | birth | death | city | first name |
| --- | --- | --- | --- | --- |
| Mahler | 1860.0 | 1911.0 | Kaliste | Gustav |
| Beethoven | 1770.0 | 1827.0 | Bonn | Ludwig van |
| Puccini | 1858.0 | 1924.0 | Lucques | Giacomo |
| Brahms | NaN | NaN | NaN | Johannes |
Keep all elements of both tables:
composers1.join(composers2, how='outer')
| composer | birth | death | city | first name |
| --- | --- | --- | --- | --- |
| Beethoven | 1770.0 | 1827.0 | Bonn | Ludwig van |
| Brahms | NaN | NaN | NaN | Johannes |
| Mahler | 1860.0 | 1911.0 | Kaliste | Gustav |
| Puccini | 1858.0 | 1924.0 | Lucques | Giacomo |
| Shostakovich | 1906.0 | 1975.0 | Saint-Petersburg | NaN |
Keep only elements present in both tables:
composers1.join(composers2, how='inner')
| composer | birth | death | city | first name |
| --- | --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste | Gustav |
| Beethoven | 1770 | 1827 | Bonn | Ludwig van |
| Puccini | 1858 | 1924 | Lucques | Giacomo |
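One caveat worth knowing: `join` raises an error if the two tables share a column name. The `lsuffix` and `rsuffix` arguments can be used to disambiguate such overlapping columns. A sketch (with our tables there is no overlap, so the suffixes simply go unused):

# if both tables had e.g. a 'city' column, it would appear as
# 'city_left' and 'city_right' in the result
composers1.join(composers2, lsuffix='_left', rsuffix='_right')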
More general approach#
Here we used the index to know which items to combine, but more generally one can use any column for merging. The `pd.merge` function offers more flexibility in this regard. We show only one example here:
composers1 = pd.read_excel(file_url, sheet_name='Sheet1')
composers2 = pd.read_excel(file_url, sheet_name='Sheet6')
composers1
|   | composer | birth | death | city |
| --- | --- | --- | --- | --- |
| 0 | Mahler | 1860 | 1911 | Kaliste |
| 1 | Beethoven | 1770 | 1827 | Bonn |
| 2 | Puccini | 1858 | 1924 | Lucques |
| 3 | Shostakovich | 1906 | 1975 | Saint-Petersburg |
composers2
|   | last name | first name |
| --- | --- | --- |
| 0 | Puccini | Giacomo |
| 1 | Beethoven | Ludwig van |
| 2 | Brahms | Johannes |
| 3 | Mahler | Gustav |
pd.merge(composers1, composers2, left_on='composer', right_on='last name', how='outer')
|   | composer | birth | death | city | last name | first name |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Mahler | 1860.0 | 1911.0 | Kaliste | Mahler | Gustav |
| 1 | Beethoven | 1770.0 | 1827.0 | Bonn | Beethoven | Ludwig van |
| 2 | Puccini | 1858.0 | 1924.0 | Lucques | Puccini | Giacomo |
| 3 | Shostakovich | 1906.0 | 1975.0 | Saint-Petersburg | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | Brahms | Johannes |
The concepts are the same as for the `join()` method. The first table (`composers1`) is the `left` table, the second one (`composers2`) is the `right` table. We then specify in `left_on` and `right_on` which columns to use for the merging. Finally, the `how` option works the same way.
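A handy option when checking a merge is `indicator=True`, which adds a `_merge` column telling, for each row, whether it was found only in the left table, only in the right table, or in both. A short sketch:

# the extra '_merge' column takes the values 'left_only', 'right_only' or 'both'
pd.merge(composers1, composers2, left_on='composer', right_on='last name', how='outer', indicator=True)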
Grouping#
Very often, when you want to calculate statistics on a dataset, you first need to group some of the data. Let's look at our penguin dataset:
import pandas as pd
penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
penguins.head(3)
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
penguins.species.unique()
array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)
We see that we have data for three different species of penguins. What if we want the average weight of penguins by species? One solution would be to write a for loop, e.g.:
weights = []
for species in penguins.species.unique():
    # mean body mass over the rows belonging to one species
    mean_w = penguins.loc[penguins['species'] == species, 'body_mass_g'].mean()
    weights.append(mean_w)
weights
[3700.662251655629, 5076.016260162602, 3733.0882352941176]
Pandas offers a much simpler solution for this kind of operation with the `groupby` function. Here we only specify which column(s) should be used to form groups, and it does the work for us:
grouped_penguins = penguins.groupby('species')
grouped_penguins
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12f02b700>
The output is not directly visible: it's a collection of sub-tables, one per species. Several methods let us retrieve one of these subgroups, e.g.:
grouped_penguins.get_group('Adelie')
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 147 | Adelie | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | female | 2009 |
| 148 | Adelie | Dream | 36.0 | 17.8 | 195.0 | 3450.0 | female | 2009 |
| 149 | Adelie | Dream | 37.8 | 18.1 | 193.0 | 3750.0 | male | 2009 |
| 150 | Adelie | Dream | 36.0 | 17.1 | 187.0 | 3700.0 | female | 2009 |
| 151 | Adelie | Dream | 41.5 | 18.5 | 201.0 | 4000.0 | male | 2009 |

152 rows × 8 columns
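If you do want to look at the groups yourself, the grouped object is iterable and yields pairs of group name and sub-DataFrame, and its `groups` attribute maps each group name to the corresponding row labels. A short sketch:

# print each group name together with the number of rows in its sub-table
for name, group in grouped_penguins:
    print(name, len(group))

# dict-like view: group name -> row labels
grouped_penguins.groups.keys()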
However, we won't go through the details of the grouped data structure here. What we can simply do is apply functions directly on the grouped object. For example, to get the mean:
grouped_penguins.mean()
/var/folders/mk/632_7fgs4v374qc935pvf9v00000gn/T/ipykernel_95390/491459327.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
grouped_penguins.mean()
| species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
| --- | --- | --- | --- | --- | --- |
| Adelie | 38.791391 | 18.346358 | 189.953642 | 3700.662252 | 2008.013158 |
| Chinstrap | 48.833824 | 18.420588 | 195.823529 | 3733.088235 | 2007.970588 |
| Gentoo | 47.504878 | 14.982114 | 217.186992 | 5076.016260 | 2008.080645 |
This is a good example of a warning about how we are supposed to use a function: here the behavior is changing, and the developers warn us of a future change that might break our code. We can try to fix it by following their suggestion:
grouped_penguins.mean(numeric_only=True)
| species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
| --- | --- | --- | --- | --- | --- |
| Adelie | 38.791391 | 18.346358 | 189.953642 | 3700.662252 | 2008.013158 |
| Chinstrap | 48.833824 | 18.420588 | 195.823529 | 3733.088235 | 2007.970588 |
| Gentoo | 47.504878 | 14.982114 | 217.186992 | 5076.016260 | 2008.080645 |
As you can see, Pandas automatically computes the mean for each category and each column. The output is a DataFrame where each row corresponds to a given category. One can push this further by using, for example, multiple columns for grouping, but this goes beyond the present course.
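As a small teaser of what this looks like (a sketch, not covered further here), one could group by both species and sex:

# average body mass per (species, sex) combination
penguins.groupby(['species', 'sex'])['body_mass_g'].mean()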
Exercises#
1. Load a first DataFrame that contains information about covid19 vaccination locations: owid/covid-19-data
2. Load a second DataFrame which contains daily information about vaccination for every country: owid/covid-19-data
3. Create a sub-dataframe of the table in (1) that only contains the fields `location` and `vaccines`.
4. Merge the two tables using the `location` field as key for merging. Use once a `left` merge and once a `right` merge.
5. Do you see a difference in the result?

   5b. Bonus question: knowing that you can use the `.unique()` method on columns, and that there's a function called `np.isin` that compares two lists, can you figure out which country is missing?
6. Using the table from (2), group the data by `location` and calculate the average number of `daily_vaccinations`.