10. Combining and grouping DataFrames#
import numpy as np
import pandas as pd
Two of the most important operations one can do with DataFrames are 1) combining multiple sources of information and 2) efficiently grouping and summarizing information. Both of these topics are vast, and here again we only present a few important approaches.
Concatenation#
The simplest case where we need to combine multiple sources of information is when those sources are of the same “type”, i.e. they have the same columns but different entries. Imagine for example lists of participants to an event split across multiple files covering names starting with A-K, L-S, etc. In that case we can just “glue” the datasets together, or more formally, concatenate them. Here we load two Excel sheets with composer information:
file_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'
composers1 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet1')
composers1
| composer | birth | death | city |
| --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |
composers2 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet3')
composers2
| composer | birth | death | city |
| --- | --- | --- | --- |
| Verdi | 1813 | 1901 | Roncole |
| Dvorak | 1841 | 1904 | Nelahozeves |
| Schumann | 1810 | 1856 | Zwickau |
| Stravinsky | 1882 | 1971 | Oranienbaum |
| Mahler | 1860 | 1911 | Kaliste |
We see that we have the same information in both tables, but for different composers. We can just concatenate them using the `pd.concat` function, providing the two DataFrames in a list:
all_composers = pd.concat([composers1, composers2])
all_composers
| composer | birth | death | city |
| --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |
| Verdi | 1813 | 1901 | Roncole |
| Dvorak | 1841 | 1904 | Nelahozeves |
| Schumann | 1810 | 1856 | Zwickau |
| Stravinsky | 1882 | 1971 | Oranienbaum |
| Mahler | 1860 | 1911 | Kaliste |
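As an aside, if you later need to know which table each row came from, `pd.concat` also accepts a `keys` argument that adds an extra index level. A minimal sketch (the labels are arbitrary):

# label each source table; the result gets a two-level index (label, composer)
labeled = pd.concat([composers1, composers2], keys=['first_file', 'second_file'])
# recover the rows that came from the second table via the added index level
labeled.loc['second_file']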
One potential problem is that the two tables contain duplicated information, like the Mahler entry here. Pandas offers many useful functions to clean up data, for example `drop_duplicates()`:
all_composers.drop_duplicates()
| composer | birth | death | city |
| --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |
| Verdi | 1813 | 1901 | Roncole |
| Dvorak | 1841 | 1904 | Nelahozeves |
| Schumann | 1810 | 1856 | Zwickau |
| Stravinsky | 1882 | 1971 | Oranienbaum |
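Note that by default `drop_duplicates()` compares entire rows. If duplicates should be detected from certain columns only, the `subset` and `keep` arguments can be used. A small sketch (the column choice here is just illustrative):

# treat rows as duplicates when birth and death years match, keeping the
# first occurrence (keep='first' is the default)
all_composers.drop_duplicates(subset=['birth', 'death'], keep='first')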
Joining two tables#
A slightly more complex case is when we have two datasets with (almost) the same items, in our case composers, but with different information that we want to combine. For example:
composers1 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet1')
composers1
| composer | birth | death | city |
| --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |
composers2 = pd.read_excel(file_url, index_col='composer', sheet_name='Sheet4')
composers2
| composer | first name |
| --- | --- |
| Mahler | Gustav |
| Beethoven | Ludwig van |
| Puccini | Giacomo |
| Brahms | Johannes |
We cannot simply concatenate the tables as they have different columns. Instead, we want to go through both tables and merge, for each item, the information from the two tables. Note that each table has one element that doesn't appear in the other (Shostakovich and Brahms).
We can do this merging operation with the `join` method, which is used like this: `left_table.join(right_table)`:
composers1.join(composers2)
| composer | birth | death | city | first name |
| --- | --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste | Gustav |
| Beethoven | 1770 | 1827 | Bonn | Ludwig van |
| Puccini | 1858 | 1924 | Lucques | Giacomo |
| Shostakovich | 1906 | 1975 | Saint-Petersburg | NaN |
We see that:

* Pandas used the index to merge all elements
* Brahms, which was missing in `composers1`, is absent
* The first name of Shostakovich is missing, since that composer was absent in `composers2`
In other words, `join` used all the elements of the `left` table and only those matching from the `right` table. This is just a default behavior that can be overridden using the `how` option.
In total we have four possibilities as illustrated here:
from IPython.display import Image
Image(url='https://github.com/guiwitz/ISDAwPython_day2/raw/master/images/left_right.jpeg',width=700)
Keep all elements of the `right` table and complete with the matching ones from the `left` table:
composers1.join(composers2, how='right')
| composer | birth | death | city | first name |
| --- | --- | --- | --- | --- |
| Mahler | 1860.0 | 1911.0 | Kaliste | Gustav |
| Beethoven | 1770.0 | 1827.0 | Bonn | Ludwig van |
| Puccini | 1858.0 | 1924.0 | Lucques | Giacomo |
| Brahms | NaN | NaN | NaN | Johannes |
Keep all elements of both tables:
composers1.join(composers2, how='outer')
| composer | birth | death | city | first name |
| --- | --- | --- | --- | --- |
| Beethoven | 1770.0 | 1827.0 | Bonn | Ludwig van |
| Brahms | NaN | NaN | NaN | Johannes |
| Mahler | 1860.0 | 1911.0 | Kaliste | Gustav |
| Puccini | 1858.0 | 1924.0 | Lucques | Giacomo |
| Shostakovich | 1906.0 | 1975.0 | Saint-Petersburg | NaN |
Keep only elements present in both tables:
composers1.join(composers2, how='inner')
| composer | birth | death | city | first name |
| --- | --- | --- | --- | --- |
| Mahler | 1860 | 1911 | Kaliste | Gustav |
| Beethoven | 1770 | 1827 | Bonn | Ludwig van |
| Puccini | 1858 | 1924 | Lucques | Giacomo |
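One caveat worth knowing: `join` raises an error if the two tables share a column name. The `lsuffix` and `rsuffix` arguments can be used to disambiguate such overlapping columns. A sketch (with our tables there is no overlap, so the suffixes simply go unused):

# if both tables had e.g. a 'city' column, it would appear as
# 'city_left' and 'city_right' in the result
composers1.join(composers2, lsuffix='_left', rsuffix='_right')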
More general approach#
Here we used the index to know which items to combine, but more generally one can use any column for merging. The `pd.merge` function offers more flexibility in this regard. We show only one example here:
composers1 = pd.read_excel(file_url, sheet_name='Sheet1')
composers2 = pd.read_excel(file_url, sheet_name='Sheet6')
composers1
|   | composer | birth | death | city |
| --- | --- | --- | --- | --- |
| 0 | Mahler | 1860 | 1911 | Kaliste |
| 1 | Beethoven | 1770 | 1827 | Bonn |
| 2 | Puccini | 1858 | 1924 | Lucques |
| 3 | Shostakovich | 1906 | 1975 | Saint-Petersburg |
composers2
|   | last name | first name |
| --- | --- | --- |
| 0 | Puccini | Giacomo |
| 1 | Beethoven | Ludwig van |
| 2 | Brahms | Johannes |
| 3 | Mahler | Gustav |
pd.merge(composers1, composers2, left_on='composer', right_on='last name', how='outer')
|   | composer | birth | death | city | last name | first name |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Mahler | 1860.0 | 1911.0 | Kaliste | Mahler | Gustav |
| 1 | Beethoven | 1770.0 | 1827.0 | Bonn | Beethoven | Ludwig van |
| 2 | Puccini | 1858.0 | 1924.0 | Lucques | Puccini | Giacomo |
| 3 | Shostakovich | 1906.0 | 1975.0 | Saint-Petersburg | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | Brahms | Johannes |
The concepts are the same as for the `join()` method. The first table (`composers1`) is the `left` table, the second one (`composers2`) is the `right` table. We then specify in `left_on` and `right_on` which columns to use for the merging. Finally, the `how` option works the same way.
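A handy option when checking a merge is `indicator=True`, which adds a `_merge` column telling, for each row, whether it was found only in the left table, only in the right table, or in both. A short sketch:

# the extra '_merge' column takes the values 'left_only', 'right_only' or 'both'
pd.merge(composers1, composers2, left_on='composer', right_on='last name', how='outer', indicator=True)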
Grouping#
Very often, when you want to calculate statistics on a dataset, you first need to group some of the data. Let's look at our penguin dataset:
import pandas as pd
penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
penguins.head(3)
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
penguins.species.unique()
array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)
We see that we have data for three different species of penguins. What if we want the average weight of penguins by species? One solution would be to write a for loop, e.g.:
weights = []
for species in penguins.species.unique():
    # mean body mass over the rows belonging to one species
    mean_w = penguins.loc[penguins['species'] == species, 'body_mass_g'].mean()
    weights.append(mean_w)
weights
[3700.662251655629, 5076.016260162602, 3733.0882352941176]
Pandas offers a much simpler solution for this kind of operation with the `groupby` function. Here we only specify which column(s) should be used to form groups, and it does the work for us:
grouped_penguins = penguins.groupby('species')
grouped_penguins
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12f02b700>
The output is not directly visible: it's a collection of sub-tables, one per species. Several methods let us retrieve one of these subgroups, e.g.:
grouped_penguins.get_group('Adelie')
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 147 | Adelie | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | female | 2009 |
| 148 | Adelie | Dream | 36.0 | 17.8 | 195.0 | 3450.0 | female | 2009 |
| 149 | Adelie | Dream | 37.8 | 18.1 | 193.0 | 3750.0 | male | 2009 |
| 150 | Adelie | Dream | 36.0 | 17.1 | 187.0 | 3700.0 | female | 2009 |
| 151 | Adelie | Dream | 41.5 | 18.5 | 201.0 | 4000.0 | male | 2009 |

152 rows × 8 columns
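If you do want to look at the groups yourself, the grouped object is iterable and yields pairs of group name and sub-DataFrame, and its `groups` attribute maps each group name to the corresponding row labels. A short sketch:

# print each group name together with the number of rows in its sub-table
for name, group in grouped_penguins:
    print(name, len(group))

# dict-like view: group name -> row labels
grouped_penguins.groups.keys()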
However, we won't go through the details of the grouped data structure here. What we can simply do is apply functions directly on the grouped object. For example, to get the mean:
grouped_penguins.mean()
/var/folders/mk/632_7fgs4v374qc935pvf9v00000gn/T/ipykernel_95390/491459327.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
grouped_penguins.mean()
| species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
| --- | --- | --- | --- | --- | --- |
| Adelie | 38.791391 | 18.346358 | 189.953642 | 3700.662252 | 2008.013158 |
| Chinstrap | 48.833824 | 18.420588 | 195.823529 | 3733.088235 | 2007.970588 |
| Gentoo | 47.504878 | 14.982114 | 217.186992 | 5076.016260 | 2008.080645 |
This is a good example of a warning about how we are supposed to use a function: here the behavior is changing, and the developers warn us of a future change that might break our code. We can try to fix it by following their suggestion:
grouped_penguins.mean(numeric_only=True)
| species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
| --- | --- | --- | --- | --- | --- |
| Adelie | 38.791391 | 18.346358 | 189.953642 | 3700.662252 | 2008.013158 |
| Chinstrap | 48.833824 | 18.420588 | 195.823529 | 3733.088235 | 2007.970588 |
| Gentoo | 47.504878 | 14.982114 | 217.186992 | 5076.016260 | 2008.080645 |
As you can see, Pandas automatically computes the mean for each category and each column. The output is a DataFrame where each row corresponds to a given category. One can push this further by using, for example, multiple columns for grouping, but this goes beyond the present course.
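As a small teaser of what this looks like (a sketch, not covered further here), one could group by both species and sex:

# average body mass per (species, sex) combination
penguins.groupby(['species', 'sex'])['body_mass_g'].mean()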
Exercises#
1. Load a first DataFrame that contains information about covid19 vaccination locations: owid/covid-19-data
2. Load a second DataFrame which contains daily information about vaccination for every country: owid/covid-19-data
3. Create a sub-dataframe of the table in (1) that only contains the fields `location` and `vaccines`.
4. Merge the two tables using the `location` field as key for merging. Use once a `left` merge and once a `right` merge.
5. Do you see a difference in the result?

   5b. Bonus question: knowing that you can use the `.unique()` method on columns, and that there's a function called `np.isin` that compares two lists, can you figure out which country is missing?
6. Using the table from (2), group the data by `location` and calculate the average number of `daily_vaccinations`.