15. Seaborn: distributions and relations#

As we have just seen, seaborn makes it possible to draw complex plots very quickly in order to explore data. Here we further explore the capabilities offered by the package to quickly visualize relations and distributions.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

diams = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Diamond.csv', index_col=0)

Generalistic functions#

Seaborn has several general-purpose functions offering a framework to display distributions or relations. They essentially build on simpler functions like scatterplot or histplot, but make it quick to switch between visualization types.

As we have already seen the options for scatter plots, here we explore catplot, which gives access to a long list of other plotting functions such as boxplots, swarmplots, etc., all working with the same syntax.

Let’s see what the default settings produce when we want to visualize the price distribution (y) for each colour (x):

g = sns.catplot(data=diams, x='colour', y='price');
_images/899fc7b1897621cdb25d0004e3f680cee4b6bd5862bba8e7e5a255e3ee1c76d4.png

The result is a very reasonable strip plot: a scatter plot in which one variable is categorical. As you can see, some jitter is added to help visualize all points. Note that the output of catplot is a FacetGrid object, not a simple axis. We won't explore facet grids in detail, but just know that you can still access the axes and the figure composing this object with:

g
<seaborn.axisgrid.FacetGrid at 0x15abe1690>
g.ax
<Axes: xlabel='colour', ylabel='price'>
g.fig
_images/899fc7b1897621cdb25d0004e3f680cee4b6bd5862bba8e7e5a255e3ee1c76d4.png
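Since g.ax and g.fig are ordinary Matplotlib objects, they can be modified after the plot has been created. The sketch below illustrates this on a small made-up table (the toy DataFrame and its values are assumptions for illustration, not the diamonds data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical toy table standing in for the diamonds data
toy = pd.DataFrame({
    "colour": ["D", "D", "E", "E", "F", "F"],
    "price": [1302, 1641, 1510, 1700, 1555, 1427],
})

g = sns.catplot(data=toy, x="colour", y="price")

# g.ax is a regular Matplotlib Axes, so the usual methods apply
g.ax.set_title("Price per colour")
g.ax.set_ylabel("price (toy values)")

# g.fig is the Matplotlib Figure holding the grid
g.fig.set_size_inches(5, 3)
```

Note that g.ax is only available when the grid contains a single plot, which is the case here.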

Customization#

As usual, you have many options to customize the plot, some of them specific to the current plot type, such as jitter here:

sns.catplot(data=diams, x='colour', y='price', jitter=0.3);
_images/1a5316cecc3ca9f8436971aa62c6e7a64d83fd007edfd3f05244bf5b66a8c0b6.png
diams.head()
   carat colour clarity certification  price
1   0.30      D     VS2           GIA   1302
2   0.30      E     VS1           GIA   1510
3   0.30      G    VVS1           GIA   1510
4   0.30      G     VS1           GIA   1260
5   0.31      D     VS1           GIA   1641

Also, just like in the regular plots, we can use aesthetics. For example, here we can use the hue to represent the certification:

sns.catplot(data=diams, x='colour', y='price', hue='certification', jitter=.2);
_images/5d43658ecd986fdebbb01b125fce1b19245db52d69b01538fdca3c03e2646e9c.png

Picking the type of plot#

The catplot has one essential parameter, kind, which allows us to select alternative types of plots. For example we can produce box plots or violin plots:

sns.catplot(data=diams, x='colour', y='price', kind='box');
_images/8ab85fda9a86ea33d308dd42b65afe2efca1e49dbf965119004e07bdbb2c557b.png
sns.catplot(data=diams, x='colour', y='price', kind='violin');
_images/dff4b0e54c89252c8d0eade245fc32aee588a651db7494b5c780be1ce2ad8fff.png

Of course, in each case you could use the corresponding base function directly, for example the boxplot function. Keep in mind that all these plots rely on Matplotlib, so in addition to the seaborn options you can usually also use the appropriate Matplotlib options. For example, here the boxprops option:

boxprops = dict(linewidth=3, color='red')
sns.boxplot(data=diams, x='colour', y='price', boxprops=boxprops);
_images/5091d4b103b85fd6304267efd70d1afc3bd0c1aa9831e32666c2c5208cef9952.png
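Other Matplotlib box plot options can be passed in the same way. As a sketch, assuming the same pass-through behaviour as for boxprops, we can style the median line and the whiskers with medianprops and whiskerprops (the toy data below is made up for the example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the diamonds table
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "colour": np.repeat(["D", "E", "F"], 20),
    "price": rng.normal(loc=1500, scale=200, size=60),
})

fig, ax = plt.subplots()
# medianprops and whiskerprops are Matplotlib box plot options
returned_ax = sns.boxplot(
    data=toy, x="colour", y="price", ax=ax,
    medianprops=dict(color="red", linewidth=2),
    whiskerprops=dict(linestyle="--"),
)
```

The base function returns the Matplotlib Axes it drew on, so further Matplotlib calls can follow directly.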

Other plots#

catplot gives access to many more functions, as listed in the catplot documentation. For example, you can also easily create a count histogram:

sns.catplot(data=diams, x='colour', kind='count');
_images/930050eba7bc8c5a3b885e553fe8f73482d2e9c37630ae81fc08d5d2562cafb7.png
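The height of each bar in such a count plot is the number of rows in each category. If we need the numbers themselves, we can cross-check them with pandas. A small sketch on a made-up column (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy column standing in for diams['colour']
toy = pd.DataFrame({"colour": ["D", "E", "E", "G", "G", "G", "D"]})

# Number of rows per category, sorted by category label
counts = toy["colour"].value_counts().sort_index()
print(counts)
```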

Combining plots#

It is often interesting to combine multiple plots. For example, here we might want to overlay a strip plot on box plots to better visualize the distribution. Here again, we first create an axis object that both plotting calls can then draw on. We also use the base functions here:

fig, ax = plt.subplots()
sns.boxplot(data=diams, x='colour', y='price', ax=ax);
sns.stripplot(data=diams, x='colour', y='price', color='k',ax=ax);
_images/ded5779faca7a29eaf544a27f360d18a9907551e516fe2a734722225e88885d8.png
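When all points are overlaid, the outliers are drawn twice: once by the box plot and once as individual points. One possible variation, sketched here on made-up data, is to hide the box plot outliers with the Matplotlib option showfliers and to make the points smaller and semi-transparent (the toy table and the styling values are choices for this illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the diamonds table
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "colour": np.repeat(["D", "E", "F"], 20),
    "price": rng.normal(loc=1500, scale=200, size=60),
})

fig, ax = plt.subplots()
# Box plot without outlier markers, since the points are drawn separately
box_ax = sns.boxplot(data=toy, x="colour", y="price", showfliers=False, ax=ax)
# Individual points drawn on the same axis
strip_ax = sns.stripplot(data=toy, x="colour", y="price",
                         color="k", size=3, alpha=0.5, ax=ax)
```

Both functions draw on, and return, the same axis object, which is what makes the overlay possible.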

2D distributions#

Until now, we have only seen 1D distributions shown as histograms, and we have plotted 2D relations as scatter plots. However, we can also represent 2D distributions directly, which is particularly helpful if data points are very dense. For that we can simply use the histplot function, but with both x and y parameters.

sns.histplot(diams, x="carat", y="price");
_images/cf1e867a9c8367acb3e7fc53ed6a9920c01c74488759f8858c3a4236e3d2e5b6.png
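To make concrete what such a 2D histogram represents, we can compute the counts per rectangular bin ourselves with NumPy. This is only a sketch of the idea on simulated values, not seaborn's own implementation: the data, the seed and the number of bins are assumptions, and seaborn chooses its bins automatically.

```python
import numpy as np

# Hypothetical simulated data loosely mimicking carat and price
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 1.1, size=200)
price = 1000 + 9000 * carat + rng.normal(0, 500, size=200)

# Count the points falling in each cell of a 5 x 4 grid
counts, carat_edges, price_edges = np.histogram2d(carat, price, bins=(5, 4))

print(counts.shape)       # one row per carat bin, one column per price bin
print(int(counts.sum()))  # every point falls in exactly one cell
```

In the plot, each cell is coloured according to its count instead of being printed as a number.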

Alternatively, we can use the more general displot function, which works in the same way as catplot in that we can choose the kind of plot we want, such as hist or kde:

sns.displot(diams, x="carat", y="price");
_images/511f267cfbd2e86f53704ce684aeaa4726dad79e977cb1ee3c8bc686ee05c937.png
sns.displot(diams, x="carat", y="price", kind='kde');
_images/1ae14d4f8a40b9d15d9eeb3b8b6081a7d062c11f20a7f273ffc6f9ef38d9549a.png

Exploring with pairplots and jointplots#

Seaborn also offers a very efficient way to check relationships between multiple variables of a dataset with the pairplot and jointplot functions.

By default pairplot produces a grid with histograms and scatter plots for all pairs of numerical variables:

sns.pairplot(diams);
_images/390ddf9361121cab6a853c6fb1d7e42ac827426a848096eef8a20b1e669fbb9a.png

This can of course be adjusted. For example, we could use a kde on the diagonal:

sns.pairplot(diams, diag_kind='kde');
_images/92fa591178e4598cbd77ea1407796eccc75cb4f26c9f903737de4e024f02a85b.png

The jointplot function shows at the same time the relation between two variables and their individual distributions:

sns.jointplot(data=diams, x="carat", y="price");
_images/aaab062c1d2072974179589bffa20f04e866e1975e455f836ff3d442b4cb1c34.png

Exercise#

  1. Knowing that the type of plot below has the kind swarm for swarmplot, try to reproduce the figure below.

  2. Using displot show the 2D distribution of bill length and bill depth as shown in the second plot.

penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
penguins.head(5)
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007