14. Seaborn introduction#

We now switch to another plotting library that is particularly popular for statistics. Seaborn is built on top of Matplotlib, so you can re-use a lot of what you have just learned. However, the way plots are created and data are passed in is very different: Matplotlib is very explicit (and lengthy), while Seaborn is implicit.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
diams = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Diamond.csv')

For the sake of simplicity, we only keep three colours from the dataset:

diams = diams[(diams['colour']=='D') | (diams['colour']=='E') | (diams['colour']=='G')]
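The same selection can be written more compactly with pandas' `Series.isin`. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions, not part of the dataset) so that it runs without downloading anything, and it checks that both ways of filtering keep the same rows.

```python
import pandas as pd

# Hypothetical miniature stand-in for the diamonds table
toy = pd.DataFrame({
    'carat': [0.3, 0.5, 0.7, 0.9, 1.1],
    'colour': ['D', 'F', 'E', 'G', 'H'],
    'price': [1000, 2500, 4000, 6500, 9000],
})

# Chained comparisons, as in the cell above
chained = toy[(toy['colour'] == 'D') | (toy['colour'] == 'E') | (toy['colour'] == 'G')]

# Equivalent filter written with isin
compact = toy[toy['colour'].isin(['D', 'E', 'G'])]

print(compact)
```

The `isin` form is easier to extend if more colours need to be kept later.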

Matplotlib vs. Seaborn#

Let’s recall how we plotted price vs. carat in Matplotlib when we wanted a different marker for each colour:

fig, ax = plt.subplots()
ax.plot(diams[diams['colour']=='D']['carat'], diams[diams['colour']=='D']['price'], 'bo', alpha=.5);
ax.plot(diams[diams['colour']=='E']['carat'], diams[diams['colour']=='E']['price'], 'b*', alpha=.5);
ax.plot(diams[diams['colour']=='G']['carat'], diams[diams['colour']=='G']['price'], 'bD', alpha=.5);
_images/4c47fe93f2e7eedb63695d638173b3bd164e974166cde3738858c4a69252a883.png

In seaborn, we don’t have to manually “isolate” each colour. We can simply indicate:

  1. which DataFrame we want to use as data source

  2. which columns from the DataFrame should be represented on the x and y axis

  3. how the diamond colour variable should be represented, e.g. as colour or marker style

We can pass all these parameters to the scatterplot function:

ax = sns.scatterplot(data=diams, x='carat', y='price', style='colour');
_images/4006cc77c1d304a702f10466ae1e988471ec6c9924ef20dd18f9785588b0f1e1.png

As you can see, in addition to getting all the data plotted at once with a specific marker per colour, we get automated labelling of both axes using the column names, as well as a legend.
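To see what seaborn did for us, we can inspect the object returned by scatterplot, which is a regular Matplotlib axis. This is a minimal sketch that uses a small made-up DataFrame (`toy` is an assumption) instead of the downloaded data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical miniature stand-in for the diamonds table
toy = pd.DataFrame({
    'carat': [0.3, 0.5, 0.7, 0.9],
    'price': [1000, 2500, 4000, 6500],
    'colour': ['D', 'E', 'G', 'D'],
})

ax = sns.scatterplot(data=toy, x='carat', y='price', style='colour')

# The axis labels are taken from the column names
xlabel = ax.get_xlabel()
ylabel = ax.get_ylabel()

# The legend entries are the levels of the column passed to style
legend_entries = [text.get_text() for text in ax.get_legend().get_texts()]

print(xlabel, ylabel, legend_entries)
```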

Interaction with Matplotlib#

Previously, we always started by creating a figure and an axis object. seaborn automatically creates an axis object, but if we want to keep our previous logic, we can pass an existing ax object directly to the seaborn function. We can then use fig and ax as usual, for example to add a title, set formats etc.:

fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', style='colour', ax=ax);
ax.set_title('Diamonds');
_images/2a8ee139c2c3077f44a3be621882515e5900cdc80965fd03297e8c4ab3b8a8eb.png
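Passing ax becomes particularly useful when a figure has several panels, since each seaborn call can then be directed to its own axis. This is a minimal sketch, again with a made-up DataFrame (`toy`) and hypothetical panel titles:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical miniature stand-in for the diamonds table
toy = pd.DataFrame({
    'carat': [0.3, 0.5, 0.7, 0.9],
    'price': [1000, 2500, 4000, 6500],
    'colour': ['D', 'E', 'G', 'D'],
})

# One figure with two axes, created with Matplotlib as before
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: one marker per colour
sns.scatterplot(data=toy, x='carat', y='price', style='colour', ax=axes[0])
axes[0].set_title('With style')

# Right panel: same data without any grouping
sns.scatterplot(data=toy, x='carat', y='price', ax=axes[1])
axes[1].set_title('Without style')

fig.tight_layout()
```

Each panel remains a normal Matplotlib axis, so titles, limits and formats can be set independently.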

Categorical variables#

In the example above, we used the colour column to assign a specific marker in the plot. The colour here represents a category, in contrast to, for example, the weight (carat), which is a continuous variable. Categorical variables can be used to set different aspects of the plot, often called aesthetics (in ggplot for example). We have already seen the marker above, but it can also be the color, called hue in seaborn:

fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='colour', ax=ax);
_images/8de9715599c91fef3d78c09c1c9014f787529db0b8833e53e2d5c858292d48f1.png

or it can even be both:

fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='colour', style='colour', ax=ax);
_images/6baa8041aea20e9bf4b4d0e21afa3a1834369dae7079d9a5b3c382836fc0fe25.png

Finally, we can also use a variable to define the size of the marker, even though in this case it doesn’t make much sense to use the diamond colour to define the size:

fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', size='colour', ax=ax);
_images/eaf9fe77676c55ac8e2e08a557430a53b4370295de490e88a3ee1023a3cca45d.png

Note that sometimes we want to use a numerical column as a category. For example, we can add a column of random integers to the DataFrame and use it for colouring:

diams['random'] = np.random.randint(0,3, len(diams))
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='random', ax=ax);
_images/e8c5e6e7c18498701d085bec0e3be2916019b7387b4ea1ffabeb5151f2dac0a9.png

We see that since seaborn believes it is dealing with a numerical feature, the color scale is progressive and not ideal to distinguish actual categories. In such a case we can explicitly change the column to a category thanks to Pandas:

diams['random'] = pd.Categorical(diams['random'])
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='random', ax=ax);
_images/481b124df49940e3d011ea351b78edc039a4942775a6b9b6a71ae5104958733a.png
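The difference between the two plots comes only from the column's dtype, which we can check directly. This is a minimal sketch on a small stand-alone DataFrame (`demo` is a made-up name). `astype('category')` is shown as an alternative spelling that also yields a categorical column:

```python
import numpy as np
import pandas as pd

# Integer column, treated as numerical by seaborn
demo = pd.DataFrame({'random': np.random.randint(0, 3, 10)})
dtype_before = demo['random'].dtype

# Same conversion as in the cell above
demo['random'] = pd.Categorical(demo['random'])
dtype_after = demo['random'].dtype

# Alternative: convert a Series with astype
alt = pd.Series(np.random.randint(0, 3, 10)).astype('category')

print(dtype_before, dtype_after, alt.dtype)
```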

Other inputs#

Until now, we have seen how to plot data from a DataFrame. However, sometimes we want to use seaborn when we only have, for example, Numpy arrays. Suppose we have an array of x values, an array of y = x**2 values, and an array cat that contains categories:

x_array = np.arange(0, 10, 0.5)
y_array = x_array ** 2
cat = np.random.randint(0,3,len(y_array))

Now we can simply directly pass the arrays:

sns.scatterplot(x=x_array, y=y_array, hue=cat);
_images/67fa49f8f1416fd50b907ab9315e12d7396381f4aa68292d2443bf3a982179f2.png

Here again, we could turn the cat array into an actual category so that the color map is better suited:

sns.scatterplot(x=x_array, y=y_array, hue=pd.Categorical(cat));
_images/73d9ed16821eb9f286689127b2167478230155038b9f38880e3d2d8ec9b470c9.png

If we still want to enjoy the seaborn “goodies” such as automated axis labeling, we can also use a dictionary as input:

sns.scatterplot(data={'x var': x_array, 'y var': y_array, 'my cat': pd.Categorical(cat)},
                x='x var', y='y var', hue='my cat');
_images/b8d1a6b9368878a5855b0da80780c217e1468eb40f8a46cdacd489f7fa0bf7f3.png

Combining aesthetics#

We can now go one step further and use different groupings simultaneously. For example, we can assign the diamond colour to hue and the clarity to style:

fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='colour', style='clarity', ax=ax);
_images/11f7711c8cf59539a9aefc09fb70f7f594a22d30ceff22ea7e251f74e8579e21.png

Other plots#

All plots in seaborn share the same logic. For example, if we want a histogram of price by colour, we can use the histplot function:

sns.histplot(data=diams, x='price', hue='colour', multiple='stack', stat='density');
_images/9b8dd4ccb7f40110e2280a7545d428c7f84806fa660e43109bcd43c79e5bedfa.png

Adjusting colours#

For almost every plot in seaborn, you can easily adjust the colour palette with the palette option, using for example one of the Matplotlib colour maps. The seaborn documentation offers a very good discussion of which type of palette is most appropriate for each type of data. For example, we can switch the histogram above to the Set2 palette:

sns.histplot(data=diams, x='price', hue='colour', multiple='stack', stat='density', palette='Set2');
_images/a12ac46f84d9f20ea94e3e2f5b7fdd7d959fc55ea9dcbf287665fec5bb9e7f6a.png

You can also directly visualize palettes using the color_palette function to help you choose:

sns.color_palette('Set3')
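Outside of a notebook, where the palette is not rendered as coloured swatches, the returned object can still be inspected: it behaves like a list of RGB tuples. This is a minimal sketch; the choice of five colours is arbitrary.

```python
import seaborn as sns

# Ask for the first five colours of the Set3 palette
palette = sns.color_palette('Set3', n_colors=5)

# Each entry is an (r, g, b) tuple with values between 0 and 1
first_colour = palette[0]

# The same colours as hexadecimal strings
hex_codes = palette.as_hex()

print(len(palette), first_colour, hex_codes)
```

Such a list of colours can itself be passed to the palette option of a plotting function.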

Adjusting other properties#

Most of the time, seaborn has specific options to control how data points are rendered. For example, you can override the default markers selected in a scatter plot:

fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', style='colour', ax=ax, markers=['o','*','D']);
_images/58f3343c202cc2f885c0872bb0373bdc03b2cc948655531a5be41197ab888e72.png

In most cases you can, in principle, also use the options that are available in the underlying Matplotlib function. In the function's documentation you will see an option called kwargs with a message like this one for scatterplot:

kwargs : key, value mappings

Other keyword arguments are passed down to matplotlib.axes.Axes.scatter.

For example, we can use the s option to set the marker size, and edgecolor and facecolor to change the marker colours:

fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', style='colour', ax=ax, markers=['o','*','D'],
                     facecolor='yellow', s = 100, edgecolor='red');
_images/c2e3c0fbd111ea199b415cbcd64742ec9a92a5ae9105f5077b6d4962aa2fe2aa.png

Exercise#

Create a scatter plot of body_mass_g vs. bill_depth_mm, using colour and markers to distinguish species and sex as shown below. Try different palettes.

Also fix the labels and add a title.

penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')