14. Seaborn introduction
We now switch to another plotting library that is particularly popular for statistics. It is built on top of Matplotlib, so you can re-use a lot of what you have just learned. However, the way plots are created and data is passed in is very different: Matplotlib is very explicit (and lengthy), while Seaborn is implicit.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
diams = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Diamond.csv')
For the sake of simplicity, we only keep three colours from the dataset:
diams = diams[(diams['colour']=='D') | (diams['colour']=='E') | (diams['colour']=='G')]
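The same selection can also be written with the Pandas isin method, which stays readable when more values are kept. This is a minimal sketch on a tiny made-up table (toy is a hypothetical stand-in, not the real dataset):

```python
import pandas as pd

# Tiny made-up stand-in for the diamonds table (NOT the real data)
toy = pd.DataFrame({
    'carat': [0.3, 0.4, 0.5, 0.6, 0.7],
    'colour': ['D', 'E', 'F', 'G', 'H'],
    'price': [1300, 1600, 2100, 2500, 3100],
})

# Chained comparisons, as in the cell above
chained = toy[(toy['colour'] == 'D') | (toy['colour'] == 'E') | (toy['colour'] == 'G')]

# Equivalent selection: keep rows whose colour is one of the listed values
kept = toy[toy['colour'].isin(['D', 'E', 'G'])]

print(kept)
```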
Matplotlib vs. Seaborn
Let’s remember how we plotted price vs. carat in Matplotlib when we wanted a different marker for each colour:
fig, ax = plt.subplots()
ax.plot(diams[diams['colour']=='D']['carat'], diams[diams['colour']=='D']['price'], 'bo', alpha=.5);
ax.plot(diams[diams['colour']=='E']['carat'], diams[diams['colour']=='E']['price'], 'b*', alpha=.5);
ax.plot(diams[diams['colour']=='G']['carat'], diams[diams['colour']=='G']['price'], 'bD', alpha=.5);
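To see how much manual work Matplotlib requires, the three calls above can be folded into a loop, but the axis labels and the legend still have to be added by hand. A sketch on a small made-up table (toy and the markers dictionary are illustrative names, not part of the original notebook):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Tiny made-up stand-in for the diamonds table (NOT the real data)
toy = pd.DataFrame({
    'carat': [0.3, 0.35, 0.5, 0.55, 0.7, 0.9],
    'colour': ['D', 'E', 'G', 'D', 'E', 'G'],
    'price': [1300, 1400, 2100, 2600, 3400, 4800],
})

# One marker per diamond colour, chosen by hand
markers = {'D': 'o', 'E': '*', 'G': 'D'}

fig, ax = plt.subplots()
for colour, group in toy.groupby('colour'):
    # linestyle='' draws markers only, like the 'bo' format string
    ax.plot(group['carat'], group['price'], linestyle='',
            marker=markers[colour], color='b', alpha=.5, label=colour)

# Labels and legend are not automatic in Matplotlib
ax.set_xlabel('carat')
ax.set_ylabel('price')
ax.legend(title='colour')
```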
In seaborn, we don’t have to manually “isolate” each colour. We can simply indicate:
- which DataFrame we want to use as data source
- which columns from the DataFrame should be represented on the x and y axis
- how the diamond colour variable should be represented, e.g. as colour or marker style
We can pass all these parameters to the scatterplot function:
ax = sns.scatterplot(data=diams, x='carat', y='price', style='colour');
As you can see, in addition to getting all the data plotted at once with a specific marker per colour, we get automated labelling of both axes using the column labels, as well as a legend.
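The automated labelling can be checked by reading the labels back from the returned axis object. A minimal sketch, assuming a small made-up table (toy is hypothetical) in place of the real data:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere; not needed in a notebook
import pandas as pd
import seaborn as sns

# Tiny made-up stand-in for the diamonds table (NOT the real data)
toy = pd.DataFrame({
    'carat': [0.3, 0.35, 0.5, 0.55, 0.7, 0.9],
    'colour': ['D', 'E', 'G', 'D', 'E', 'G'],
    'price': [1300, 1400, 2100, 2600, 3400, 4800],
})

ax = sns.scatterplot(data=toy, x='carat', y='price', style='colour')

# Read back what seaborn filled in for us
xlabel = ax.get_xlabel()
ylabel = ax.get_ylabel()
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(xlabel, ylabel, legend_labels)
```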
Interaction with Matplotlib
Previously we always first created a figure and an axis object. seaborn automatically creates an axis object, but if we want to keep our previous logic, we can pass an existing ax object directly to the seaborn function. We can then use it as usual, for example to add a title, set formats etc.:
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', style='colour', ax=ax);
ax.set_title('Diamonds');
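Passing ax is also what makes it possible to place seaborn plots in a grid of subplots. The sketch below uses a small made-up table (toy is hypothetical) and draws two variants side by side, then adjusts each axis with regular Matplotlib calls:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny made-up stand-in for the diamonds table (NOT the real data)
toy = pd.DataFrame({
    'carat': [0.3, 0.35, 0.5, 0.55, 0.7, 0.9],
    'colour': ['D', 'E', 'G', 'D', 'E', 'G'],
    'price': [1300, 1400, 2100, 2600, 3400, 4800],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# seaborn draws into the axis we pass and returns that same object
left = sns.scatterplot(data=toy, x='carat', y='price', style='colour', ax=axes[0])
sns.scatterplot(data=toy, x='carat', y='price', hue='colour', ax=axes[1])

# From here on, everything is plain Matplotlib
axes[0].set_title('Marker per colour')
axes[1].set_title('Hue per colour')
axes[1].set_yscale('log')
fig.tight_layout()
```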
Categorical variables
In the example above we used the colour column to assign a specific marker in the plot. The colour here represents a category, in contrast to the weight (carat), for example, which is a continuous variable. Categorical variables can be used to set different aspects of the plot, often called aesthetics (in ggplot for example). We have already seen the marker above, but it can also be the color, called hue in seaborn:
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='colour', ax=ax);
or it can even be both:
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='colour', style='colour', ax=ax);
Finally, we can also use a variable to define the size of the marker, even though in this case it doesn’t make much sense to use the diamond colour to define the size:
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', size='colour', ax=ax);
Note that sometimes we want to use a numerical column as a category. For example, we can add a random column to the DataFrame and use it for coloring:
diams['random'] = np.random.randint(0,3, len(diams))
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='random', ax=ax);
We see that since seaborn believes it is dealing with a numerical feature, the color scale is progressive and not ideal to distinguish actual categories. In such a case we can explicitly change the column to a category thanks to Pandas:
diams['random'] = pd.Categorical(diams['random'])
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='random', ax=ax);
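Whether a column will be treated as numerical or categorical can be checked through its dtype. A small sketch on a made-up Series (labels is a hypothetical name), also showing astype('category') as an alternative way to do the conversion:

```python
import pandas as pd

# Made-up integer labels standing in for the random column
labels = pd.Series([0, 2, 1, 0, 1, 2], name='random')

# Conversion as in the cell above
as_cat = pd.Categorical(labels)

# Alternative spelling that returns a Series with a categorical dtype
alt = labels.astype('category')

print(labels.dtype)            # integer dtype: handled as a number
print(alt.dtype)               # categorical dtype: handled as discrete groups
print(list(as_cat.categories))
```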
Other inputs
Until now we have seen how to plot data from a DataFrame. However, sometimes we want to use seaborn when we only have, for example, Numpy arrays. Suppose we have x values, a function that generates y = x**2 values, and an array cat that contains categories:
x_array = np.arange(0, 10, 0.5)
y_array = x_array ** 2
cat = np.random.randint(0,3,len(y_array))
Now we can simply pass the arrays directly:
sns.scatterplot(x=x_array, y=y_array, hue=cat);
Here again, we could turn the cat array into an actual category so that the color map is better suited:
sns.scatterplot(x=x_array, y=y_array, hue=pd.Categorical(cat));
If we still want to enjoy the seaborn “goodies” such as automated axis labeling, we can also use a dictionary as input:
sns.scatterplot(data = {'x var': x_array, 'y var': y_array, 'my cat': pd.Categorical(cat)},
x= 'x var', y='y var', hue='my cat');
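Another option is to wrap the arrays in a DataFrame first, which brings us back to the same pattern as before. A minimal sketch (df is an illustrative name, and the random generator is seeded only to make the example reproducible):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

x_array = np.arange(0, 10, 0.5)
y_array = x_array ** 2

# Seeded generator, an assumption made here for reproducibility
rng = np.random.default_rng(0)
cat = rng.integers(0, 3, len(y_array))

# Column names become the axis labels
df = pd.DataFrame({'x var': x_array, 'y var': y_array, 'my cat': pd.Categorical(cat)})

ax = sns.scatterplot(data=df, x='x var', y='y var', hue='my cat')
print(ax.get_xlabel(), '/', ax.get_ylabel())
```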
Combining aesthetics
We can now go one step further and use different groupings simultaneously. For example, we can assign the diamond colour to hue and the clarity to style:
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', hue='colour', style='clarity', ax=ax);
Other plots
All plots in seaborn share the same logic. For example, if we want a histogram of price by colour, we can use the histplot function:
sns.histplot(data=diams, x='price', hue='colour', multiple='stack', stat='density');
Adjusting colours
For almost every plot in seaborn, you can easily adjust the colour palette by using the palette option with one of the Matplotlib colour maps. The Seaborn documentation offers a very good discussion of which type of palette is most appropriate for each type of data. For example, we can change the histogram above to use the Set2 palette:
sns.histplot(data=diams, x='price', hue='colour', multiple='stack', stat='density', palette='Set2');
You can also directly visualize palettes using the color_palette function to help you choose:
sns.color_palette('Set3')
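Outside a notebook, the palette returned by color_palette can be used as a plain list of colours. As a sketch, the example below takes three colours from Set2 and assigns them explicitly to the diamond colours through a dictionary (colour_map and toy are illustrative names, and the table is made up):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere; not needed in a notebook
import pandas as pd
import seaborn as sns

# Tiny made-up stand-in for the diamonds table (NOT the real data)
toy = pd.DataFrame({
    'carat': [0.3, 0.35, 0.5, 0.55, 0.7, 0.9],
    'colour': ['D', 'E', 'G', 'D', 'E', 'G'],
    'price': [1300, 1400, 2100, 2600, 3400, 4800],
})

# Ask for exactly three colours from the Set2 palette
palette = sns.color_palette('Set2', 3)
hex_codes = palette.as_hex()
print(hex_codes)

# Fix which category gets which colour
colour_map = dict(zip(['D', 'E', 'G'], palette))
ax = sns.scatterplot(data=toy, x='carat', y='price', hue='colour', palette=colour_map)
```

Using a dictionary keeps the colour of each category stable across several plots, even if one of them only contains a subset of the categories.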
Adjusting other properties
Most of the time, seaborn has specific options to control how data points are rendered. For example, you can override the default markers selected in a scatter plot:
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', style='colour', ax=ax, markers=['o','*','D']);
In most cases you can, in principle, also use the options that are available in the underlying Matplotlib function. You will see an entry called kwargs in the documentation, with a message like this one for scatterplot:
kwargs : key, value mappings
Other keyword arguments are passed down to matplotlib.axes.Axes.scatter.
For example we can use the s option to set the marker size, and edgecolor and facecolor to change the marker colours:
fig, ax = plt.subplots()
ax = sns.scatterplot(data=diams, x='carat', y='price', style='colour', ax=ax, markers=['o','*','D'],
facecolor='yellow', s = 100, edgecolor='red');
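One way to confirm that such options reach Matplotlib is to inspect the collection that the scatter plot adds to the axis. A minimal sketch on a made-up table (toy is hypothetical), passing the Matplotlib scatter options s, alpha and linewidth:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere; not needed in a notebook
import pandas as pd
import seaborn as sns

# Tiny made-up stand-in for the diamonds table (NOT the real data)
toy = pd.DataFrame({
    'carat': [0.3, 0.35, 0.5, 0.55, 0.7, 0.9],
    'colour': ['D', 'E', 'G', 'D', 'E', 'G'],
    'price': [1300, 1400, 2100, 2600, 3400, 4800],
})

# markers is a seaborn option; s, alpha and linewidth are forwarded to Matplotlib
ax = sns.scatterplot(data=toy, x='carat', y='price', style='colour',
                     markers=['o', '*', 'D'], s=150, alpha=0.5, linewidth=2)

# The data points are stored in the first collection of the axis
points = ax.collections[0]
print(points.get_alpha())
```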
Exercise
Create a scatter plot of body_mass_g vs bill_depth_mm using color and markers to distinguish species and sex as shown below. Try different palettes. Also fix the labels and add a title.
penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')