11. Real-world problems and additional topics#
During this course we have either used simple demo examples designed to illustrate specific aspects of Pandas, or real-world data where we ignored some of the potential problems. In this notebook, we address a few of the common problems that you might encounter, as well as some other common situations such as dealing with specific formats like dates.
import pandas as pd
import numpy as np
Absent values#
Very often, data that you need are imperfect: they might have missing values or some values might be “incorrect”, e.g. in a column supposed to contain numbers you suddenly encounter text. Pandas offers many ways to fix these problems.
Let’s once again import our penguin data:
penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
penguins.shape
(344, 8)
penguins.head(5)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
Here we see that some values are missing and filled with `NaN`. As we have seen previously, Pandas is capable of handling `NaN` values, e.g. when calculating statistics. But sometimes we want to remove those values. We can for example use `.dropna()` for this:
penguins_clean = penguins.dropna()
penguins_clean.head(5)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |
We see that for example the index 3 has disappeared: `.dropna()` drops every row that has a `NaN` in any column.
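`.dropna()` is more flexible than this default behavior. As a small sketch using standard pandas options, you can restrict the check to specific columns with `subset`, or only drop rows in which every value is missing with `how='all'`:
# drop rows only if NaN appears in one of the listed columns
penguins.dropna(subset=['bill_length_mm', 'body_mass_g'])
# drop rows only if all values in the row are NaN
penguins.dropna(how='all')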
You can also be more precise and check for `NaN` in a given column:
penguins.bill_length_mm.isna()
0 False
1 False
2 False
3 True
4 False
...
339 False
340 False
341 False
342 False
343 False
Name: bill_length_mm, Length: 344, dtype: bool
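A convenient way to get an overview of missing data in the whole table is to combine `.isna()` with `.sum()`, which counts the `True` values per column (a minimal sketch):
# count the missing values in each column
penguins.isna().sum()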
Using indexing, we can now remove only the rows which have `NaN` in that column:
penguins_clean2 = penguins[~penguins.bill_length_mm.isna()]
penguins_clean2.head(5)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |
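Note that instead of negating `.isna()` with `~`, we can use the equivalent `.notna()` method; the variable name below is just for illustration:
# keep only rows where bill_length_mm is not missing
penguins_clean3 = penguins[penguins.bill_length_mm.notna()]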
Replacing bad values#
When importing a table, Pandas is capable of recognizing specific values as being `NaN`. For example, text such as NaN or empty cells are turned into “official” `NaN` values. But often there are just mistakes in the table. For example:
composer_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'
composers = pd.read_excel(composer_url, sheet_name=1)
composers
composer | birth | death | city | |
---|---|---|---|---|
0 | Mahler | 1860.0 | 1911 | Kaliste |
1 | Beethoven | 1770.0 | 1827 | Bonn |
2 | Puccini | 1858.0 | 1924 | Lucques |
3 | Shostakovich | 1906.0 | 1975 | Saint-Petersburg |
4 | Sibelius | 10.0 | unknown | unknown |
5 | Haydn | NaN | NaN | Röhrau |
Here we see that we have both `NaN` values and other “bad” values in the form of the text unknown. This is a problem because the columns containing that text are not number columns anymore, and for example we can’t do mathematical operations on them:
np.log(composers['death'])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
AttributeError: 'int' object has no attribute 'log'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.log(composers['death'])
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/generic.py:2113, in NDFrame.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
2109 @final
2110 def __array_ufunc__(
2111 self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any
2112 ):
-> 2113 return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/arraylike.py:402, in array_ufunc(self, ufunc, method, *inputs, **kwargs)
399 elif self.ndim == 1:
400 # ufunc(series, ...)
401 inputs = tuple(extract_array(x, extract_numpy=True) for x in inputs)
--> 402 result = getattr(ufunc, method)(*inputs, **kwargs)
403 else:
404 # ufunc(dataframe)
405 if method == "__call__" and not kwargs:
406 # for np.<ufunc>(..) calls
407 # kwargs cannot necessarily be handled block-by-block, so only
408 # take this path if there are no kwargs
TypeError: loop of ufunc does not support argument 0 of type int which has no callable log method
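The underlying issue is the dtype of the column: because of the text entry, Pandas stores the whole column as generic Python objects instead of numbers, which we can verify:
# the mixed numbers and text force a generic object dtype
composers['death'].dtype   # -> dtype('O'), i.e. not numeric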
We therefore first have to clean up the table. One solution is to replace all bad values. For that we can use `.replace`:
composers_clean = composers.replace('unknown', np.nan)
composers_clean
composer | birth | death | city | |
---|---|---|---|---|
0 | Mahler | 1860.0 | 1911.0 | Kaliste |
1 | Beethoven | 1770.0 | 1827.0 | Bonn |
2 | Puccini | 1858.0 | 1924.0 | Lucques |
3 | Shostakovich | 1906.0 | 1975.0 | Saint-Petersburg |
4 | Sibelius | 10.0 | NaN | NaN |
5 | Haydn | NaN | NaN | Röhrau |
np.log(composers_clean['death'])
0 7.555382
1 7.510431
2 7.562162
3 7.588324
4 NaN
5 NaN
Name: death, dtype: float64
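A related technique, not shown above but often handy, is `pd.to_numeric` with `errors='coerce'`, which converts a column to numbers and turns everything it cannot parse into `NaN` in one step:
# 'unknown' cannot be parsed as a number and becomes NaN
pd.to_numeric(composers['death'], errors='coerce')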
Alternatively, we can fix this problem at import time as well, since we might have a list of all “bad entries”. For that we can use the `na_values` option:
pd.read_excel(composer_url, na_values=['unknown'], sheet_name=1)
composer | birth | death | city | |
---|---|---|---|---|
0 | Mahler | 1860.0 | 1911.0 | Kaliste |
1 | Beethoven | 1770.0 | 1827.0 | Bonn |
2 | Puccini | 1858.0 | 1924.0 | Lucques |
3 | Shostakovich | 1906.0 | 1975.0 | Saint-Petersburg |
4 | Sibelius | 10.0 | NaN | NaN |
5 | Haydn | NaN | NaN | Röhrau |
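If we store this result, we can check that the year columns are numeric again (the variable name is just for illustration):
composers_clean2 = pd.read_excel(composer_url, na_values=['unknown'], sheet_name=1)
# birth and death should now be float64 columns
composers_clean2.dtypes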
Dealing with time#
Pandas offers many features to help with data containing time, e.g. to calculate how many days have passed between two dates.
composers = pd.read_excel(composer_url)
composers
composer | birth | death | city | |
---|---|---|---|---|
0 | Mahler | 1860 | 1911 | Kaliste |
1 | Beethoven | 1770 | 1827 | Bonn |
2 | Puccini | 1858 | 1924 | Lucques |
3 | Shostakovich | 1906 | 1975 | Saint-Petersburg |
We now transform the `birth` and `death` columns into `datetime` objects using `pd.to_datetime` and by specifying a format, here just a year `%Y`:
composers['birth'] = pd.to_datetime(composers['birth'], format = '%Y')
composers['death'] = pd.to_datetime(composers['death'], format = '%Y')
composers
composer | birth | death | city | |
---|---|---|---|---|
0 | Mahler | 1860-01-01 | 1911-01-01 | Kaliste |
1 | Beethoven | 1770-01-01 | 1827-01-01 | Bonn |
2 | Puccini | 1858-01-01 | 1924-01-01 | Lucques |
3 | Shostakovich | 1906-01-01 | 1975-01-01 | Saint-Petersburg |
You can see that the year has been turned into an actual date (using 1st January as the default). We now create an age column:
composers['age'] = (composers.death - composers.birth)
composers
composer | birth | death | city | age | |
---|---|---|---|---|---|
0 | Mahler | 1860-01-01 | 1911-01-01 | Kaliste | 18627 days |
1 | Beethoven | 1770-01-01 | 1827-01-01 | Bonn | 20818 days |
2 | Puccini | 1858-01-01 | 1924-01-01 | Lucques | 24105 days |
3 | Shostakovich | 1906-01-01 | 1975-01-01 | Saint-Petersburg | 25202 days |
Since the columns are `datetime` objects, Pandas knows that it should compute the difference between the dates, and by default it gives the result in days. Here the computation is trivial, but if you deal with formats such as `HH:MM:SS` this is very useful.
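Since the age column contains `Timedelta` values, we can use the `.dt` accessor on it, for example to convert the ages to approximate years (the 365.25 divisor and the new column name are just an illustration):
# convert each Timedelta to days and then to approximate years
composers['age_years'] = composers['age'].dt.days / 365.25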