11. Real-world problems and additional topics#

During this course we have either used simple demo examples designed to illustrate specific aspects of Pandas, or real-world data where we ignored some of the potential problems. In this notebook, we address a few of the common problems that you might encounter, as well as some other frequent situations such as dealing with specific formats like dates.

import pandas as pd
import numpy as np

Missing values#

Very often, the data that you need are imperfect: they might have missing values, or some values might be “incorrect”, e.g. in a column supposed to contain numbers you suddenly encounter text. Pandas offers many ways to fix these problems.

Let’s once again import our penguin data:

penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
penguins.shape
(344, 8)
penguins.head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007

Here we see that some values are missing and filled with NaN. As we have seen previously, Pandas is capable of handling NaN values e.g. when calculating statistics. But sometimes we want to remove those values. We can for example use .dropna() for this:

penguins_clean = penguins.dropna()
penguins_clean.head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007

We see that, for example, index 3 has disappeared: .dropna() drops every row that has a NaN in any column.
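To see this behaviour in isolation, here is a minimal sketch with a small, made-up DataFrame (the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Small hypothetical table with NaNs scattered across both columns
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                   'b': [5.0, 6.0, np.nan, 8.0]})

# .dropna() keeps only the rows that are complete in every column:
# rows 1 (NaN in 'a') and 2 (NaN in 'b') are both removed
print(df.dropna().shape)  # (2, 2)
```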

You can also be more precise and check for NaN in a given column:

penguins.bill_length_mm.isna()
0      False
1      False
2      False
3       True
4      False
       ...  
339    False
340    False
341    False
342    False
343    False
Name: bill_length_mm, Length: 344, dtype: bool
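The boolean Series returned by .isna() is useful beyond indexing: since True sums as 1, .isna().sum() counts the missing values. A tiny self-contained sketch (the numbers here are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([39.1, np.nan, 40.3, np.nan, 36.7])
# True counts as 1, so summing the mask counts the NaNs
print(s.isna().sum())  # 2
```

On a whole DataFrame, `df.isna().sum()` gives the NaN count per column.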

Using indexing, we can now remove only the rows that have a NaN in that column:

penguins_clean2 = penguins[~penguins.bill_length_mm.isna()]
penguins_clean2.head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007
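The same result can be obtained more compactly with the subset argument of .dropna(), which restricts the NaN check to the listed columns; a sketch with a made-up two-column table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'bill': [39.1, np.nan, 40.3],
                   'mass': [3750.0, 3800.0, np.nan]})

# Drop rows only when 'bill' is NaN; the NaN in 'mass' survives
df_clean = df.dropna(subset=['bill'])
print(df_clean)
```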

Replacing bad values#

When importing a table, Pandas is capable of recognizing specific values as being NaN. For example, the text NaN or empty cells are turned into “official” NaN values. But often there are just mistakes in the table. For example:

composer_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'
composers = pd.read_excel(composer_url, sheet_name=1)
composers
composer birth death city
0 Mahler 1860.0 1911 Kaliste
1 Beethoven 1770.0 1827 Bonn
2 Puccini 1858.0 1924 Lucques
3 Shostakovich 1906.0 1975 Saint-Petersburg
4 Sibelius 10.0 unknown unknown
5 Haydn NaN NaN Röhrau

Here we see that we have both NaN values and other “bad” values in the form of the text unknown. This is a problem because the columns containing that text are no longer numeric, so we cannot, for example, apply mathematical operations to them:

np.log(composers['death'])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
AttributeError: 'int' object has no attribute 'log'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.log(composers['death'])

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/generic.py:2113, in NDFrame.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   2109 @final
   2110 def __array_ufunc__(
   2111     self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any
   2112 ):
-> 2113     return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/arraylike.py:402, in array_ufunc(self, ufunc, method, *inputs, **kwargs)
    399 elif self.ndim == 1:
    400     # ufunc(series, ...)
    401     inputs = tuple(extract_array(x, extract_numpy=True) for x in inputs)
--> 402     result = getattr(ufunc, method)(*inputs, **kwargs)
    403 else:
    404     # ufunc(dataframe)
    405     if method == "__call__" and not kwargs:
    406         # for np.<ufunc>(..) calls
    407         # kwargs cannot necessarily be handled block-by-block, so only
    408         # take this path if there are no kwargs

TypeError: loop of ufunc does not support argument 0 of type int which has no callable log method

We therefore first have to clean up the table. One solution is to replace all bad values. For that we can use .replace:

composers_clean = composers.replace('unknown', np.nan)
composers_clean
composer birth death city
0 Mahler 1860.0 1911.0 Kaliste
1 Beethoven 1770.0 1827.0 Bonn
2 Puccini 1858.0 1924.0 Lucques
3 Shostakovich 1906.0 1975.0 Saint-Petersburg
4 Sibelius 10.0 NaN NaN
5 Haydn NaN NaN Röhrau
np.log(composers_clean['death'])
0    7.555382
1    7.510431
2    7.562162
3    7.588324
4         NaN
5         NaN
Name: death, dtype: float64
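Another common fix, when a numeric column is polluted by arbitrary text, is pd.to_numeric with errors='coerce', which turns anything unparseable into NaN in one step; a sketch on a made-up column mimicking the death column above:

```python
import pandas as pd

death = pd.Series([1911, 1827, 1924, 1975, 'unknown'])

# errors='coerce' converts unparseable entries to NaN,
# and the column becomes a proper float64 numeric column
death_num = pd.to_numeric(death, errors='coerce')
print(death_num)
```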

Alternatively, we can fix this problem at import time, since we might have a list of all “bad” entries. For that we can use the na_values option:

pd.read_excel(composer_url, na_values=['unknown'], sheet_name=1)
composer birth death city
0 Mahler 1860.0 1911.0 Kaliste
1 Beethoven 1770.0 1827.0 Bonn
2 Puccini 1858.0 1924.0 Lucques
3 Shostakovich 1906.0 1975.0 Saint-Petersburg
4 Sibelius 10.0 NaN NaN
5 Haydn NaN NaN Röhrau

Dealing with time#

Pandas offers many tools for dealing with time data, e.g. to calculate how many days have passed between two dates.

composers = pd.read_excel(composer_url)
composers
composer birth death city
0 Mahler 1860 1911 Kaliste
1 Beethoven 1770 1827 Bonn
2 Puccini 1858 1924 Lucques
3 Shostakovich 1906 1975 Saint-Petersburg

We now transform the birth and death columns into datetime objects using pd.to_datetime, specifying a format, here just a year %Y:

composers['birth'] = pd.to_datetime(composers['birth'], format = '%Y')
composers['death'] = pd.to_datetime(composers['death'], format = '%Y')
composers
composer birth death city
0 Mahler 1860-01-01 1911-01-01 Kaliste
1 Beethoven 1770-01-01 1827-01-01 Bonn
2 Puccini 1858-01-01 1924-01-01 Lucques
3 Shostakovich 1906-01-01 1975-01-01 Saint-Petersburg
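Once a column has the datetime64 dtype, the .dt accessor exposes its components (year, month, day of week, etc.). For example, to recover just the year from dates built as above:

```python
import pandas as pd

# Same years as in the composers table, parsed as dates
birth = pd.to_datetime(pd.Series([1860, 1770, 1858, 1906]), format='%Y')

# .dt.year extracts the year component as plain integers
print(birth.dt.year)
```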

You can see that the year has been turned into an actual date (using 1 January as the default). We now create an age column:

composers['age'] = (composers.death - composers.birth)
composers
composer birth death city age
0 Mahler 1860-01-01 1911-01-01 Kaliste 18627 days
1 Beethoven 1770-01-01 1827-01-01 Bonn 20818 days
2 Puccini 1858-01-01 1924-01-01 Lucques 24105 days
3 Shostakovich 1906-01-01 1975-01-01 Saint-Petersburg 25202 days

Since the columns are datetime objects, Pandas knows that it should compute a difference between dates, and by default it gives the result in days. Here the computation is trivial, but if you deal with formats such as HH:MM:SS this is very useful.
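The age column holds Timedelta values, not numbers. To convert them, timedelta columns have their own .dt accessor: .dt.days gives the number of days as an integer, which can then be turned into an approximate age in years (365.25 accounts for leap years); a sketch on the first two composers:

```python
import pandas as pd

birth = pd.to_datetime(pd.Series([1860, 1770]), format='%Y')
death = pd.to_datetime(pd.Series([1911, 1827]), format='%Y')

age = death - birth              # timedelta64 column
years = age.dt.days / 365.25     # approximate age in years
print(years)
```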