8. Operating on DataFrames#

We have seen in the very first chapter that we could easily import CSV or Excel sheets as DataFrames in Python. We have also seen that those dataframes are essentially two-dimensional tables where each element can be located via an index and a column name. We have also seen that each column is in fact a Numpy array and have learned how to handle and process those arrays.

Here we will show how most of the concepts we learned for Numpy are directly translateable into Pandas.

import pandas as pd
import numpy as np

We here again use our composer data:

composers = pd.read_excel('https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true')
composers
composer birth death city
0 Mahler 1860 1911 Kaliste
1 Beethoven 1770 1827 Bonn
2 Puccini 1858 1924 Lucques
3 Shostakovich 1906 1975 Saint-Petersburg

Just like with Numpy arrays, we can apply various mathematical operations on DataFrames columns or entire DataFrames.

Arithmetics single columns#

Just like we could do an operation like my_array = my_array + 1, we can do the same with Pandas columns. Doing such operations, we directly reaplace an existing column or create new one like here:

composers['death_100'] = composers['death'] + 100
composers
composer birth death city death_100
0 Mahler 1860 1911 Kaliste 2011
1 Beethoven 1770 1827 Bonn 1927
2 Puccini 1858 1924 Lucques 2024
3 Shostakovich 1906 1975 Saint-Petersburg 2075

Arithmetics multiple columns#

We can also do maths with multiple columns as we have previously done with arrays. Just like with arrays, operations are performed element-wise. We can for example compute the composers’ age at death:

composers['death']-composers['birth']
0    51
1    57
2    66
3    69
dtype: int64

Note that the output here is a series with the correct length, so we can simply add it to our original table directly!

composers['age'] = composers['death']-composers['birth']
composers
composer birth death city death_100 age
0 Mahler 1860 1911 Kaliste 2011 51
1 Beethoven 1770 1827 Bonn 1927 57
2 Puccini 1858 1924 Lucques 2024 66
3 Shostakovich 1906 1975 Saint-Petersburg 2075 69

Applying functions#

Again, just like we could previously apply a function to an array and recover an array of the same size where the function had been applied element-wise we can do that same with a DataFrame column and recover a column. And we can simply use Numpy functions!

np.log10(composers['birth'])
0    3.269513
1    3.247973
2    3.269046
3    3.280123
Name: birth, dtype: float64

Applying array wide#

Not that you should be careful if you want to apply a function to more than one column. For example if we want to apply the above operation to the entire array:

np.log(composers)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
AttributeError: 'str' object has no attribute 'log'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 np.log(composers)

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/generic.py:2113, in NDFrame.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   2109 @final
   2110 def __array_ufunc__(
   2111     self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any
   2112 ):
-> 2113     return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/arraylike.py:410, in array_ufunc(self, ufunc, method, *inputs, **kwargs)
    405 if method == "__call__" and not kwargs:
    406     # for np.<ufunc>(..) calls
    407     # kwargs cannot necessarily be handled block-by-block, so only
    408     # take this path if there are no kwargs
    409     mgr = inputs[0]._mgr
--> 410     result = mgr.apply(getattr(ufunc, method))
    411 else:
    412     # otherwise specific ufunc methods (eg np.<ufunc>.accumulate(..))
    413     # Those can have an axis keyword and thus can't be called block-by-block
    414     result = default_array_ufunc(inputs[0], ufunc, method, *inputs, **kwargs)

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/internals/managers.py:350, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    348 try:
    349     if callable(f):
--> 350         applied = b.apply(f, **kwargs)
    351     else:
    352         applied = getattr(b, f)(**kwargs)

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/internals/blocks.py:351, in Block.apply(self, func, **kwargs)
    345 @final
    346 def apply(self, func, **kwargs) -> list[Block]:
    347     """
    348     apply the function to my values; return a block if we are not
    349     one
    350     """
--> 351     result = func(self.values, **kwargs)
    353     return self._split_op_result(result)

TypeError: loop of ufunc does not support argument 0 of type str which has no callable log method

We get an error because some columns are not numerical and therefore Pandas doesn’t know what to do. If we only operate on numerical columns this however works:

sub_df = composers[['birth', 'death']]
sub_df
birth death
0 1860 1911
1 1770 1827
2 1858 1924
3 1906 1975
np.log10(sub_df)
birth death
0 3.269513 3.281261
1 3.247973 3.261739
2 3.269046 3.284205
3 3.280123 3.295567

The apply function#

There is also a more general way of applying custom functions using the apply() function. There are many ways of using this function: on the full DataFrame, on single columns, with a single output value, a series etc. Here we look at the simplest case, where we apply a given function to each element of a given column, thus generating a new modified column.

First we need to define a custom function. Here we want a function to tell us whether a composer is born before or after 1900 by looking at the birth column. So we define a standard function that takes a year as input and returns True or False:

def age_fun(age):
    if age < 1900:
        return False
    else:
        return True

Let’s test it:

age_fun(1700)
False
age_fun(1950)
True

Now we simply pass this function to the apply method on a specific column:

composers['birth'].apply(age_fun)
0    False
1    False
2    False
3     True
Name: birth, dtype: bool

Again we can directly create a new columns with this output:

composers['20th century'] = composers['birth'].apply(age_fun)
composers
composer birth death city death_100 age 20th century
0 Mahler 1860 1911 Kaliste 2011 51 False
1 Beethoven 1770 1827 Bonn 1927 57 False
2 Puccini 1858 1924 Lucques 2024 66 False
3 Shostakovich 1906 1975 Saint-Petersburg 2075 69 True

Summarizing functions#

Finally, we can summarized columns by using either Numpy function or the methods attached to a Dataframe or a column. For example we can take the mean:

np.mean(composers['birth'])
1848.5
composers['birth'].mean()
1848.5

Exercise#

  1. Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv

  2. Create a new columns called body_mass_kg by transforming the body_mass_g column into kilograms.

  3. Create a new column containiing the division the bill_length_mm by the bill_depth_mm column

  4. Calculate the median value of the body_mass_g column both by using a Numpy function and by using the method attached to the column. What do you observe? Can you infer what the problem is?