8. Operating on DataFrames

8. Operating on DataFrames#

We have seen in the very first chapter that we could easily import CSV or Excel sheets as DataFrames in Python. We have also seen that those dataframes are essentially two-dimensional tables where each element can be located via an index and a column name. We have also seen that each column is in fact a Numpy array and have learned how to handle and process those arrays.

Here we will show how most of the concepts we learned for Numpy are directly translateable into Pandas.

import pandas as pd
import numpy as np

We here again use our composer data:

composers = pd.read_excel('https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true')

composers

	composer	birth	death	city
0	Mahler	1860	1911	Kaliste
1	Beethoven	1770	1827	Bonn
2	Puccini	1858	1924	Lucques
3	Shostakovich	1906	1975	Saint-Petersburg

Just like with Numpy arrays, we can apply various mathematical operations on DataFrames columns or entire DataFrames.

Arithmetics single columns#

Just like we could do an operation like my_array = my_array + 1, we can do the same with Pandas columns. Doing such operations, we directly reaplace an existing column or create new one like here:

composers['death_100'] = composers['death'] + 100
composers

	composer	birth	death	city	death_100
0	Mahler	1860	1911	Kaliste	2011
1	Beethoven	1770	1827	Bonn	1927
2	Puccini	1858	1924	Lucques	2024
3	Shostakovich	1906	1975	Saint-Petersburg	2075

Arithmetics multiple columns#

We can also do maths with multiple columns as we have previously done with arrays. Just like with arrays, operations are performed element-wise. We can for example compute the composers’ age at death:

composers['death']-composers['birth']

  51
  57
  66
  69
dtype: int64

Note that the output here is a series with the correct length, so we can simply add it to our original table directly!

composers['age'] = composers['death']-composers['birth']
composers

	composer	birth	death	city	death_100	age
0	Mahler	1860	1911	Kaliste	2011	51
1	Beethoven	1770	1827	Bonn	1927	57
2	Puccini	1858	1924	Lucques	2024	66
3	Shostakovich	1906	1975	Saint-Petersburg	2075	69

Applying functions#

Again, just like we could previously apply a function to an array and recover an array of the same size where the function had been applied element-wise we can do that same with a DataFrame column and recover a column. And we can simply use Numpy functions!

np.log10(composers['birth'])

  3.269513
  3.247973
  3.269046
  3.280123
Name: birth, dtype: float64

Applying array wide#

Not that you should be careful if you want to apply a function to more than one column. For example if we want to apply the above operation to the entire array:

np.log(composers)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
AttributeError: 'str' object has no attribute 'log'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 np.log(composers)

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/generic.py:2113, in NDFrame.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   2109 @final
   2110 def __array_ufunc__(
   2111     self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any
   2112 ):
-> 2113     return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/arraylike.py:410, in array_ufunc(self, ufunc, method, *inputs, **kwargs)
    405 if method == "__call__" and not kwargs:
    406     # for np.<ufunc>(..) calls
    407     # kwargs cannot necessarily be handled block-by-block, so only
    408     # take this path if there are no kwargs
    409     mgr = inputs[0]._mgr
--> 410     result = mgr.apply(getattr(ufunc, method))
    411 else:
    412     # otherwise specific ufunc methods (eg np.<ufunc>.accumulate(..))
    413     # Those can have an axis keyword and thus can't be called block-by-block
    414     result = default_array_ufunc(inputs[0], ufunc, method, *inputs, **kwargs)

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/internals/managers.py:350, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    348 try:
    349     if callable(f):
--> 350         applied = b.apply(f, **kwargs)
    351     else:
    352         applied = getattr(b, f)(**kwargs)

File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/internals/blocks.py:351, in Block.apply(self, func, **kwargs)
    345 @final
    346 def apply(self, func, **kwargs) -> list[Block]:
    347     """
    348     apply the function to my values; return a block if we are not
    349     one
    350     """
--> 351     result = func(self.values, **kwargs)
    353     return self._split_op_result(result)

TypeError: loop of ufunc does not support argument 0 of type str which has no callable log method

We get an error because some columns are not numerical and therefore Pandas doesn’t know what to do. If we only operate on numerical columns this however works:

sub_df = composers[['birth', 'death']]
sub_df

	birth	death
0	1860	1911
1	1770	1827
2	1858	1924
3	1906	1975

np.log10(sub_df)

	birth	death
0	3.269513	3.281261
1	3.247973	3.261739
2	3.269046	3.284205
3	3.280123	3.295567

The `apply` function#

There is also a more general way of applying custom functions using the apply() function. There are many ways of using this function: on the full DataFrame, on single columns, with a single output value, a series etc. Here we look at the simplest case, where we apply a given function to each element of a given column, thus generating a new modified column.

First we need to define a custom function. Here we want a function to tell us whether a composer is born before or after 1900 by looking at the birth column. So we define a standard function that takes a year as input and returns True or False:

def age_fun(age):
    if age < 1900:
        return False
    else:
        return True

Let’s test it:

age_fun(1700)

False

age_fun(1950)

True

Now we simply pass this function to the apply method on a specific column:

composers['birth'].apply(age_fun)

  False
  False
  False
   True
Name: birth, dtype: bool

Again we can directly create a new columns with this output:

composers['20th century'] = composers['birth'].apply(age_fun)
composers

	composer	birth	death	city	death_100	age	20th century
0	Mahler	1860	1911	Kaliste	2011	51	False
1	Beethoven	1770	1827	Bonn	1927	57	False
2	Puccini	1858	1924	Lucques	2024	66	False
3	Shostakovich	1906	1975	Saint-Petersburg	2075	69	True

Summarizing functions#

Finally, we can summarized columns by using either Numpy function or the methods attached to a Dataframe or a column. For example we can take the mean:

np.mean(composers['birth'])

1848.5

composers['birth'].mean()

1848.5

Exercise#

Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
Create a new columns called body_mass_kg by transforming the body_mass_g column into kilograms.
Create a new column containiing the division the bill_length_mm by the bill_depth_mm column
Calculate the median value of the body_mass_g column both by using a Numpy function and by using the method attached to the column. What do you observe? Can you infer what the problem is?