8. Operating on DataFrames#
We have seen in the very first chapter that we could easily import CSV or Excel sheets as DataFrames in Python. We have also seen that those dataframes are essentially two-dimensional tables where each element can be located via an index and a column name. We have also seen that each column is in fact a Numpy array and have learned how to handle and process those arrays.
Here we will show how most of the concepts we learned for Numpy are directly translateable into Pandas.
import pandas as pd
import numpy as np
We here again use our composer data:
composers = pd.read_excel('https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true')
composers
composer | birth | death | city | |
---|---|---|---|---|
0 | Mahler | 1860 | 1911 | Kaliste |
1 | Beethoven | 1770 | 1827 | Bonn |
2 | Puccini | 1858 | 1924 | Lucques |
3 | Shostakovich | 1906 | 1975 | Saint-Petersburg |
Just like with Numpy arrays, we can apply various mathematical operations on DataFrames columns or entire DataFrames.
Arithmetics single columns#
Just like we could do an operation like my_array = my_array + 1
, we can do the same with Pandas columns. Doing such operations, we directly reaplace an existing column or create new one like here:
composers['death_100'] = composers['death'] + 100
composers
composer | birth | death | city | death_100 | |
---|---|---|---|---|---|
0 | Mahler | 1860 | 1911 | Kaliste | 2011 |
1 | Beethoven | 1770 | 1827 | Bonn | 1927 |
2 | Puccini | 1858 | 1924 | Lucques | 2024 |
3 | Shostakovich | 1906 | 1975 | Saint-Petersburg | 2075 |
Arithmetics multiple columns#
We can also do maths with multiple columns as we have previously done with arrays. Just like with arrays, operations are performed element-wise. We can for example compute the composers’ age at death:
composers['death']-composers['birth']
0 51
1 57
2 66
3 69
dtype: int64
Note that the output here is a series with the correct length, so we can simply add it to our original table directly!
composers['age'] = composers['death']-composers['birth']
composers
composer | birth | death | city | death_100 | age | |
---|---|---|---|---|---|---|
0 | Mahler | 1860 | 1911 | Kaliste | 2011 | 51 |
1 | Beethoven | 1770 | 1827 | Bonn | 1927 | 57 |
2 | Puccini | 1858 | 1924 | Lucques | 2024 | 66 |
3 | Shostakovich | 1906 | 1975 | Saint-Petersburg | 2075 | 69 |
Applying functions#
Again, just like we could previously apply a function to an array and recover an array of the same size where the function had been applied element-wise we can do that same with a DataFrame column and recover a column. And we can simply use Numpy functions!
np.log10(composers['birth'])
0 3.269513
1 3.247973
2 3.269046
3 3.280123
Name: birth, dtype: float64
Applying array wide#
Not that you should be careful if you want to apply a function to more than one column. For example if we want to apply the above operation to the entire array:
np.log(composers)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
AttributeError: 'str' object has no attribute 'log'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
Cell In[8], line 1
----> 1 np.log(composers)
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/generic.py:2113, in NDFrame.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
2109 @final
2110 def __array_ufunc__(
2111 self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any
2112 ):
-> 2113 return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/arraylike.py:410, in array_ufunc(self, ufunc, method, *inputs, **kwargs)
405 if method == "__call__" and not kwargs:
406 # for np.<ufunc>(..) calls
407 # kwargs cannot necessarily be handled block-by-block, so only
408 # take this path if there are no kwargs
409 mgr = inputs[0]._mgr
--> 410 result = mgr.apply(getattr(ufunc, method))
411 else:
412 # otherwise specific ufunc methods (eg np.<ufunc>.accumulate(..))
413 # Those can have an axis keyword and thus can't be called block-by-block
414 result = default_array_ufunc(inputs[0], ufunc, method, *inputs, **kwargs)
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/internals/managers.py:350, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
348 try:
349 if callable(f):
--> 350 applied = b.apply(f, **kwargs)
351 else:
352 applied = getattr(b, f)(**kwargs)
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/internals/blocks.py:351, in Block.apply(self, func, **kwargs)
345 @final
346 def apply(self, func, **kwargs) -> list[Block]:
347 """
348 apply the function to my values; return a block if we are not
349 one
350 """
--> 351 result = func(self.values, **kwargs)
353 return self._split_op_result(result)
TypeError: loop of ufunc does not support argument 0 of type str which has no callable log method
We get an error because some columns are not numerical and therefore Pandas doesn’t know what to do. If we only operate on numerical columns this however works:
sub_df = composers[['birth', 'death']]
sub_df
birth | death | |
---|---|---|
0 | 1860 | 1911 |
1 | 1770 | 1827 |
2 | 1858 | 1924 |
3 | 1906 | 1975 |
np.log10(sub_df)
birth | death | |
---|---|---|
0 | 3.269513 | 3.281261 |
1 | 3.247973 | 3.261739 |
2 | 3.269046 | 3.284205 |
3 | 3.280123 | 3.295567 |
The apply
function#
There is also a more general way of applying custom functions using the apply()
function. There are many ways of using this function: on the full DataFrame, on single columns, with a single output value, a series etc. Here we look at the simplest case, where we apply a given function to each element of a given column, thus generating a new modified column.
First we need to define a custom function. Here we want a function to tell us whether a composer is born before or after 1900 by looking at the birth
column. So we define a standard function that takes a year as input and returns True or False:
def age_fun(age):
if age < 1900:
return False
else:
return True
Let’s test it:
age_fun(1700)
False
age_fun(1950)
True
Now we simply pass this function to the apply
method on a specific column:
composers['birth'].apply(age_fun)
0 False
1 False
2 False
3 True
Name: birth, dtype: bool
Again we can directly create a new columns with this output:
composers['20th century'] = composers['birth'].apply(age_fun)
composers
composer | birth | death | city | death_100 | age | 20th century | |
---|---|---|---|---|---|---|---|
0 | Mahler | 1860 | 1911 | Kaliste | 2011 | 51 | False |
1 | Beethoven | 1770 | 1827 | Bonn | 1927 | 57 | False |
2 | Puccini | 1858 | 1924 | Lucques | 2024 | 66 | False |
3 | Shostakovich | 1906 | 1975 | Saint-Petersburg | 2075 | 69 | True |
Summarizing functions#
Finally, we can summarized columns by using either Numpy function or the methods attached to a Dataframe or a column. For example we can take the mean:
np.mean(composers['birth'])
1848.5
composers['birth'].mean()
1848.5
Exercise#
Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
Create a new columns called
body_mass_kg
by transforming thebody_mass_g
column into kilograms.Create a new column containiing the division the
bill_length_mm
by thebill_depth_mm
columnCalculate the median value of the
body_mass_g
column both by using a Numpy function and by using the method attached to the column. What do you observe? Can you infer what the problem is?