9. DataFrame indexing#
We have seen two ways of indexing in Numpy: with numerical indices and with boolean arrays. We will see that a very similar approach applies to DataFrame indexing. However, we have one additional component that we didn’t have before: while elements in a Numpy array can only be located by their position (numerical indices), elements of a DataFrame can additionally be located by their column name and index (which is not necessarily numerical).
To understand this, we have to explore the “anatomy” of a DataFrame a bit further. Let’s first load one:
import pandas as pd
import numpy as np
composers = pd.read_excel('https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true')
composers
|   | composer | birth | death | city |
|---|---|---|---|---|
| 0 | Mahler | 1860 | 1911 | Kaliste | 
| 1 | Beethoven | 1770 | 1827 | Bonn | 
| 2 | Puccini | 1858 | 1924 | Lucques | 
| 3 | Shostakovich | 1906 | 1975 | Saint-Petersburg | 
Accessing columns#
We have already seen how to access a single column, e.g.:
composers['birth']
0    1860
1    1770
2    1858
3    1906
Name: birth, dtype: int64
We can also access multiple columns by specifying a list of those that we need:
composers[['birth', 'city']]
|   | birth | city |
|---|---|---|
| 0 | 1860 | Kaliste | 
| 1 | 1770 | Bonn | 
| 2 | 1858 | Lucques | 
| 3 | 1906 | Saint-Petersburg | 
As you can see, as soon as we have more than one column, the returned object is a DataFrame.
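A quick way to verify this is to check the type of the returned object (a small sketch; the comments describe the types we expect from the calls above):

type(composers['birth'])     # a single column is a pandas Series
type(composers[['birth']])   # a list of columns, even with a single element, gives a DataFrame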
Accessing indices#
We have seen an example with Numpy where we accessed elements in a 2D array using a pair of indices e.g.:
my_array = np.random.normal(size=(3,5))
my_array
array([[-0.27943032,  0.53917334,  0.04905016, -0.29523685, -0.32803448],
       [ 1.29974059, -0.87046224,  0.56457039, -0.16125738,  1.90088289],
       [ 1.18116165,  0.95418676,  0.47466928,  1.12268135, -0.74403994]])
my_array[2,1]
0.9541867643923373
Even though our DataFrame is a two-dimensional object, we can’t access its elements in the same way. For example, we cannot recover the top left element (Mahler) by using:
composers[0,0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: (0, 0)
The above exception was the direct cause of the following exception:
KeyError                                  Traceback (most recent call last)
Cell In[8], line 1
----> 1 composers[0,0]
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)
KeyError: (0, 0)
As Pandas has column names and an index (the bold numbers on the left of the DataFrame), there are specific functions to access elements either by using those labels or directly by numerical indexing.
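These two components can be inspected directly through the columns and index attributes (a quick sketch; the comments show the values we expect for the DataFrame loaded above):

composers.columns   # Index(['composer', 'birth', 'death', 'city'], dtype='object')
composers.index     # RangeIndex(start=0, stop=4, step=1), the default integer index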
loc#
The .loc[index, name] method allows us to access a specific element situated at a specific index (row) and name (column). For example:
composers.loc[3, 'city']
'Saint-Petersburg'
We can also recover all the items of a given index by only specifying the latter:
composers.loc[3]
composer        Shostakovich
birth                   1906
death                   1975
city        Saint-Petersburg
Name: 3, dtype: object
We see that this returns a simple Series with all the items defined in the DataFrame. Just like with the columns, we can also pass a list of indices to recover more than one row, in which case we get back a DataFrame:
composers.loc[[1,3], ['city','birth']]
|   | city | birth |
|---|---|---|
| 1 | Bonn | 1770 | 
| 3 | Saint-Petersburg | 1906 | 
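As a side note, .loc also accepts slices of labels. Unlike the positional slicing we know from Numpy, both endpoints of a label slice are included (a small sketch for illustration):

composers.loc[1:3, 'birth':'city']   # rows labelled 1, 2 and 3, columns 'birth' through 'city' (both included)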
Note that the index does not have to be a sequence of integers. For example, we can replace the index with a list of strings:
composers.index = np.array(['a','b','c','d'])
composers
|   | composer | birth | death | city |
|---|---|---|---|---|
| a | Mahler | 1860 | 1911 | Kaliste | 
| b | Beethoven | 1770 | 1827 | Bonn | 
| c | Puccini | 1858 | 1924 | Lucques | 
| d | Shostakovich | 1906 | 1975 | Saint-Petersburg | 
Here we can still use the .loc method even though we are not dealing with integers anymore:
composers.loc[['a','c']]
|   | composer | birth | death | city |
|---|---|---|---|---|
| a | Mahler | 1860 | 1911 | Kaliste | 
| c | Puccini | 1858 | 1924 | Lucques | 
iloc#
The alternative to the .loc[index, name] method is the .iloc[row, column] method. This method is closer to the Numpy approach, as here we use the actual position in the DataFrame to recover elements. For example, to recover the city of Puccini, we need the third row (row index = 2) and the fourth column (column index = 3):
composers
|   | composer | birth | death | city |
|---|---|---|---|---|
| a | Mahler | 1860 | 1911 | Kaliste | 
| b | Beethoven | 1770 | 1827 | Bonn | 
| c | Puccini | 1858 | 1924 | Lucques | 
| d | Shostakovich | 1906 | 1975 | Saint-Petersburg | 
composers.iloc[2,3]
'Lucques'
With iloc we can now use the same indexing approach as seen with Numpy arrays, i.e. we can select a range of values:
composers.iloc[0:2,2:4]
|   | death | city |
|---|---|---|
| a | 1911 | Kaliste | 
| b | 1827 | Bonn | 
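Just like in Numpy, .iloc also accepts negative positions counted from the end of the DataFrame (a small sketch based on the table above):

composers.iloc[-1, 0]   # last row, first column, i.e. 'Shostakovich'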
Logical indexing#
We have seen before that we can create a boolean array by using a comparison, for example:
my_array
array([[-0.27943032,  0.53917334,  0.04905016, -0.29523685, -0.32803448],
       [ 1.29974059, -0.87046224,  0.56457039, -0.16125738,  1.90088289],
       [ 1.18116165,  0.95418676,  0.47466928,  1.12268135, -0.74403994]])
my_array > 0
array([[False,  True,  True, False, False],
       [ True, False,  True, False,  True],
       [ True,  True,  True,  True, False]])
and then use it to extract the True elements in another array:
my_array[my_array > 0]
array([0.53917334, 0.04905016, 1.29974059, 0.56457039, 1.90088289,
       1.18116165, 0.95418676, 0.47466928, 1.12268135])
We can apply the same logic to Pandas DataFrames. For example, let’s select composers by date of birth and see what happens:
composers
|   | composer | birth | death | city |
|---|---|---|---|---|
| a | Mahler | 1860 | 1911 | Kaliste | 
| b | Beethoven | 1770 | 1827 | Bonn | 
| c | Puccini | 1858 | 1924 | Lucques | 
| d | Shostakovich | 1906 | 1975 | Saint-Petersburg | 
composers['birth'] > 1859
a     True
b    False
c    False
d     True
Name: birth, dtype: bool
Just like we obtained a boolean array previously, we obtain here a boolean Series. And here as well we can use it to extract elements from the DataFrame. This works again with simple brackets:
composers[composers['birth'] > 1859]
|   | composer | birth | death | city |
|---|---|---|---|---|
| a | Mahler | 1860 | 1911 | Kaliste | 
| d | Shostakovich | 1906 | 1975 | Saint-Petersburg | 
It also works with the .loc method, which additionally lets us select specific columns:
composers.loc[composers['birth'] > 1859, 'city']
a             Kaliste
d    Saint-Petersburg
Name: city, dtype: object
This is extremely useful for extracting only specific rows from a large table when some values should be discarded.
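As a side note, several conditions can be combined with the element-wise operators & (and), | (or) and ~ (not), wrapping each condition in parentheses. A small sketch using the data above:

composers[(composers['birth'] > 1800) & (composers['death'] < 1930)]   # composers born after 1800 who also died before 1930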
Exercises#
- Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv 
- Create a new dataframe `new_dataframe` by extracting the `species`, `bill_length_mm` and `body_mass_g` columns.
- Extract the row with index 4 of `new_dataframe`
- Extract the `bill_length_mm` of the first 3 rows of `new_dataframe`
- Extract all rows for which `body_mass_g` > 6000