7. Back to Pandas#

We started this course by importing a DataFrame, briefly exploring it, and discovering that Numpy arrays underlie these structures. We then explored those Numpy arrays in more detail. Now we come back to DataFrames and will see that many concepts from Numpy can be re-used in Pandas.

import pandas as pd
import numpy as np

Creating DataFrames#

We have seen that we can extract Numpy arrays from a DataFrame, but we can also do the reverse: create a DataFrame from an array. While most of the time you will import data as DataFrames, it is sometimes useful to create a DataFrame explicitly.

From a Numpy array#

We know how to create a 2D Numpy array, for example:

my_array = np.random.randint(0, 100, (3, 5))
my_array
array([[69,  0,  1, 91, 15],
       [68, 89, 69, 58, 50],
       [28, 94, 55, 40, 63]])

We can simply transform this array into a DataFrame using the pd.DataFrame function:

my_df = pd.DataFrame(my_array)
my_df
    0   1   2   3   4
0  69   0   1  91  15
1  68  89  69  58  50
2  28  94  55  40  63

We will see that many of the operations we previously used on Numpy arrays also work on Pandas DataFrames. As a start, you can for example get the shape of a DataFrame:

my_df.shape
(3, 5)
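Many other Numpy-style operations carry over as well. As a small sketch (not part of the outputs above), arithmetic and aggregations work directly on the DataFrame we just created:

# element-wise arithmetic, as with a Numpy array
my_df * 2
# column-wise aggregation, similar to np.mean(my_array, axis=0)
my_df.mean()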

By default, Pandas uses numerical labels for rows and columns, but if we want we can override this by explicitly providing lists via the index and columns options.

my_df = pd.DataFrame(my_array, index=['row a','row b','row c'], columns=['col A', 'col B', 'col C','col D','col E'])
my_df
       col A  col B  col C  col D  col E
row a     69      0      1     91     15
row b     68     89     69     58     50
row c     28     94     55     40     63

From a dictionary#

There are multiple ways to create DataFrames. One of the most “natural” ones is to transform a dictionary of lists into a DataFrame. Here each element of the dictionary defines a column. For example:

dict_of_list = {
    'birth': [1860, 1770, 1858, 1906],
    'death': [1911, 1827, 1924, 1975],
    'city': ['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']
}
pd.DataFrame(dict_of_list)
   birth  death              city
0   1860   1911           Kaliste
1   1770   1827              Bonn
2   1858   1924           Lucques
3   1906   1975  Saint-Petersburg
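A dictionary of lists is not the only natural construction. For example, a list of dictionaries, where each dictionary describes one row, works as well. A minimal sketch reusing two of the rows above:

# each dictionary becomes one row; the keys become column names
list_of_dicts = [
    {'birth': 1860, 'death': 1911, 'city': 'Kaliste'},
    {'birth': 1770, 'death': 1827, 'city': 'Bonn'}
]
pd.DataFrame(list_of_dicts)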

Index and column#

One of the main differences between Numpy arrays and Pandas DataFrames is that the elements of the table are located not by their “numerical position” but by their corresponding index and column labels. These two elements are in fact just lists of values:

my_df.index
Index(['row a', 'row b', 'row c'], dtype='object')
my_df.columns
Index(['col A', 'col B', 'col C', 'col D', 'col E'], dtype='object')
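Since these Index objects behave like lists, they can also be converted into plain Python lists when needed:

# turn an Index object into a regular list
list(my_df.index)
my_df.columns.tolist()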

Changing indices#

One can change these properties of the DataFrame directly by assigning a new list or array. For example:

my_df.index = np.array(['x','y','z'])
my_df.columns = np.array(['new A','new B','new C','new D','new E'])
my_df
   new A  new B  new C  new D  new E
x     69      0      1     91     15
y     68     89     69     58     50
z     28     94     55     40     63

Alternatively, one can use dedicated functions that give us more flexibility. For example, we can change just one column or index name by passing a dictionary mapping old to new names to the .rename() method. We also have to specify which axis we want to rename, either 'columns' or 'rows':

my_df = my_df.rename({'new A': 're-name A'},axis='columns')
my_df
   re-name A  new B  new C  new D  new E
x         69      0      1     91     15
y         68     89     69     58     50
z         28     94     55     40     63
my_df = my_df.rename({'x': 'new x'},axis='rows')
my_df
       re-name A  new B  new C  new D  new E
new x         69      0      1     91     15
y             68     89     69     58     50
z             28     94     55     40     63
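As a side note, .rename() also accepts separate mappings through its columns and index keyword arguments, which makes the axis argument unnecessary. A sketch of the equivalent call (the new names here are purely illustrative):

# equivalent renaming without the axis argument
my_df.rename(columns={'new B': 'renamed B'}, index={'y': 'another y'})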

Adding columns#

It is also very easy to add a new column to an existing DataFrame: just assign a list or a single value to a column name that does not yet exist. A single value is broadcast to all rows:

my_df['new column'] = 1
my_df
       re-name A  new B  new C  new D  new E  new column
new x         69      0      1     91     15           1
y             68     89     69     58     50           1
z             28     94     55     40     63           1

We have seen before that DataFrames can contain different types of data, so we can also fill a column with text:

my_df['text column'] = ['a', 'b', 'c']
my_df
       re-name A  new B  new C  new D  new E  new column  text column
new x         69      0      1     91     15           1            a
y             68     89     69     58     50           1            b
z             28     94     55     40     63           1            c
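Columns can also be combined element-wise, exactly as we did with Numpy arrays. As a sketch, the following returns a Series that could in turn be assigned to a new column:

# element-wise arithmetic between two columns returns a Series
my_df['new B'] + my_df['new C']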

Dropping items#

Conversely, we can remove rows or columns with the drop() method. Here too we have to specify, via the axis argument, whether we want to affect 'rows' or 'columns':

my_df.drop(labels='new B', axis='columns')
       re-name A  new C  new D  new E  new column  text column
new x         69      1     91     15           1            a
y             68     69     58     50           1            b
z             28     55     40     63           1            c
my_df.drop(labels='y', axis='rows')
       re-name A  new B  new C  new D  new E  new column  text column
new x         69      0      1     91     15           1            a
z             28     94     55     40     63           1            c
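Note that drop() returns a new DataFrame and leaves the original untouched, which is why 'new B' is still present in the second output above. To keep the result we have to reassign it, and a list of labels lets us drop several items at once. A minimal sketch:

# drop() returns a copy: reassign to keep the result;
# a list of labels removes several columns in one call
my_df = my_df.drop(labels=['new column', 'text column'], axis='columns')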

Importing data#

Until now we have always used the read_XXX functions with their default options. However these functions have many options, in particular regarding the choice of indices and column names. For example, we can specify that one of the columns should be used as the index when importing data. Each data format presents its own challenges, and we present here some core concepts.

Here we use a dataset that you can find at the URL below. It is a simple collection of Excel sheets with information about classical composers. This simple and rather short dataset will let us explore many of the problems that you might encounter with “real” datasets.

First we need a new importer function for Excel:

file_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'
composers = pd.read_excel(file_url)
composers
       composer  birth  death              city
0        Mahler   1860   1911           Kaliste
1     Beethoven   1770   1827              Bonn
2       Puccini   1858   1924           Lucques
3  Shostakovich   1906   1975  Saint-Petersburg

Setting indices#

Again, by default Pandas sets a numerical index. However, here each composer name is unique, so we could use it as the index instead. We can tell the importer to do so with the index_col option:

composers = pd.read_excel(file_url, index_col='composer')
composers
              birth  death              city
composer
Mahler         1860   1911           Kaliste
Beethoven      1770   1827              Bonn
Puccini        1858   1924           Lucques
Shostakovich   1906   1975  Saint-Petersburg
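If the data has already been imported with the default numerical index, the same result can be obtained after the fact with the .set_index() method:

# equivalent after import: promote an existing column to the index
composers = pd.read_excel(file_url)
composers = composers.set_index('composer')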

Skipping rows#

Instead of dropping certain rows at the start or end of the table after import, we can do that directly at import time by specifying the skiprows and skipfooter arguments. Let’s for example remove the first line:

composers = pd.read_excel(file_url, index_col='composer', skiprows=[0])
composers
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 composers = pd.read_excel(file_url, index_col='composer', skiprows=[0])
      2 composers

    (...)

ValueError: Index composer invalid

We ran into trouble because we skipped the first row, which is in fact the header row containing the column names! Without it, Pandas cannot find the 'composer' column requested as index. To skip the actual first data row we need:

composers = pd.read_excel(file_url, index_col='composer', skiprows=[1])
composers
              birth  death              city
composer
Beethoven      1770   1827              Bonn
Puccini        1858   1924           Lucques
Shostakovich   1906   1975  Saint-Petersburg
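The skipfooter argument works the same way but counts rows from the end of the table. As a sketch, this would drop the last row at import time:

# drop the last row of the sheet at import time
composers = pd.read_excel(file_url, index_col='composer', skipfooter=1)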

We can also specify a list of columns that we actually want to use with the usecols option. For example we could skip the city:

composers = pd.read_excel(file_url, index_col='composer', usecols=['composer','birth', 'death'])
composers
              birth  death
composer
Mahler         1860   1911
Beethoven      1770   1827
Puccini        1858   1924
Shostakovich   1906   1975
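usecols is quite flexible: besides a list of column names, it also accepts Excel-style column letters as a string, or a list of column positions. A sketch of equivalent selections, assuming that composer, birth and death are the first three columns of the sheet:

# equivalent selections by Excel letters or by position
pd.read_excel(file_url, index_col='composer', usecols='A:C')
pd.read_excel(file_url, index_col='composer', usecols=[0, 1, 2])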

Format-specific options#

Finally, each importer has options that are specific to its format. For example, with the Excel importer we can specify which sheet to import. By default it imports the first one, but we can also indicate the position of the sheet (0-indexed) or its name:

composers = pd.read_excel(file_url, sheet_name=4)
composers
        composer  birth   death         period  country
 0        Mahler   1860  1911.0  post-romantic  Austria
 1     Beethoven   1770  1827.0       romantic  Germany
 2       Puccini   1858  1924.0  post-romantic    Italy
 3  Shostakovich   1906  1975.0         modern   Russia
 4         Verdi   1813  1901.0       romantic    Italy
 5        Dvorak   1841  1904.0       romantic  Czechia
 6      Schumann   1810  1856.0       romantic  Germany
 7    Stravinsky   1882  1971.0         modern   Russia
 8      Sibelius   1865  1957.0  post-romantic  Finland
 9         Haydn   1732  1809.0        classic  Austria
10        Mozart   1756  1791.0        classic  Austria
11      Messiaen   1908  1992.0         modern   France
12     Prokofiev   1891  1953.0         modern   RUssia
13    Monteverdi   1567  1643.0    renaissance    Italy
14       Haendel   1685  1759.0        baroque  Germany
15        Brahms   1833  1897.0       romantic  Germany
16       Purcell   1659  1695.0        baroque  England
17   Charpentier   1643  1704.0        baroque   France
18      Bruckner   1824  1896.0  post-romantic  Austria
19          Berg   1885  1935.0         modern  Austria
20      Couperin   1626  1661.0        baroque   France
21        Rameau   1683  1764.0        baroque   France
22       Berlioz   1803  1869.0       romantic   France
23        Gounod   1818  1893.0       romantic   France
24      Massenet   1842  1912.0       romantic   France
25        Boulez   1925  2016.0         modern   France
26    Palestrina   1525  1594.0    renaissance    Italy
27      Gesualdo   1566  1613.0    renaissance    Italy
28       Caldara   1670  1736.0        baroque    Italy
29     Pergolesi   1710  1736.0        baroque    Italy
30     Scarlatti   1685  1757.0        baroque    Italy
31       Caccini   1587  1640.0        baroque    Italy
32      Cimarosa   1749  1801.0        classic    Italy
33     Donizetti   1797  1848.0       romantic    Italy
34   Leoncavallo   1858  1919.0       romantic    Italy
35       Bellini   1801  1835.0       romantic    Italy
36         Dufay   1397  1474.0    renaissance  Belgium
37        Lassus   1532  1594.0    renaissance  Belgium
38       Borodin   1833  1887.0       romantic   Russia
39   Mussorsgsky   1839  1881.0       romantic   Russia
40         Soler   1754  1806.0        classic    Spain
41       Albeniz   1860  1909.0       romantic    Spain
42      Granados   1867  1916.0       romantic    Spain
43       Dowland   1563  1626.0    renaissance  England
44          Byrd   1540  1623.0    renaissance  England
45        Walton   1902  1983.0         modern  England
46          Adès   1971     NaN         modern  England
47          Bach   1685  1750.0        baroque  Germany
48        Wagner   1813  1883.0       romantic  Germany
49       Strauss   1864  1949.0  post-romantic  Germany
50          Orff   1895  1982.0         modern  Germany
51         Dusek   1731  1799.0        classic  Czechia
52       Smetana   1824  1884.0       romantic  Czechia
53       Janacek   1854  1928.0         modern  Czechia
54       Copland   1900  1990.0         modern      USA
55     Bernstein   1918  1990.0         modern      USA
56         Glass   1937     NaN         modern      USA
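The sheet can also be selected by its name rather than its position, and passing sheet_name=None imports all sheets at once as a dictionary mapping sheet names to DataFrames. A sketch (the actual sheet names depend on the file):

# import every sheet at once: returns {sheet_name: DataFrame}
all_sheets = pd.read_excel(file_url, sheet_name=None)
list(all_sheets.keys())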

Saving tables#

Most of the time you don’t need to save many tables: you keep a notebook or a script for the processing and only keep the final output, such as a statistic or a plot. Sometimes, however, you want to keep intermediate steps. For that we can use the counterparts of the read_XXX functions, i.e. the to_XXX methods. Here again, you have plenty of options that you can explore. For example, you can avoid saving the index and use ; as separator:

composers.to_csv('export_test.csv', index=False, sep=';')
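To check the export, the file can be read back with the matching reader and the same separator. The same logic applies to the other formats, e.g. to_excel for Excel (writing Excel files assumes an engine such as openpyxl is installed):

# read the exported file back, specifying the same separator
pd.read_csv('export_test.csv', sep=';')
# the Excel counterpart
composers.to_excel('export_test.xlsx', index=False)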

Exercises#

  1. Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv

  2. Rename the bill_length_mm and bill_depth_mm columns into length and depth.

  3. Check whether the column names of the DataFrame have changed. If not, do you understand why?

  4. Add a column with the name my_column and fill it with the default value ‘test’.

  5. We have seen how to remove a given row. Can you find out how to remove the first 10 rows?

  6. What do you observe happens to the index?

  7. Import the data again, but this time only the first 10 rows. Can you find an option of the read_csv function to do that?