7. Back to Pandas#
We started this course by importing a DataFrame, briefly exploring it, and discovering that Numpy arrays underlie these structures. Then we explored those Numpy arrays in more detail. Now we come back to DataFrames and will see that many concepts from Numpy can be re-used in Pandas.
import pandas as pd
import numpy as np
Creating DataFrames#
We have seen that we can extract Numpy arrays from a DataFrame, but we can also do the reverse: create a DataFrame from an array. While most of the time you will import data as DataFrames, sometimes it is useful to create a DataFrame explicitly.
From a Numpy array#
We know how to create a 2D Numpy array, for example:
my_array = np.random.randint(0, 100, ((3,5)))
my_array
array([[69, 0, 1, 91, 15],
[68, 89, 69, 58, 50],
[28, 94, 55, 40, 63]])
We can simply transform this array into a DataFrame using the pd.DataFrame function:
my_df = pd.DataFrame(my_array)
my_df
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 69 | 0 | 1 | 91 | 15 |
1 | 68 | 89 | 69 | 58 | 50 |
2 | 28 | 94 | 55 | 40 | 63 |
We will see that many of the operations that we used on Numpy arrays also work with Pandas DataFrames. As a start, you can for example get the shape of a DataFrame:
my_df.shape
(3, 5)
By default, Pandas uses a numerical index for each row and each column, but if we want we can override this by explicitly providing lists for the index and columns options.
my_df = pd.DataFrame(my_array, index=['row a','row b','row c'], columns=['col A', 'col B', 'col C','col D','col E'])
my_df
col A | col B | col C | col D | col E | |
---|---|---|---|---|---|
row a | 69 | 0 | 1 | 91 | 15 |
row b | 68 | 89 | 69 | 58 | 50 |
row c | 28 | 94 | 55 | 40 | 63 |
From a dictionary#
There are multiple ways to create DataFrames. One of the most “natural” ones is to transform a dictionary of lists into a DataFrame. Here each element of the dictionary defines a column. For example:
dict_of_list = {
'birth': [1860, 1770, 1858, 1906],
'death':[1911, 1827, 1924, 1975],
'city':['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']
}
pd.DataFrame(dict_of_list)
birth | death | city | |
---|---|---|---|
0 | 1860 | 1911 | Kaliste |
1 | 1770 | 1827 | Bonn |
2 | 1858 | 1924 | Lucques |
3 | 1906 | 1975 | Saint-Petersburg |
Index and column#
One of the main differences between Numpy arrays and Pandas DataFrames is that elements in the table are located not by their “numerical position” but by their corresponding index and column. These two elements are in fact just lists of values:
my_df.index
Index(['row a', 'row b', 'row c'], dtype='object')
my_df.columns
Index(['col A', 'col B', 'col C', 'col D', 'col E'], dtype='object')
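To see this difference in practice, here is a minimal sketch (using a small throwaway DataFrame, not the one from above) contrasting label-based access with .loc and position-based access with .iloc:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['row a', 'row b'],
                  columns=['col A', 'col B'])

# label-based lookup: row and column are addressed by name
by_label = df.loc['row b', 'col A']

# position-based lookup: the same element by numerical position
by_position = df.iloc[1, 0]

print(by_label, by_position)  # 3 3
```

Note that .loc keeps working even if the rows are reordered, since it relies on labels and not positions.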
Changing indices#
One can change these properties of the DataFrame directly by reassigning a new list. For example:
my_df.index = np.array(['x','y','z'])
my_df.columns = np.array(['new A','new B','new C','new D','new E'])
my_df
new A | new B | new C | new D | new E | |
---|---|---|---|---|---|
x | 69 | 0 | 1 | 91 | 15 |
y | 68 | 89 | 69 | 58 | 50 |
z | 28 | 94 | 55 | 40 | 63 |
Alternatively, one can use dedicated functions that give us more flexibility. For example, we can rename just one column or index entry by passing a dictionary mapping old -> new names to the .rename() method. We also have to specify which axis we want to rename, either 'columns' or 'rows':
my_df = my_df.rename({'new A': 're-name A'},axis='columns')
my_df
re-name A | new B | new C | new D | new E | |
---|---|---|---|---|---|
x | 69 | 0 | 1 | 91 | 15 |
y | 68 | 89 | 69 | 58 | 50 |
z | 28 | 94 | 55 | 40 | 63 |
my_df = my_df.rename({'x': 'new x'},axis='rows')
my_df
re-name A | new B | new C | new D | new E | |
---|---|---|---|---|---|
new x | 69 | 0 | 1 | 91 | 15 |
y | 68 | 89 | 69 | 58 | 50 |
z | 28 | 94 | 55 | 40 | 63 |
Adding columns#
It is also very easy to add a new column to an existing DataFrame. For that you just assign a list or a single value to a column that does not yet exist:
my_df['new column'] = 1
my_df
re-name A | new B | new C | new D | new E | new column | |
---|---|---|---|---|---|---|
new x | 69 | 0 | 1 | 91 | 15 | 1 |
y | 68 | 89 | 69 | 58 | 50 | 1 |
z | 28 | 94 | 55 | 40 | 63 | 1 |
We have seen before that a DataFrame can contain different types of data, so we can also fill a column with text:
my_df['text column'] = ['a', 'b', 'c']
my_df
re-name A | new B | new C | new D | new E | new column | text column | |
---|---|---|---|---|---|---|---|
new x | 69 | 0 | 1 | 91 | 15 | 1 | a |
y | 68 | 89 | 69 | 58 | 50 | 1 | b |
z | 28 | 94 | 55 | 40 | 63 | 1 | c |
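New columns can also be computed from existing ones, which is not shown above but follows the same assignment pattern and re-uses the element-wise arithmetic we know from Numpy. A sketch with a small throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'birth': [1860, 1770], 'death': [1911, 1827]})

# element-wise arithmetic on whole columns, just like with Numpy arrays
df['age'] = df['death'] - df['birth']

print(list(df['age']))  # [51, 57]
```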
Dropping items#
Inversely, we can remove columns or rows using the drop() method. Here too we have to specify with axis whether we want to affect 'rows' or 'columns':
my_df.drop(labels='new B', axis='columns')
re-name A | new C | new D | new E | new column | text column | |
---|---|---|---|---|---|---|
new x | 69 | 1 | 91 | 15 | 1 | a |
y | 68 | 69 | 58 | 50 | 1 | b |
z | 28 | 55 | 40 | 63 | 1 | c |
my_df.drop(labels='y', axis='rows')
re-name A | new B | new C | new D | new E | new column | text column | |
---|---|---|---|---|---|---|---|
new x | 69 | 0 | 1 | 91 | 15 | 1 | a |
z | 28 | 94 | 55 | 40 | 63 | 1 | c |
Importing data#
Until now we have always used the read_XXX functions with default options. However these functions have many options, in particular regarding the choice of indices and column names. For example, we can specify that one of the columns should be used as the index when importing data. Each data format presents its own challenges, and we present here some core concepts.
Here we are using the dataset that you can find here. It is a simple collection of Excel sheets with information about classical composers. This simple and rather short dataset will allow us to explore many of the problems that you might encounter with “real” datasets.
First we need a new importer function for Excel:
file_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'
composers = pd.read_excel(file_url)
composers
composer | birth | death | city | |
---|---|---|---|---|
0 | Mahler | 1860 | 1911 | Kaliste |
1 | Beethoven | 1770 | 1827 | Bonn |
2 | Puccini | 1858 | 1924 | Lucques |
3 | Shostakovich | 1906 | 1975 | Saint-Petersburg |
Setting indices#
Again, by default Pandas sets a numerical index. However, here each composer name is unique, so we could use it as an index. We can specify this to our importer with the index_col option:
composers = pd.read_excel(file_url, index_col='composer')
composers
birth | death | city | |
---|---|---|---|
composer | |||
Mahler | 1860 | 1911 | Kaliste |
Beethoven | 1770 | 1827 | Bonn |
Puccini | 1858 | 1924 | Lucques |
Shostakovich | 1906 | 1975 | Saint-Petersburg |
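If you forgot to set the index at import time, the same result can be obtained afterwards with the set_index() method. A minimal sketch using a small hand-made table instead of the Excel file:

```python
import pandas as pd

composers = pd.DataFrame({
    'composer': ['Mahler', 'Beethoven'],
    'birth': [1860, 1770],
})

# promote the 'composer' column to the index, as index_col would do at import
composers = composers.set_index('composer')

print(composers.index.name)              # composer
print(composers.loc['Mahler', 'birth'])  # 1860
```

Like drop(), set_index() returns a new DataFrame, so we reassign the result.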
Skipping rows#
Instead of dropping certain rows at the start or end of the table, we can also do that directly at import time by specifying the skiprows and skipfooter arguments. Let’s for example remove the first line:
composers = pd.read_excel(file_url, index_col='composer', skiprows=[0])
composers
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 composers = pd.read_excel(file_url, index_col='composer', skiprows=[0])
      2 composers
...
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py:406, in ParserBase._get_simple_index.<locals>.ix(col)
    404 if not isinstance(col, str):
    405     return col
--> 406 raise ValueError(f"Index {col} invalid")

ValueError: Index composer invalid
We ran into trouble because we skipped the first row, which in reality is just the row with the column names! To skip the actual first data row we need:
composers = pd.read_excel(file_url, index_col='composer', skiprows=[1])
composers
birth | death | city | |
---|---|---|---|
composer | |||
Beethoven | 1770 | 1827 | Bonn |
Puccini | 1858 | 1924 | Lucques |
Shostakovich | 1906 | 1975 | Saint-Petersburg |
We can also specify a list of columns that we actually want to use with the usecols option. For example we could skip the city column:
composers = pd.read_excel(file_url, index_col='composer', usecols=['composer','birth', 'death'])
composers
birth | death | |
---|---|---|
composer | ||
Mahler | 1860 | 1911 |
Beethoven | 1770 | 1827 |
Puccini | 1858 | 1924 |
Shostakovich | 1906 | 1975 |
Format-specific options#
Finally, each importer has options specific to its format. For example, with the Excel importer we can specify which sheet to import. By default it imports the first one, but we can also indicate the position of the sheet or its name:
composers = pd.read_excel(file_url, sheet_name=4)
composers
composer | birth | death | period | country | |
---|---|---|---|---|---|
0 | Mahler | 1860 | 1911.0 | post-romantic | Austria |
1 | Beethoven | 1770 | 1827.0 | romantic | Germany |
2 | Puccini | 1858 | 1924.0 | post-romantic | Italy |
3 | Shostakovich | 1906 | 1975.0 | modern | Russia |
4 | Verdi | 1813 | 1901.0 | romantic | Italy |
5 | Dvorak | 1841 | 1904.0 | romantic | Czechia |
6 | Schumann | 1810 | 1856.0 | romantic | Germany |
7 | Stravinsky | 1882 | 1971.0 | modern | Russia |
8 | Sibelius | 1865 | 1957.0 | post-romantic | Finland |
9 | Haydn | 1732 | 1809.0 | classic | Austria |
10 | Mozart | 1756 | 1791.0 | classic | Austria |
11 | Messiaen | 1908 | 1992.0 | modern | France |
12 | Prokofiev | 1891 | 1953.0 | modern | RUssia |
13 | Monteverdi | 1567 | 1643.0 | renaissance | Italy |
14 | Haendel | 1685 | 1759.0 | baroque | Germany |
15 | Brahms | 1833 | 1897.0 | romantic | Germany |
16 | Purcell | 1659 | 1695.0 | baroque | England |
17 | Charpentier | 1643 | 1704.0 | baroque | France |
18 | Bruckner | 1824 | 1896.0 | post-romantic | Austria |
19 | Berg | 1885 | 1935.0 | modern | Austria |
20 | Couperin | 1626 | 1661.0 | baroque | France |
21 | Rameau | 1683 | 1764.0 | baroque | France |
22 | Berlioz | 1803 | 1869.0 | romantic | France |
23 | Gounod | 1818 | 1893.0 | romantic | France |
24 | Massenet | 1842 | 1912.0 | romantic | France |
25 | Boulez | 1925 | 2016.0 | modern | France |
26 | Palestrina | 1525 | 1594.0 | renaissance | Italy |
27 | Gesualdo | 1566 | 1613.0 | renaissance | Italy |
28 | Caldara | 1670 | 1736.0 | baroque | Italy |
29 | Pergolesi | 1710 | 1736.0 | baroque | Italy |
30 | Scarlatti | 1685 | 1757.0 | baroque | Italy |
31 | Caccini | 1587 | 1640.0 | baroque | Italy |
32 | Cimarosa | 1749 | 1801.0 | classic | Italy |
33 | Donizetti | 1797 | 1848.0 | romantic | Italy |
34 | Leoncavallo | 1858 | 1919.0 | romantic | Italy |
35 | Bellini | 1801 | 1835.0 | romantic | Italy |
36 | Dufay | 1397 | 1474.0 | renaissance | Belgium |
37 | Lassus | 1532 | 1594.0 | renaissance | Belgium |
38 | Borodin | 1833 | 1887.0 | romantic | Russia |
39 | Mussorsgsky | 1839 | 1881.0 | romantic | Russia |
40 | Soler | 1754 | 1806.0 | classic | Spain |
41 | Albeniz | 1860 | 1909.0 | romantic | Spain |
42 | Granados | 1867 | 1916.0 | romantic | Spain |
43 | Dowland | 1563 | 1626.0 | renaissance | England |
44 | Byrd | 1540 | 1623.0 | renaissance | England |
45 | Walton | 1902 | 1983.0 | modern | England |
46 | Adès | 1971 | NaN | modern | England |
47 | Bach | 1685 | 1750.0 | baroque | Germany |
48 | Wagner | 1813 | 1883.0 | romantic | Germany |
49 | Strauss | 1864 | 1949.0 | post-romantic | Germany |
50 | Orff | 1895 | 1982.0 | modern | Germany |
51 | Dusek | 1731 | 1799.0 | classic | Czechia |
52 | Smetana | 1824 | 1884.0 | romantic | Czechia |
53 | Janacek | 1854 | 1928.0 | modern | Czechia |
54 | Copland | 1900 | 1990.0 | modern | USA |
55 | Bernstein | 1918 | 1990.0 | modern | USA |
56 | Glass | 1937 | NaN | modern | USA |
Saving tables#
Most commonly, you don’t save many tables: you just keep a notebook or a script for processing and only keep the final output, such as a statistic or a plot. However, sometimes you want to keep intermediary steps. For that we can use the reverse of the read_XXX functions, i.e. the to_XXX functions. Here again, you have plenty of options that you can explore. For example you can avoid saving the index and use ; as separator:
composers.to_csv('export_test.csv', index=False, sep=';')
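To check that such an export round-trips, here is a sketch writing to an in-memory buffer (io.StringIO) instead of a file, so nothing is written to disk; the same options apply to a real path:

```python
import io
import pandas as pd

df = pd.DataFrame({'birth': [1860, 1770], 'death': [1911, 1827]})

# write without the index, using ';' as separator
buffer = io.StringIO()
df.to_csv(buffer, index=False, sep=';')

# read it back, telling read_csv about the non-default separator
buffer.seek(0)
df_back = pd.read_csv(buffer, sep=';')

print(df_back.equals(df))  # True
```

If you had saved the index (the default), read_csv would need index_col=0 to recover it rather than treating it as a data column.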
Exercises#
1. Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
2. Rename the bill_length_mm and bill_depth_mm columns into length and depth.
3. Check that the names of the DataFrame have changed. If not, do you understand why?
4. Add a column with name my_column and fill it with the default value 'test'.
5. We have seen how to remove a given row. Can you find out how to remove the first 10 rows? What do you observe happened with the index?
6. Import the data again, but this time only the first 10 rows. Can you find an option of the read_csv function to do that?