7. Back to Pandas#
We started this course by importing a DataFrame, briefly exploring it, and discovering that Numpy arrays underlie these structures. Then we explored those Numpy arrays in more detail. Now we come back to DataFrames and will see that many concepts from Numpy can be re-used in Pandas.
import pandas as pd
import numpy as np
Creating DataFrames#
We have seen that we can extract Numpy arrays from a DataFrame, but we can also do the reverse: create a DataFrame from an array. While most of the time you will import data as DataFrames, sometimes it is useful to create a DataFrame explicitly.
From a Numpy array#
We know how to create a 2D Numpy array, for example:
my_array = np.random.randint(0, 100, ((3,5)))
my_array
array([[69, 0, 1, 91, 15],
[68, 89, 69, 58, 50],
[28, 94, 55, 40, 63]])
We can simply transform this array into a DataFrame using the pd.DataFrame function:
my_df = pd.DataFrame(my_array)
my_df
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 69 | 0 | 1 | 91 | 15 |
1 | 68 | 89 | 69 | 58 | 50 |
2 | 28 | 94 | 55 | 40 | 63 |
We will see that many of the operations that we used on Numpy arrays also work with Pandas DataFrames. As a start, you can for example get the shape of a DataFrame:
my_df.shape
(3, 5)
By default, Pandas uses a numerical index for each row and each column, but if we want we can override this by explicitly providing lists for the index and columns options.
my_df = pd.DataFrame(my_array, index=['row a','row b','row c'], columns=['col A', 'col B', 'col C','col D','col E'])
my_df
col A | col B | col C | col D | col E | |
---|---|---|---|---|---|
row a | 69 | 0 | 1 | 91 | 15 |
row b | 68 | 89 | 69 | 58 | 50 |
row c | 28 | 94 | 55 | 40 | 63 |
From a dictionary#
There are multiple ways to create DataFrames. One of the most “natural” ones is to transform a dictionary of lists into a DataFrame. Here each element of the dictionary defines a column. For example:
dict_of_list = {
'birth': [1860, 1770, 1858, 1906],
'death':[1911, 1827, 1924, 1975],
'city':['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']
}
pd.DataFrame(dict_of_list)
birth | death | city | |
---|---|---|---|
0 | 1860 | 1911 | Kaliste |
1 | 1770 | 1827 | Bonn |
2 | 1858 | 1924 | Lucques |
3 | 1906 | 1975 | Saint-Petersburg |
Index and column#
One of the main differences between Numpy arrays and Pandas DataFrames is that elements in the table are located not by their “numerical position” but by their corresponding index and column. These two elements are in fact just lists of values:
my_df.index
Index(['row a', 'row b', 'row c'], dtype='object')
my_df.columns
Index(['col A', 'col B', 'col C', 'col D', 'col E'], dtype='object')
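To see this difference in practice, here is a minimal sketch (using a small throwaway DataFrame, not the one from above) contrasting label-based access with .loc and position-based access with .iloc:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['row a', 'row b'],
                  columns=['col A', 'col B'])

# label-based lookup: row and column are addressed by name
by_label = df.loc['row b', 'col A']

# position-based lookup: the same element by numerical position
by_position = df.iloc[1, 0]

print(by_label, by_position)  # 3 3
```

Note that .loc keeps working even if the rows are reordered, since it relies on labels and not positions.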
Changing indices#
One can change these properties of the DataFrame directly by reassigning a new list. For example:
my_df.index = np.array(['x','y','z'])
my_df.columns = np.array(['new A','new B','new C','new D','new E'])
my_df
new A | new B | new C | new D | new E | |
---|---|---|---|---|---|
x | 69 | 0 | 1 | 91 | 15 |
y | 68 | 89 | 69 | 58 | 50 |
z | 28 | 94 | 55 | 40 | 63 |
Alternatively, one can use dedicated functions that give us more flexibility. For example, we can rename just one column or index entry by passing a dictionary mapping old -> new names to the .rename() method. We also have to specify which axis we want to rename, either 'columns' or 'rows':
my_df = my_df.rename({'new A': 're-name A'},axis='columns')
my_df
re-name A | new B | new C | new D | new E | |
---|---|---|---|---|---|
x | 69 | 0 | 1 | 91 | 15 |
y | 68 | 89 | 69 | 58 | 50 |
z | 28 | 94 | 55 | 40 | 63 |
my_df = my_df.rename({'x': 'new x'},axis='rows')
my_df
re-name A | new B | new C | new D | new E | |
---|---|---|---|---|---|
new x | 69 | 0 | 1 | 91 | 15 |
y | 68 | 89 | 69 | 58 | 50 |
z | 28 | 94 | 55 | 40 | 63 |
Adding columns#
It is also very easy to add a new column to an existing DataFrame. For that you just assign a list or a single value to a column that does not yet exist:
my_df['new column'] = 1
my_df
re-name A | new B | new C | new D | new E | new column | |
---|---|---|---|---|---|---|
new x | 69 | 0 | 1 | 91 | 15 | 1 |
y | 68 | 89 | 69 | 58 | 50 | 1 |
z | 28 | 94 | 55 | 40 | 63 | 1 |
We have seen before that a DataFrame can contain different types of data, so we can also fill a column with text:
my_df['text column'] = ['a', 'b', 'c']
my_df
re-name A | new B | new C | new D | new E | new column | text column | |
---|---|---|---|---|---|---|---|
new x | 69 | 0 | 1 | 91 | 15 | 1 | a |
y | 68 | 89 | 69 | 58 | 50 | 1 | b |
z | 28 | 94 | 55 | 40 | 63 | 1 | c |
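New columns can also be computed from existing ones, which is not shown above but follows the same assignment pattern and re-uses the element-wise arithmetic we know from Numpy. A sketch with a small throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'birth': [1860, 1770], 'death': [1911, 1827]})

# element-wise arithmetic on whole columns, just like with Numpy arrays
df['age'] = df['death'] - df['birth']

print(list(df['age']))  # [51, 57]
```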
Dropping items#
Inversely, we can remove columns or rows using the drop() method. Here too we have to specify with axis whether we want to affect 'rows' or 'columns':
my_df.drop(labels='new B', axis='columns')
re-name A | new C | new D | new E | new column | text column | |
---|---|---|---|---|---|---|
new x | 69 | 1 | 91 | 15 | 1 | a |
y | 68 | 69 | 58 | 50 | 1 | b |
z | 28 | 55 | 40 | 63 | 1 | c |
my_df.drop(labels='y', axis='rows')
re-name A | new B | new C | new D | new E | new column | text column | |
---|---|---|---|---|---|---|---|
new x | 69 | 0 | 1 | 91 | 15 | 1 | a |
z | 28 | 94 | 55 | 40 | 63 | 1 | c |
Importing data#
Until now we have always used the read_XXX functions with default options. However these functions have many options, in particular regarding the choice of indices and column names. For example, we can specify that one of the columns should be used as the index when importing data. Each data format presents its own challenges, and we present here some core concepts.
Here we are using the dataset that you can find here. It is a simple collection of Excel sheets with information about classical composers. This simple and rather short dataset will allow us to explore many of the problems that you might encounter with “real” datasets.
First we need a new importer function for Excel:
file_url = 'https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true'
composers = pd.read_excel(file_url)
composers
composer | birth | death | city | |
---|---|---|---|---|
0 | Mahler | 1860 | 1911 | Kaliste |
1 | Beethoven | 1770 | 1827 | Bonn |
2 | Puccini | 1858 | 1924 | Lucques |
3 | Shostakovich | 1906 | 1975 | Saint-Petersburg |
Setting indices#
Again, by default Pandas sets a numerical index. However, here each composer name is unique, so we could use it as an index. We can specify this to our importer with the index_col option:
composers = pd.read_excel(file_url, index_col='composer')
composers
birth | death | city | |
---|---|---|---|
composer | |||
Mahler | 1860 | 1911 | Kaliste |
Beethoven | 1770 | 1827 | Bonn |
Puccini | 1858 | 1924 | Lucques |
Shostakovich | 1906 | 1975 | Saint-Petersburg |
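If you forgot to set the index at import time, the same result can be obtained afterwards with the set_index() method. A minimal sketch using a small hand-made table instead of the Excel file:

```python
import pandas as pd

composers = pd.DataFrame({
    'composer': ['Mahler', 'Beethoven'],
    'birth': [1860, 1770],
})

# promote the 'composer' column to the index, as index_col would do at import
composers = composers.set_index('composer')

print(composers.index.name)              # composer
print(composers.loc['Mahler', 'birth'])  # 1860
```

Like drop(), set_index() returns a new DataFrame, so we reassign the result.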
Skipping rows#
Instead of dropping certain rows at the start or end of the table, we can also do that directly at import time by specifying the skiprows and skipfooter arguments. Let’s for example remove the first line:
composers = pd.read_excel(file_url, index_col='composer', skiprows=[0])
composers
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 composers = pd.read_excel(file_url, index_col='composer', skiprows=[0])
      2 composers
...
File ~/mambaforge/envs/DAVPy2023/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py:406, in ParserBase._get_simple_index.<locals>.ix(col)
    404 if not isinstance(col, str):
    405     return col
--> 406 raise ValueError(f"Index {col} invalid")

ValueError: Index composer invalid
We ran into trouble because we skipped the first row, which in reality is just the row with the column names! To skip the actual first data row we need:
composers = pd.read_excel(file_url, index_col='composer', skiprows=[1])
composers
birth | death | city | |
---|---|---|---|
composer | |||
Beethoven | 1770 | 1827 | Bonn |
Puccini | 1858 | 1924 | Lucques |
Shostakovich | 1906 | 1975 | Saint-Petersburg |
We can also specify a list of columns that we actually want to use with the usecols option. For example we could skip the city column:
composers = pd.read_excel(file_url, index_col='composer', usecols=['composer','birth', 'death'])
composers
birth | death | |
---|---|---|
composer | ||
Mahler | 1860 | 1911 |
Beethoven | 1770 | 1827 |
Puccini | 1858 | 1924 |
Shostakovich | 1906 | 1975 |
Format-specific options#
Finally, each importer has options specific to its format. For example, with the Excel importer we can specify which sheet to import. By default it imports the first one, but we can also indicate the position of the sheet or its name:
composers = pd.read_excel(file_url, sheet_name=4)
composers
composer | birth | death | period | country | |
---|---|---|---|---|---|
0 | Mahler | 1860 | 1911.0 | post-romantic | Austria |
1 | Beethoven | 1770 | 1827.0 | romantic | Germany |
2 | Puccini | 1858 | 1924.0 | post-romantic | Italy |
3 | Shostakovich | 1906 | 1975.0 | modern | Russia |
4 | Verdi | 1813 | 1901.0 | romantic | Italy |
5 | Dvorak | 1841 | 1904.0 | romantic | Czechia |
6 | Schumann | 1810 | 1856.0 | romantic | Germany |
7 | Stravinsky | 1882 | 1971.0 | modern | Russia |
8 | Sibelius | 1865 | 1957.0 | post-romantic | Finland |
9 | Haydn | 1732 | 1809.0 | classic | Austria |
10 | Mozart | 1756 | 1791.0 | classic | Austria |
11 | Messiaen | 1908 | 1992.0 | modern | France |
12 | Prokofiev | 1891 | 1953.0 | modern | RUssia |
13 | Monteverdi | 1567 | 1643.0 | renaissance | Italy |
14 | Haendel | 1685 | 1759.0 | baroque | Germany |
15 | Brahms | 1833 | 1897.0 | romantic | Germany |
16 | Purcell | 1659 | 1695.0 | baroque | England |
17 | Charpentier | 1643 | 1704.0 | baroque | France |
18 | Bruckner | 1824 | 1896.0 | post-romantic | Austria |
19 | Berg | 1885 | 1935.0 | modern | Austria |
20 | Couperin | 1626 | 1661.0 | baroque | France |
21 | Rameau | 1683 | 1764.0 | baroque | France |
22 | Berlioz | 1803 | 1869.0 | romantic | France |
23 | Gounod | 1818 | 1893.0 | romantic | France |
24 | Massenet | 1842 | 1912.0 | romantic | France |
25 | Boulez | 1925 | 2016.0 | modern | France |
26 | Palestrina | 1525 | 1594.0 | renaissance | Italy |
27 | Gesualdo | 1566 | 1613.0 | renaissance | Italy |
28 | Caldara | 1670 | 1736.0 | baroque | Italy |
29 | Pergolesi | 1710 | 1736.0 | baroque | Italy |
30 | Scarlatti | 1685 | 1757.0 | baroque | Italy |
31 | Caccini | 1587 | 1640.0 | baroque | Italy |
32 | Cimarosa | 1749 | 1801.0 | classic | Italy |
33 | Donizetti | 1797 | 1848.0 | romantic | Italy |
34 | Leoncavallo | 1858 | 1919.0 | romantic | Italy |
35 | Bellini | 1801 | 1835.0 | romantic | Italy |
36 | Dufay | 1397 | 1474.0 | renaissance | Belgium |
37 | Lassus | 1532 | 1594.0 | renaissance | Belgium |
38 | Borodin | 1833 | 1887.0 | romantic | Russia |
39 | Mussorsgsky | 1839 | 1881.0 | romantic | Russia |
40 | Soler | 1754 | 1806.0 | classic | Spain |
41 | Albeniz | 1860 | 1909.0 | romantic | Spain |
42 | Granados | 1867 | 1916.0 | romantic | Spain |
43 | Dowland | 1563 | 1626.0 | renaissance | England |
44 | Byrd | 1540 | 1623.0 | renaissance | England |
45 | Walton | 1902 | 1983.0 | modern | England |
46 | Adès | 1971 | NaN | modern | England |
47 | Bach | 1685 | 1750.0 | baroque | Germany |
48 | Wagner | 1813 | 1883.0 | romantic | Germany |
49 | Strauss | 1864 | 1949.0 | post-romantic | Germany |
50 | Orff | 1895 | 1982.0 | modern | Germany |
51 | Dusek | 1731 | 1799.0 | classic | Czechia |
52 | Smetana | 1824 | 1884.0 | romantic | Czechia |
53 | Janacek | 1854 | 1928.0 | modern | Czechia |
54 | Copland | 1900 | 1990.0 | modern | USA |
55 | Bernstein | 1918 | 1990.0 | modern | USA |
56 | Glass | 1937 | NaN | modern | USA |
Saving tables#
Most commonly, you don’t save many tables: you just keep a notebook or a script for processing and only keep the final output, such as a statistic or a plot. However, sometimes you want to keep intermediary steps. For that we can use the reverse of the read_XXX functions, i.e. the to_XXX functions. Here again, you have plenty of options that you can explore. For example you can avoid saving the index and use ; as separator:
composers.to_csv('export_test.csv', index=False, sep=';')
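To check that such an export round-trips, here is a sketch writing to an in-memory buffer (io.StringIO) instead of a file, so nothing is written to disk; the same options apply to a real path:

```python
import io
import pandas as pd

df = pd.DataFrame({'birth': [1860, 1770], 'death': [1911, 1827]})

# write without the index, using ';' as separator
buffer = io.StringIO()
df.to_csv(buffer, index=False, sep=';')

# read it back, telling read_csv about the non-default separator
buffer.seek(0)
df_back = pd.read_csv(buffer, sep=';')

print(df_back.equals(df))  # True
```

If you had saved the index (the default), read_csv would need index_col=0 to recover it rather than treating it as a data column.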
Exercises#
1. Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
2. Rename the bill_length_mm and bill_depth_mm columns into length and depth.
3. Check that the names of the DataFrame have changed. If not, do you understand why?
4. Add a column with name my_column and fill it with the default value 'test'.
5. We have seen how to remove a given row. Can you find out how to remove the first 10 rows? What do you observe happened with the index?
6. Import the data again, but this time only the first 10 rows. Can you find an option of the read_csv function to do that?