2. Numpy arrays#

We have seen in the last notebook that the objects underlying Pandas DataFrames are Numpy arrays. Why do we need this additional container, and why can't we just use Python lists?

Let's imagine we have a list containing weights in grams:

gramms = [5400, 3491, 2591, 14100]

Now we want to convert this list into kilograms. We have no choice but to use a for loop (or a list comprehension) to divide each element by 1000:

kilograms = []
for i in range(len(gramms)):
    new_value = gramms[i]/1000
    kilograms.append(new_value)
kilograms
[5.4, 3.491, 2.591, 14.1]

You can imagine much more complex cases, e.g. where we combine multiple lists, that make this style of writing cumbersome and slow. What arrays give us is vectorized computation.

Creating an array#

To see how this works, let's create a Numpy array from scratch (without extracting it from a DataFrame). First of all, let's import Numpy.

import numpy as np

We can easily turn our previous list into an array using the np.array function:

gramms_array = np.array(gramms)
gramms_array
array([ 5400,  3491,  2591, 14100])

Vectorization means that we can now operate on the array as a single object, i.e. we can do mathematics with it as if it were a single number. In our example:

kilogramms_array = gramms_array / 1000
kilogramms_array
array([ 5.4  ,  3.491,  2.591, 14.1  ])

As mentioned above, this also works if we need to perform a computation that uses multiple arrays. Let's imagine we have a list of price/\(m^2\) values and a list of surfaces for a series of apartments:

price_per_m2 = [6, 10.3, 12.4, 10.6, 5.7, 4.3, 14, 0.5, 0.5, 17.8, 12.7, 16, 2.7, 17.5, 5.2, 7.1, 1.2, 7.2, 14.5, 11.9]
surface = [238, 239, 265, 212, 143, 132, 142, 133, 109, 291, 225, 165, 141, 197, 298, 289, 123,  90, 132, 203]

Now if we want to calculate the price of each apartment, we can just multiply each price/\(m^2\) by the corresponding surface. We can do that by writing a for loop and filling a new list with the values:

price = []
for i in range(len(price_per_m2)):
    current_price = price_per_m2[i] * surface[i]
    price.append(current_price)
price
[1428,
 2461.7000000000003,
 3286.0,
 2247.2,
 815.1,
 567.6,
 1988,
 66.5,
 54.5,
 5179.8,
 2857.5,
 2640,
 380.70000000000005,
 3447.5,
 1549.6000000000001,
 2051.9,
 147.6,
 648.0,
 1914.0,
 2415.7000000000003]

Again we transform the two lists into arrays:

price_per_m2_array = np.array(price_per_m2)
surface_array = np.array(surface)

Instead of having to write a for loop, Numpy now allows us to just use a standard mathematical operation where we multiply the two arrays:

price_array = price_per_m2_array * surface_array
price_array
array([1428. , 2461.7, 3286. , 2247.2,  815.1,  567.6, 1988. ,   66.5,
         54.5, 5179.8, 2857.5, 2640. ,  380.7, 3447.5, 1549.6, 2051.9,
        147.6,  648. , 1914. , 2415.7])

You see that when multiplying two arrays, Numpy simply multiplies each element of one array by the corresponding element of the other array.
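This element-wise behavior is not specific to multiplication: all the standard arithmetic operators work the same way. A minimal sketch with two small, hypothetical arrays:

```python
import numpy as np

# Two small example arrays (hypothetical data for illustration)
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# All standard arithmetic operators act element by element
print(a + b)   # array([11, 22, 33])
print(b - a)   # array([ 9, 18, 27])
print(b / a)   # array([10., 10., 10.])
print(a ** 2)  # array([1, 4, 9])
```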

Advantages of vectorization#

There are two main advantages to this approach. First, it makes the code much simpler: we achieved in a single line what took an entire for loop with plain lists (there are slightly more concise ways to do this even in plain Python, via list comprehensions).

Second, it makes our code run much faster. When we write a for loop, each operation is done separately, and since Python is dynamically typed (you don't have to declare whether a variable holds text or numbers) it has to repeatedly carry out type checks. In the Numpy vectorized version, all multiplications can be done very efficiently because: 1) the array contains only one type of value, so no checks have to be done, and 2) arrays are stored as contiguous blocks in memory, so individual values don't have to be "searched" for.

With this very simple example, we can compare the execution times using the magic command %%timeit:

%%timeit -n 10000 -r 5 
price = []
for i in range(len(price_per_m2)):
    current_price = price_per_m2[i] * surface[i]
    price.append(current_price)
2.14 µs ± 522 ns per loop (mean ± std. dev. of 5 runs, 10,000 loops each)
%%timeit -n 10000 -r 5
price_array = price_per_m2_array * surface_array
956 ns ± 522 ns per loop (mean ± std. dev. of 5 runs, 10,000 loops each)
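The %%timeit magic only works inside a notebook. If you want to run the same comparison in a plain Python script, you can use the standard-library timeit module; a minimal sketch, using shortened hypothetical data:

```python
import timeit
import numpy as np

# Shortened hypothetical data for illustration
price_per_m2 = [6, 10.3, 12.4]
surface = [238, 239, 265]
price_per_m2_array = np.array(price_per_m2)
surface_array = np.array(surface)

def loop_version():
    # plain-Python for loop over the lists
    price = []
    for i in range(len(price_per_m2)):
        price.append(price_per_m2[i] * surface[i])
    return price

def vector_version():
    # vectorized Numpy multiplication
    return price_per_m2_array * surface_array

# Each function is called 10000 times; timeit returns the total time in seconds
t_loop = timeit.timeit(loop_version, number=10000)
t_vec = timeit.timeit(vector_version, number=10000)
print(t_loop, t_vec)
```

Note that for arrays this small the Numpy overhead can hide the speed-up; the difference becomes clear with larger arrays.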

Array type#

We have mentioned above that computation is fast because the type of the arrays is known. This means that all the elements of an array must have the same type. Numpy implements its own types, called dtypes. We can access the dtype of an array like this:

price_per_m2_array
array([ 6. , 10.3, 12.4, 10.6,  5.7,  4.3, 14. ,  0.5,  0.5, 17.8, 12.7,
       16. ,  2.7, 17.5,  5.2,  7.1,  1.2,  7.2, 14.5, 11.9])
price_per_m2_array.dtype
dtype('float64')

We see that by default Numpy decided that the price array had float64 dtype because some of the numbers we used had a decimal point. Notice that it also turned the numbers that didn't have a decimal point into floats (like the first element, 6). Since all elements of an array need to have the same type, Numpy simply selects the most complex one for the entire array.
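We can see this promotion at work with a minimal sketch: a list of pure integers gives an integer array, while a single float in the list promotes the whole array to float64.

```python
import numpy as np

# A list of pure integers gives an integer array
ints = np.array([1, 2, 3])
print(ints.dtype)    # typically int64 (the exact integer dtype is platform dependent)

# A single float in the list promotes the whole array to float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
print(mixed[0])      # 1.0: the integer was converted too
```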

Let’s see what dtype the surface array has:

surface_array.dtype
dtype('int64')

We only used integer numbers in that list, and therefore Numpy can use a "simpler" dtype for that array.

Finally let’s see the result of our multiplication:

price_array.dtype
dtype('float64')

When combining multiple arrays, Numpy always selects the most complex dtype for the output.

If needed, we can also change the dtype of an array explicitly using the astype method. For example, if we want our surface_array to contain floats instead of integers, we can write:

surface_array_float = surface_array.astype(np.float64)
surface_array_float.dtype
dtype('float64')

Notice how we had to create a new array: most operations on Numpy arrays are not done in place, i.e. the array itself is not changed.
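We can verify this with a minimal sketch (using a shortened, hypothetical surface_array): after calling astype, the original array keeps its integer dtype.

```python
import numpy as np

# Shortened hypothetical data for illustration
surface_array = np.array([238, 239, 265])

# astype returns a NEW array; the original is left untouched
as_float = surface_array.astype(np.float64)
print(surface_array.dtype)  # still an integer dtype
print(as_float.dtype)       # float64
```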

Back to Pandas#

Before we explore Numpy arrays and the operations we can apply to them a bit further, let's briefly come back to our Pandas DataFrame. We will use a simpler table that is also available online here. Notice that this time the file really is an Excel sheet, so we use the read_excel function:

import pandas as pd

composers = pd.read_excel('https://github.com/guiwitz/NumpyPandas_course/blob/master/Data/composers.xlsx?raw=true')
composers
composer birth death city
0 Mahler 1860 1911 Kaliste
1 Beethoven 1770 1827 Bonn
2 Puccini 1858 1924 Lucques
3 Shostakovich 1906 1975 Saint-Petersburg

Let’s look at the birth column:

composers['birth']
0    1860
1    1770
2    1858
3    1906
Name: birth, dtype: int64

We see that here we also get dtype information, in this case int64, since the underlying data of the DataFrame are Numpy arrays.

Just like with Numpy arrays, we can explicitly ask for the dtype:

composers['birth'].dtype
dtype('int64')

And we can also change the dtype using astype(). Here again, we need to assign the result of the change to a new series or directly back to the original DataFrame:

composers['birth'] = composers['birth'].astype(np.float64)
composers
composer birth death city
0 Mahler 1860.0 1911 Kaliste
1 Beethoven 1770.0 1827 Bonn
2 Puccini 1858.0 1924 Lucques
3 Shostakovich 1906.0 1975 Saint-Petersburg
composers['birth'].dtype
dtype('float64')

We immediately see that the numbers in the birth column now have a decimal point, and if we ask for the column type, we indeed now get a float.
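The same pattern works on any DataFrame column; a minimal sketch with a small hypothetical table standing in for the composers data:

```python
import pandas as pd
import numpy as np

# Small hypothetical DataFrame standing in for the composers table
df = pd.DataFrame({'composer': ['Mahler', 'Beethoven'],
                   'birth': [1860, 1770]})

print(df['birth'].dtype)  # int64

# astype returns a new Series; assign it back to change the DataFrame
df['birth'] = df['birth'].astype(np.float64)
print(df['birth'].dtype)  # float64
```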

Exercise#

  1. Create an array with 3 elements and one with 5 elements, both containing integers.

  2. Try to multiply the two arrays.

  3. You should get an error message. Do you understand the problem? How can you fix it?

  4. Change the dtype of the output to float32.

  5. Import the file that you can find here: https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv

  6. Use head() to visualize a few lines.

  7. What's the type of the body_mass_g and year columns?

  8. Transform the type of the year column into a float.