4. Numpy indexing#

Often we need to extract some information from a Pandas DataFrame. Here also, Pandas inherits many of the approaches used in Numpy. Therefore we start here by very briefly showing how to proceed with plain arrays before looking at the more complex DataFrames.

Note that Numpy indexing is very powerful and that we cover only a tiny fraction of the topic here. To learn more, you can for example consult the Numpy reference.

import numpy as np

We first create an array:

my_array = np.random.normal(size=10)
my_array
array([-0.56717818, -1.22468922,  1.22088759,  0.91442352,  0.45195361,
        0.58047433, -1.28138648, -0.64241218,  0.60848236, -0.05006814])

Extracting and setting elements#

The standard way to extract information from an array is to use the square bracket notation. For example, to extract the second element of the array we write:

my_array[1]
-1.2246892151602833

Remember that we start counting from 0 in Python, which is why the second element has index 1.

We can extend the notation and extract a range of elements with the from_index:to_index notation, where to_index is excluded: the element at the last index specified is not part of the result. For example, to recover the elements with indices 1 to 3 we write:

my_array[1:4]
array([-1.22468922,  1.22088759,  0.91442352])
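
The exclusion of the last index is easier to see on an array with known values. The following is a minimal sketch; the array `demo` and its contents are illustrative and not part of the example above:

```python
import numpy as np

# A small array with known values, so the result is easy to check by eye
demo = np.arange(10) * 10   # values 0, 10, 20, ..., 90

# Indices 1, 2 and 3 are returned; index 4 (the stop value) is excluded
part = demo[1:4]
print(part.tolist())        # [10, 20, 30]

# The number of elements returned equals stop - start
print(len(part))            # 3
```

A convenient consequence of excluding the stop index is that the length of the result is simply the difference of the two indices.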

We can also set values in the array in the same manner. For example, let’s set the above elements to 10:

my_array[1:4] = 10
my_array
array([-0.56717818, 10.        , 10.        , 10.        ,  0.45195361,
        0.58047433, -1.28138648, -0.64241218,  0.60848236, -0.05006814])
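
As a small complement, the sketch below shows that the right-hand side of such an assignment can be either a single value, which is repeated over the selected range, or a sequence whose length matches the range. The array `demo` is an illustrative stand-in, not the array used above:

```python
import numpy as np

demo = np.zeros(6)

# A single value is repeated over the whole selected range
demo[1:4] = 10

# A sequence works too, as long as its length matches the selected range
demo[4:6] = [7, 8]

print(demo.tolist())   # [0.0, 10.0, 10.0, 10.0, 7.0, 8.0]
```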

Note that you can sometimes simplify the notation. For example, if you want to extract all elements from index 4 (the fifth element) to the last one, you don’t have to specify the last index; you can simply leave it out and keep only the colon:

my_array[4:]
array([ 0.45195361,  0.58047433, -1.28138648, -0.64241218,  0.60848236,
       -0.05006814])
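
The same shortcut applies to the start index, and it combines with two features not covered above: negative indices, which count from the end, and an optional third value, the step. A brief sketch on an illustrative array `demo`:

```python
import numpy as np

demo = np.arange(10)          # values 0, 1, ..., 9

first_three = demo[:3]        # start omitted: begin at index 0
from_four = demo[4:]          # stop omitted: run to the end
last_two = demo[-2:]          # negative start: count from the end
every_second = demo[::2]      # start and stop omitted, step of 2

print(first_three.tolist())   # [0, 1, 2]
print(from_four.tolist())     # [4, 5, 6, 7, 8, 9]
print(last_two.tolist())      # [8, 9]
print(every_second.tolist())  # [0, 2, 4, 6, 8]
```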

Higher dimensions#

We have seen before that we can create arrays with more than one dimension (think e.g. of the pixels of an image). For example:

array2D = np.random.normal(size=(3,5))
array2D
array([[ 2.40100503,  0.13748981, -0.62384989, -1.16260668,  0.45016854],
       [ 0.85358083,  0.63036613,  1.22776553, -0.17238484, -1.15729928],
       [ 0.32684466,  0.24254852,  1.02185303, -0.11438528, -1.42805106]])

The indexing system works in the same way here. We just have to specify, for each dimension, which rows and columns we want to extract, using array[start_row:end_row, start_column:end_column]. As before, the end indices are excluded:

array2D[1:3, 0:2]
array([[0.85358083, 0.63036613],
       [0.32684466, 0.24254852]])

Here again, we can simplify the notation. If we want to select a few rows but want to keep all columns, we can again use the : notation like this:

array2D[1:3, :]
array([[ 0.85358083,  0.63036613,  1.22776553, -0.17238484, -1.15729928],
       [ 0.32684466,  0.24254852,  1.02185303, -0.11438528, -1.42805106]])
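
To make the row/column logic easier to follow, here is a short sketch on a 2D array with predictable values. The array `grid` is illustrative; it has the same shape as array2D but is filled with the integers 0 to 14:

```python
import numpy as np

# 3 rows, 5 columns:
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]
grid = np.arange(15).reshape(3, 5)

single = grid[1, 2]        # one element: row 1, column 2
row = grid[0, :]           # entire first row
column = grid[:, 1]        # entire second column
block = grid[1:3, 0:2]     # rows 1-2, columns 0-1

print(single)              # 7
print(row.tolist())        # [0, 1, 2, 3, 4]
print(column.tolist())     # [1, 6, 11]
print(block.tolist())      # [[5, 6], [10, 11]]
print(block.shape)         # (2, 2)
```

Note that selecting a single row or column with an integer index returns a one-dimensional array, while slicing both dimensions keeps the result two-dimensional.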

Working with sub-parts#

Using indexing, we can also create a smaller array that we want to work on specifically. For example, let’s say we are only interested in the last three elements (indices 7 to 9). We can extract them and assign them to a new array:

sub_array = my_array[7:10]
my_array
array([-0.56717818, 10.        , 10.        , 10.        ,  0.45195361,
        0.58047433, -1.28138648, -0.64241218,  0.60848236, -0.05006814])
sub_array
array([-0.64241218,  0.60848236, -0.05006814])

Let’s now modify an element of this subarray:

sub_array[0] = 100

Let’s check that sub_array has indeed changed:

sub_array
array([ 1.00000000e+02,  6.08482363e-01, -5.00681387e-02])

Let’s now also have a look at the original array:

my_array
array([-5.67178177e-01,  1.00000000e+01,  1.00000000e+01,  1.00000000e+01,
        4.51953606e-01,  5.80474333e-01, -1.28138648e+00,  1.00000000e+02,
        6.08482363e-01, -5.00681387e-02])

The value in the original array has changed too! The reason is that slicing an array does not create an independent sub-array: the slice is a view that shares its data with the original. Modifying elements of the slice in place, as we did above, therefore also modifies the original, whereas assigning a completely new array to the name sub_array would not. To be on the safe side, explicitly create a copy when creating a sub-array. That way it is independent of the original one:

sub_array = my_array[7:10].copy()
sub_array[0] = 200
sub_array
array([ 2.00000000e+02,  6.08482363e-01, -5.00681387e-02])
my_array
array([-5.67178177e-01,  1.00000000e+01,  1.00000000e+01,  1.00000000e+01,
        4.51953606e-01,  5.80474333e-01, -1.28138648e+00,  1.00000000e+02,
        6.08482363e-01, -5.00681387e-02])
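
If you are unsure whether two arrays are linked, one way to check is the Numpy function np.shares_memory. The sketch below uses illustrative names (`original`, `view`, `independent`) rather than the arrays from above:

```python
import numpy as np

original = np.arange(10.0)            # values 0.0, 1.0, ..., 9.0

view = original[7:10]                 # plain slice: linked to original
independent = original[7:10].copy()   # explicit copy: not linked

print(np.shares_memory(original, view))         # True
print(np.shares_memory(original, independent))  # False

# Writing through the slice changes the original array...
view[0] = 100
print(original[7])                    # 100.0

# ...while writing to the copy leaves the original untouched
independent[1] = 200
print(original[8])                    # 8.0
```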

Boolean indexing#

Instead of using numerical indices to extract values from the array, we can also select them by some criteria. Let’s create a new random array:

my_array2 = np.random.normal(size=10)
my_array2
array([ 1.70429449,  0.81439495,  1.1642333 ,  0.34425266,  0.21265945,
       -1.23879187, -0.80424111,  0.10781845, -0.74197375,  0.11844308])

How should we proceed if, for example, we only want to recover the elements that are larger than 0?

Let’s see what happens when we just write it down as we would in regular mathematics:

my_array2 > 0
array([ True,  True,  True,  True,  True, False, False,  True, False,
        True])

We see that the output is again an array, but instead of being filled with numbers, it contains only False and True. Those values also exist in plain Python and are called booleans. For example:

a = 3
a > 10
False

We can now store the result of the comparison in a variable to obtain an actual boolean array:

bool_array = my_array2 > 0
bool_array
array([ True,  True,  True,  True,  True, False, False,  True, False,
        True])
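
A boolean array behaves like any other Numpy array: it has a shape and a data type, and since True counts as 1 and False as 0, summing it gives the number of elements that satisfy the condition. A minimal sketch with illustrative values:

```python
import numpy as np

values = np.array([3.0, -1.0, 4.0, -1.5, 5.0, -9.0])
mask = values > 0

print(mask.dtype)                 # bool
print(mask.shape == values.shape) # True: one boolean per element
print(mask.sum())                 # 3: number of True entries
print(mask.any(), mask.all())     # True False
```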

We can now use this boolean array bool_array to extract values from any array of the same size. Imagine that you overlay bool_array on another array value_array and keep only those values of value_array at positions where bool_array is True. Naturally we can do this with the original array itself. Instead of passing an index as in my_array2[i], we pass the entire bool_array:

from IPython.display import Image
Image(url='https://github.com/guiwitz/ISDAwPython_day2/raw/master/images/logical_indexing.jpeg',width=700)
my_array2[bool_array] 
array([1.70429449, 0.81439495, 1.1642333 , 0.34425266, 0.21265945,
       0.10781845, 0.11844308])

This output array is smaller than the original one, as it contains only the seven values larger than 0.
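
The sketch below illustrates three common variations on this idea, using illustrative arrays `values` and `labels`: applying a boolean array to a different array of the same size, combining two conditions with `&` (each condition needs its own parentheses), and using a condition to set values rather than extract them:

```python
import numpy as np

values = np.array([3.0, -1.0, 4.0, -1.5, 5.0, -9.0])
labels = np.array(["a", "b", "c", "d", "e", "f"])

# A boolean array built from one array can select from another
positive = values > 0
print(labels[positive].tolist())   # ['a', 'c', 'e']

# Combine conditions element-wise with & (and) or | (or)
in_range = (values > 0) & (values < 5)
print(values[in_range].tolist())   # [3.0, 4.0]

# Boolean indexing also works on the left-hand side of an assignment
clipped = values.copy()            # work on a copy, keep values intact
clipped[clipped < 0] = 0
print(clipped.tolist())            # [3.0, 0.0, 4.0, 0.0, 5.0, 0.0]
```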

Exercise#

  1. Create a numpy array with values from 0 to 10 in steps of 0.5

  2. Extract the last three elements of the array using slicing.

  3. Apply a cosine function to the full array created in (1.) and store the output in a new array.

  4. Create a boolean array telling which values in the array from (3.) are smaller than 0.

  5. Recover only those values in a new array via indexing.