Data structures

In an analysis pipeline, variables holding single numbers or pieces of text are not enough. We need “containers” that can hold more complex data, for example n-dimensional matrices to hold images, or tables to hold analysis outputs (e.g. size and intensity of detected objects).

Some of the structures we will use come directly with Python (e.g. lists), while others are implemented by external packages, like Numpy arrays for images and Pandas dataframes for tables. We don’t cover all the details of these structures now but will explore them in more detail in later notebooks. In particular, we will come back to dataframes later on.

Native lists

There are several different data structures that are available out of the box in Python. We have already seen lists. Those can contain any type of data, including other lists:

mylist = ['a', 3, [0,1,3]]

Once defined, you can access and change elements by using their index, which starts at 0:

mylist[0]
'a'
mylist[2]
[0, 1, 3]
mylist[0] = 'b'
mylist
['b', 3, [0, 1, 3]]

We can also add elements to a list by appending them, and remove them by popping them:

mylist.append('new element')
mylist
['b', 3, [0, 1, 3], 'new element']
mylist.pop(1)
3
mylist
['b', [0, 1, 3], 'new element']
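
One detail worth noting is that pop does not only remove an element, it also returns it, so the removed value can be stored in a variable. The following is a minimal sketch of this; the variable name removed is chosen here for illustration only.

```python
# Start from the list as it was after the index assignment above
mylist = ['b', 3, [0, 1, 3]]

# append adds an element at the end of the list
mylist.append('new element')

# pop removes the element at the given index and returns it
removed = mylist.pop(1)

print(removed)      # 3
print(mylist)       # ['b', [0, 1, 3], 'new element']
print(len(mylist))  # 3
```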

Lists are typically obtained as the output of a function, for example one returning the sizes of objects in an image.

If we think of an image as multiple rows and columns of pixels, we can imagine representing it as a list of lists, each inner list being e.g. one row of pixels. For example, a 3 x 3 image could be:

my_image = [[4,8,7], [6,4,3], [5,3,7]]
my_image
[[4, 8, 7], [6, 4, 3], [5, 3, 7]]
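
To read a single pixel from such a list of lists, two indices can be chained: the first selects the row, the second selects the column within that row. A short sketch, where the names second_row and pixel are illustrative:

```python
# 3 x 3 image stored as a list of rows (same values as above)
my_image = [[4, 8, 7], [6, 4, 3], [5, 3, 7]]

# The first index selects a row (itself a list)
second_row = my_image[1]

# A second index selects one element inside that row
pixel = my_image[1][2]

print(second_row)  # [6, 4, 3]
print(pixel)       # 3
```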

While in principle we could use a list for this, computations on such objects would be very slow. For example, if we wanted to do background correction and subtract a given value from our image, we would have to go through each element of our list (each pixel) one by one and subtract the background from each pixel in turn. If the background is 3, we would therefore have to compute:

  • 4-3

  • 8-3

  • 7-3

  • 6-3

etc. Since these operations are carried out one at a time in interpreted Python code, this would be very slow, as we couldn’t exploit optimized routines that process whole blocks of numbers at once. It would also be tedious to write such an operation.
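
Written out with plain lists, the background subtraction described above could look like the following sketch. The names background and corrected are introduced here for illustration only.

```python
# 3 x 3 image stored as a list of rows
my_image = [[4, 8, 7], [6, 4, 3], [5, 3, 7]]
background = 3  # illustrative background value

# Visit every row, then every pixel in that row, one at a time
corrected = []
for row in my_image:
    new_row = []
    for pixel in row:
        new_row.append(pixel - background)
    corrected.append(new_row)

print(corrected)  # [[1, 5, 4], [3, 1, 0], [2, 0, 4]]
```

Two nested loops are needed for a 2D image, and every additional dimension would require one more level of nesting.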

To fix this, most scientific fields that work with lists of numbers of some kind (time-series, images, measurements etc.) rely on an external package called Numpy, which offers a computationally efficient kind of list called an array.

Numpy arrays

Almost all scientific numerical data are imported as Numpy arrays in the Python world. For example, a temperature time-series will be a 1D array, the pixels of an image a 2D array, etc. Numpy also offers functions to create such arrays. We have already seen the np.random.normal function:

import numpy as np
my_array = np.random.normal(size=(10,5))
my_array
array([[-0.15481049, -0.80405401, -0.43449909, -0.68704696,  2.85879912],
       [-0.41553978,  1.46838885, -0.29594493, -0.37177195,  0.49944154],
       [ 1.73138715,  0.68731187,  0.46140546,  0.36365711,  0.21998944],
       [-0.44938328, -0.38079231, -0.30119992,  0.30452681, -2.02113752],
       [ 0.15430836,  0.67539493, -0.17414324, -0.40051551, -1.13829721],
       [ 1.03612902, -0.87117468, -1.21334862, -1.31875856, -0.34627066],
       [ 1.44726028, -0.93629604,  1.83142843, -0.96262081, -0.39795731],
       [ 1.58733939, -0.65552682, -0.81770623, -0.0419759 ,  0.44763988],
       [ 2.7830692 , -2.21587893, -0.82228344,  0.21661245, -0.88040222],
       [-0.3219425 , -0.66025536, -0.34081356, -1.47760924,  0.03162098]])

We see that [] is again used to delimit rows, columns etc. The main difference compared to the list of lists that we defined previously is the array indication at the very beginning of the output. It tells us that we are dealing with a Numpy array, an alternative type of list of lists that allows us to do efficient computations. As a quick example, if we want to subtract 10 from every element of the array, we can just write:

my_array - 10.
array([[-10.15481049, -10.80405401, -10.43449909, -10.68704696,
         -7.14120088],
       [-10.41553978,  -8.53161115, -10.29594493, -10.37177195,
         -9.50055846],
       [ -8.26861285,  -9.31268813,  -9.53859454,  -9.63634289,
         -9.78001056],
       [-10.44938328, -10.38079231, -10.30119992,  -9.69547319,
        -12.02113752],
       [ -9.84569164,  -9.32460507, -10.17414324, -10.40051551,
        -11.13829721],
       [ -8.96387098, -10.87117468, -11.21334862, -11.31875856,
        -10.34627066],
       [ -8.55273972, -10.93629604,  -8.16857157, -10.96262081,
        -10.39795731],
       [ -8.41266061, -10.65552682, -10.81770623, -10.0419759 ,
         -9.55236012],
       [ -7.2169308 , -12.21587893, -10.82228344,  -9.78338755,
        -10.88040222],
       [-10.3219425 , -10.66025536, -10.34081356, -11.47760924,
         -9.96837902]])

No need to write for loops to go through all values!

We will learn much more about performing computations with these arrays in later chapters.
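
As a sketch of how this applies to the small 3 x 3 image from the list section, the list of lists can be converted with np.array and the background subtracted in a single expression. The variable names image_array and corrected are chosen here for illustration.

```python
import numpy as np

# The 3 x 3 image from the list section, as a list of rows
my_image = [[4, 8, 7], [6, 4, 3], [5, 3, 7]]

# Convert the list of lists into a 2D Numpy array
image_array = np.array(my_image)

# Subtract the background from every pixel at once, without a loop
corrected = image_array - 3

print(image_array.shape)  # (3, 3)
print(corrected)
```

The subtraction creates a new array and leaves image_array unchanged.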

Other simple structures

During this course, we will from time to time encounter other types of containers, for example tuples. Those are defined with () and are immutable, i.e. we can’t change their values.

mytuple = (3, 'a')
mytuple
(3, 'a')
mytuple[0]
3
mytuple[0] = 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [18], in <module>
----> 1 mytuple[0] = 5

TypeError: 'tuple' object does not support item assignment
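
Tuples often appear as function outputs or object properties. One common case, shown in the small sketch below, is the shape attribute of a Numpy array, which is a tuple that can be unpacked into separate variables. The array is created with np.zeros purely for illustration, and the names n_rows and n_cols are illustrative as well.

```python
import numpy as np

# Illustrative array with 10 rows and 5 columns
my_array = np.zeros((10, 5))

# The shape of an array is stored as a tuple
shape = my_array.shape
print(shape)  # (10, 5)

# A tuple can be unpacked into one variable per element
n_rows, n_cols = shape
print(n_rows, n_cols)  # 10 5
```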

The last structure mentioned here is the dictionary, which is a collection of keys and their corresponding values. For example, you might get this kind of structure as the output of a function that provides different properties of analyzed objects. We define it with curly brackets. Each key can hold any type of content.

mydict = {'area': [10, 12, 4], 'object_type': ['cell', 'nucleus', 'cell']}
mydict
{'area': [10, 12, 4], 'object_type': ['cell', 'nucleus', 'cell']}

We can then access each element by its key (instead of by index as in a list):

mydict['area']
[10, 12, 4]
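
A dictionary can also be extended after its creation by assigning to a new key, and its keys can be listed. The sketch below illustrates this; the key mean_intensity and its values are made up for the example.

```python
mydict = {'area': [10, 12, 4], 'object_type': ['cell', 'nucleus', 'cell']}

# Assigning to a key that does not exist yet adds it to the dictionary
mydict['mean_intensity'] = [0.5, 0.8, 0.3]  # hypothetical values

# List all keys currently stored
print(list(mydict.keys()))  # ['area', 'object_type', 'mean_intensity']

# Pair up the values of two keys, object by object
for kind, area in zip(mydict['object_type'], mydict['area']):
    print(kind, area)
```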

Exercise

  1. Create an empty list. Use the append method to add a few elements.

  2. Try to extract one element using indices. Make sure you get the correct one!

  3. Create a dictionary containing one key for fruit names and one key for fruit weights, and fill it with 3 fruits. Make sure you can recover either the fruit names or the weights.