Python structures

We have now seen that we can define different types of variables and operate on them using classical mathematical operations, functions, or methods. Often, however, we work with more than one value at a time, and so we need to group them together in a coherent unit.

Python offers several such groupings, and we are going to look at two of them: lists and dictionaries. If you want to continue using Python, you should definitely study this topic in more detail, but in this course we are only going to use these two categories.

Lists

Creating lists

Lists are basically ordered collections of values. One of the main properties of lists is that each element can be modified after the list has been created, so a list is a “dynamic” object.

Lists are surrounded by brackets [] and can be created like this:

mylist = [10, 5, 983, 20]
type(mylist)
list

You can create lists of almost anything, for example strings:

['a', 'b','c']
['a', 'b', 'c']

Or even mix different types, although it’s best to avoid this:

['a', 10, 23.54]
['a', 10, 23.54]

List indexes

The simplest operation one can do on a list is to recover a specific value:

mylist
[10, 5, 983, 20]
mylist[2]
983

Note that Python uses 0-based indexing, meaning that the first element has index 0!

As said before, lists are dynamic objects, so one can reassign values:

mylist[2] = 25
mylist
[10, 5, 25, 20]
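
As a minimal sketch of what 0-based indexing implies, the valid indexes of a list run from 0 to len(list) - 1. The names values, first, last and out_of_range below are only chosen for this illustration:

```python
values = [10, 5, 983, 20]

first = values[0]                # index 0 is the first element
last = values[len(values) - 1]   # the last valid index is len(values) - 1
print(first, last)               # 10 20

# Asking for an index past the end of the list raises an IndexError
out_of_range = False
try:
    values[4]
except IndexError as err:
    out_of_range = True
    print("IndexError:", err)
```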

Who is who?

An aspect that can be very confusing in Python is that some objects are not really copied when you assign them to a new variable. Let’s clarify this. For example with simple numbers we have:

a = 5
b = a
b
5

If we now modify a:

a = 10

b still has the old value:

b
5

Now let’s do something similar with a list. We have a first list:

mylist = [10, 5, 983, 20]
mylist
[10, 5, 983, 20]

Now we assign it to a new variable:

mylist2 = mylist
mylist2
[10, 5, 983, 20]

And we modify the original list:

mylist[2] = 10000
mylist
[10, 5, 10000, 20]

What happened to the second list?

mylist2
[10, 5, 10000, 20]

It was changed too! This is because mylist and mylist2 are two names that refer to the same object, not two independent copies. If you really want to create an independent copy of the first list, you can use the copy() method:

mylist2 = mylist.copy()
mylist2
[10, 5, 10000, 20]

Now we can change mylist without affecting mylist2:

mylist[1] = 7000
mylist
[10, 7000, 10000, 20]
mylist2
[10, 5, 10000, 20]
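
One way to check whether two names refer to the same object is the is operator. The following sketch uses illustrative names (original, alias, independent) that are not part of the course material:

```python
original = [10, 5, 983, 20]
alias = original                 # a second name for the same list object
independent = original.copy()    # a new list holding the same elements

print(alias is original)         # True: both names point to one object
print(independent is original)   # False: the copy is a separate object

original[0] = -1                 # modify the list through the first name
print(alias[0])                  # -1, the change is visible through the alias
print(independent[0])            # 10, the copy is unaffected
```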

Using functions and methods

Just like other variables, lists have associated functions and methods. One built-in function that we have already seen is len():

len(mylist)
4

The list indeed contains four elements. We can again find all associated methods using dir():

dir(mylist)
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

Some of these methods are important and we will use them very often, for example append(), which adds a value to the end of a list:

mylist
[10, 7000, 10000, 20]
mylist.append(230)
mylist
[10, 7000, 10000, 20, 230]

We see here that the list has been modified in place: we didn’t have to reassign the result to a new variable. The append() method is important as we will often start with an empty list and fill it progressively:

mylist = []
mylist.append(10)
print(mylist)
[10]
mylist.append(30)
print(mylist)
[10, 30]
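
In practice, filling a list progressively is usually done inside a loop. Here is a minimal sketch, assuming a for loop over a few example numbers (the names squares, number and result are illustrative). It also shows that append() returns None, which is why its result should not be assigned back to the list variable:

```python
squares = []                      # start with an empty list
for number in [1, 2, 3, 4]:
    squares.append(number ** 2)   # append modifies the list in place
print(squares)                    # [1, 4, 9, 16]

# append() does not return the modified list, it returns None
result = squares.append(25)
print(result)                     # None
print(squares)                    # [1, 4, 9, 16, 25]
```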

Dictionaries

Sometimes it’s useful to store very diverse information in a single container, and in that case it is also useful to be able to remember what exactly was stored in that container. For that we can use dictionaries. As the name says, those structures are basically composed of pairs of “words” and “definitions”, where the definition can be a number, a string, a list etc. Let’s imagine we have detected a cell in an image and want to store its location, size and type. We can define the following dictionary:

mydict = {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'}

Now whenever we want to recover the cell size, we can find it using:

mydict['surface']
120
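
Like lists, dictionaries can be modified after creation. The sketch below shows how an existing entry can be updated and a new one added. The 'perimeter' and 'volume' keys and the variable names are hypothetical and only serve the illustration:

```python
cell = {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'}

cell['surface'] = 125                # update the value of an existing key
cell['perimeter'] = 42               # assigning to a new key adds the entry

has_perimeter = 'perimeter' in cell  # test whether a key exists
print(has_perimeter)                 # True

# cell['volume'] would raise a KeyError because the key does not exist;
# get() returns None instead of raising an error
missing = cell.get('volume')
print(missing)                       # None
```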

Again who’s who?

Dictionaries behave in the same way as lists. A simple assignment does not create a true copy:

mydict2 = mydict
mydict['surface'] = 5000
mydict2
{'location_col': 23, 'location_row': 10, 'surface': 5000, 'type': 'embryonic'}
mydict
{'location_col': 23, 'location_row': 10, 'surface': 5000, 'type': 'embryonic'}
mydict2
{'location_col': 23, 'location_row': 10, 'surface': 5000, 'type': 'embryonic'}

Also, if you add a new key, it will appear under both names:

mydict2['test'] = 30
mydict
{'location_col': 23,
 'location_row': 10,
 'surface': 5000,
 'test': 30,
 'type': 'embryonic'}
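
As with lists, an independent copy of a dictionary can be obtained with its copy() method. This is a minimal sketch with illustrative names (cell, cell_alias, cell_copy):

```python
cell = {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'}

cell_alias = cell          # second name for the same dictionary
cell_copy = cell.copy()    # independent dictionary with the same entries

cell['surface'] = 5000     # modify an existing entry
cell['test'] = 30          # add a new key

print(cell_alias['surface'], 'test' in cell_alias)   # 5000 True
print(cell_copy['surface'], 'test' in cell_copy)     # 120 False
```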

Grouping dictionaries

If we imagine that we have three detected cells, we can then group all the information within a list:

mydict = {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'}
mydict2 = {'location_row': 32, 'location_col': 18, 'surface': 130, 'type': 'embryonic'}
mydict3 = {'location_row': 23, 'location_col': 5, 'surface': 90, 'type': 'embryonic'}

all_cells = [mydict, mydict2, mydict3]
all_cells
[{'location_col': 23, 'location_row': 10, 'surface': 120, 'type': 'embryonic'},
 {'location_col': 18, 'location_row': 32, 'surface': 130, 'type': 'embryonic'},
 {'location_col': 5, 'location_row': 23, 'surface': 90, 'type': 'embryonic'}]

You can recover all the “words” (keys) defined in a dictionary using the keys() method:

mydict.keys()
dict_keys(['location_row', 'location_col', 'surface', 'type'])
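
Once the dictionaries are grouped in a list, the same key can be read from every cell with a loop. The sketch below uses illustrative names (cells, surfaces, mean_surface) and collects all surfaces to compute their average by hand:

```python
cells = [
    {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'},
    {'location_row': 32, 'location_col': 18, 'surface': 130, 'type': 'embryonic'},
    {'location_row': 23, 'location_col': 5, 'surface': 90, 'type': 'embryonic'},
]

surfaces = []
for cell in cells:
    surfaces.append(cell['surface'])   # read the same key from each dictionary
print(surfaces)                        # [120, 130, 90]

mean_surface = sum(surfaces) / len(surfaces)
print(mean_surface)

# keys() can be turned into a regular list if needed
key_names = list(cells[0].keys())
print(key_names)
```

This works, but it takes several lines for a single column.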

Dataframes

What native Python is lacking is a data format that simplifies handling tabular data and doing statistics on it. By this I mean something like an Excel sheet, where you have multiple columns and can, for example, take an average for each column. This type of tabular data is provided by the Pandas package and its DataFrame structure. We will see here only a tiny fraction of the possibilities offered by Pandas, so read more about it if you think it might help you. Let’s import Pandas:

import pandas as pd

Creating a Dataframe

Dataframes can be created from scratch and filled with data. However, what will happen most of the time is that we get some result in native Python format and transform it into a dataframe. We can do this immediately with our list of dictionaries. For that we just use the DataFrame() function:

mydataframe = pd.DataFrame(all_cells)
mydataframe
   location_row  location_col  surface       type
0            10            23      120  embryonic
1            32            18      130  embryonic
2            23             5       90  embryonic

We see that the dataframe is shown in a nicely formatted way. Also since we used a list of dictionaries, Pandas was smart enough to infer for us how the table should be made.

We can also create a dataframe from a 2D list:

all_cells_list = [[10,23,120,'embryonic'],[32,18,130,'embryonic'],[23,5,90,'embryonic']]
print(all_cells_list)
[[10, 23, 120, 'embryonic'], [32, 18, 130, 'embryonic'], [23, 5, 90, 'embryonic']]

If no column names are specified, simple numbers are used as headers and indices:

pd.DataFrame(all_cells_list)
    0   1    2          3
0  10  23  120  embryonic
1  32  18  130  embryonic
2  23   5   90  embryonic

We can pass a second parameter called columns to specify headers:

pd.DataFrame(all_cells_list, columns=['x','y', 'surf', 'type'])
    x   y  surf       type
0  10  23   120  embryonic
1  32  18   130  embryonic
2  23   5    90  embryonic

Accessing data

Let’s remember what’s in mydataframe:

mydataframe
   location_row  location_col  surface       type
0            10            23      120  embryonic
1            32            18      130  embryonic
2            23             5       90  embryonic

Using dataframes, we can recover entire columns very easily. For example, if we want to recover the surface parameter for all rows, we have two choices:

mydataframe.surface
0    120
1    130
2     90
Name: surface, dtype: int64
mydataframe['surface']
0    120
1    130
2     90
Name: surface, dtype: int64

If we want to recover the data of a specific cell, for example of the second row, we have to use loc[], which selects rows by their index label (here the label 1 is the second row). Note that loc uses brackets and not parentheses:

mydataframe.loc[1]
location_row           32
location_col           18
surface               130
type            embryonic
Name: 1, dtype: object
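
Row and column selection can also be combined in a single loc[] call to recover one value. The following is a small sketch, assuming Pandas is installed; the dataframe is rebuilt here under the illustrative name cells so that the example is self-contained:

```python
import pandas as pd

cells = pd.DataFrame([
    {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'},
    {'location_row': 32, 'location_col': 18, 'surface': 130, 'type': 'embryonic'},
    {'location_row': 23, 'location_col': 5, 'surface': 90, 'type': 'embryonic'},
])

second_row = cells.loc[1]                 # all columns of the row labelled 1
second_surface = cells.loc[1, 'surface']  # row label first, then column name

print(second_row['type'])   # embryonic
print(second_surface)       # 130
```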

Doing statistics

Pandas provides extensive tools to analyze the data contained in a dataframe. We are going to do only very simple operations to illustrate the power of this approach. For example, we can easily calculate the mean of each column:

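mydataframe.mean(numeric_only=True)
location_row     21.666667
location_col     15.333333
surface         113.333333
dtype: float64

Passing numeric_only=True restricts the calculation to the columns that contain numbers. Older Pandas versions skipped text columns automatically, whereas Pandas 2.0 and later raise an error without this option. The same works for median, standard deviation etc.:

mydataframe.std(numeric_only=True)
location_row    11.060440
location_col     9.291573
surface         20.816660
dtype: float64
mydataframe.median(numeric_only=True)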
location_row     23.0
location_col     18.0
surface         120.0
dtype: float64

If we don’t want to do the calculation for the entire table, we can also just do it for one variable:

mydataframe.surface.mean()
113.33333333333333

or

mydataframe['surface'].mean()
113.33333333333333
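
Another way to limit a calculation to certain columns is to select them first by passing a list of column names. The sketch below assumes Pandas is installed and rebuilds the table under the illustrative name cells; numeric_columns is likewise a name chosen for this example:

```python
import pandas as pd

cells = pd.DataFrame([
    {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'},
    {'location_row': 32, 'location_col': 18, 'surface': 130, 'type': 'embryonic'},
    {'location_row': 23, 'location_col': 5, 'surface': 90, 'type': 'embryonic'},
])

# Select only the numeric columns, then compute one mean per column
numeric_columns = ['location_row', 'location_col', 'surface']
column_means = cells[numeric_columns].mean()
print(column_means)

# The Pandas result agrees with the mean computed by hand
manual_mean = (120 + 130 + 90) / 3
print(abs(column_means['surface'] - manual_mean) < 1e-9)   # True
```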

Plotting with Pandas

A last nice feature of Pandas is the possibility to directly plot data without using the classical Matplotlib commands:

mydataframe.surface.plot();
[Figure: line plot of the surface column against the row index]