Python structures¶
We have now seen that we can define different types of variables and that we can operate on them using either classical mathematical operations or functions and methods. Sometimes, however, we work with more than one variable at a time, and so we need to group them together in a coherent unit.
Python offers several such groupings, and we are going to look at two of them: lists and dictionaries. If you want to continue with Python, you should definitely study these in more detail, but in this course we are only going to use these two categories.
Lists¶
Creating lists¶
Lists are basically collections of variables. One of the main properties of lists is that each element can be modified after the list has been created, so a list is a “dynamic” (mutable) object.
Lists are surrounded by brackets [] and can be created like this:
mylist = [10, 5, 983, 20]
type(mylist)
list
You can create lists of almost anything, for example strings:
['a', 'b','c']
['a', 'b', 'c']
Or even mix different types, although it is best to avoid this:
['a', 10, 23.54]
['a', 10, 23.54]
List indexes¶
The simplest operation one can do on a list is to recover a specific value by its index:
mylist
[10, 5, 983, 20]
mylist[2]
983
Note that Python uses 0-based indexing, meaning that the first element has index 0, so mylist[2] is the third element!
As said before, lists are dynamic objects, so one can reassign values:
mylist[2] = 25
mylist
[10, 5, 25, 20]
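The indexing rules above can be sketched in a few lines. This is only an illustration: the variable names are made up, and negative indices and slices go slightly beyond what this section uses.

```python
# Hypothetical example list with the same values as in the text
values = [10, 5, 983, 20]

first = values[0]     # index 0 is the first element
last = values[-1]     # negative indices count from the end
middle = values[1:3]  # slice: elements at index 1 and 2 (end index excluded)

values[2] = 25        # reassign one element in place

print(first, last, middle, values)
```

Asking for an index that does not exist, for example values[10], raises an IndexError.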
Who is who?¶
An aspect of Python that can be very confusing is that assigning an object to a new variable does not always give you an independent copy. Let’s clarify this. For example with simple numbers we have:
a = 5
b = a
b
5
If we now assign a new value to a:
a = 10
b still has the old value:
b
5
Now let’s do something similar with a list. We have a first list:
mylist = [10, 5, 983, 20]
mylist
[10, 5, 983, 20]
Now we copy it to a new list:
mylist2 = mylist
mylist2
[10, 5, 983, 20]
And we modify the original list:
mylist[2] = 10000
mylist
[10, 5, 10000, 20]
What happened to the second list?
mylist2
[10, 5, 10000, 20]
It was changed too! This is because the two names mylist and mylist2 refer to the same object. The assignment did not copy the list, it only gave it a second name. If you really want to create an independent copy of the first list, you can use the copy() method:
mylist2 = mylist.copy()
mylist2
[10, 5, 10000, 20]
Now we can change mylist without affecting mylist2:
mylist[1] = 7000
mylist
[10, 7000, 10000, 20]
mylist2
[10, 5, 10000, 20]
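One way to check whether two names point to the same list is the is operator. The following sketch uses made-up variable names and repeats the experiment above with that check added:

```python
original = [10, 5, 983, 20]

alias = original               # second name for the same list object
independent = original.copy()  # new list object with the same elements

original[2] = 10000

print(alias is original)        # True: both names refer to one object
print(independent is original)  # False: a separate object
print(alias)                    # [10, 5, 10000, 20]
print(independent)              # [10, 5, 983, 20]
```

Note that copy() is a shallow copy: if the list itself contains lists, those inner lists are still shared. The deepcopy() function of the standard copy module handles that case.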
Using functions and methods¶
Just like other variables, lists have associated functions and methods. One built-in function that we have already seen is len():
len(mylist)
4
The list does indeed contain four elements. We can again find all associated methods using dir():
dir(mylist)
['__add__',
'__class__',
'__contains__',
'__delattr__',
'__delitem__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__gt__',
'__hash__',
'__iadd__',
'__imul__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__len__',
'__lt__',
'__mul__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__reversed__',
'__rmul__',
'__setattr__',
'__setitem__',
'__sizeof__',
'__str__',
'__subclasshook__',
'append',
'clear',
'copy',
'count',
'extend',
'index',
'insert',
'pop',
'remove',
'reverse',
'sort']
Some of the methods are important and we will use them very often. For example append() adds a value at the end of a list:
mylist
[10, 7000, 10000, 20]
mylist.append(230)
mylist
[10, 7000, 10000, 20, 230]
We see here that the list has been modified in place: we didn’t have to assign the result to a new variable. The append() method is important as we will often start with an empty list and fill it progressively:
mylist = []
mylist.append(10)
print(mylist)
[10]
mylist.append(30)
print(mylist)
[10, 30]
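In practice, filling a list progressively usually happens inside a loop. A minimal sketch, assuming some hypothetical input values (the names measurements and squares are invented for this example):

```python
# Hypothetical input values
measurements = [3, 8, 12, 5]

# Start with an empty list and append one result per input value
squares = []
for m in measurements:
    squares.append(m ** 2)

print(squares)  # [9, 64, 144, 25]
```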
Dictionaries¶
Sometimes it’s useful to store very diverse information in a single container, and in that case it is also useful to be able to remember what exactly was stored in that container. For that we can use dictionaries. As the name says, these structures are composed of pairs of “words” (keys) and “definitions” (values), where the definition can be a number, a string, a list etc. Let’s imagine we have detected a cell in an image and want to store its location, size and type. We can define the following dictionary:
mydict = {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'}
Now whenever we want to recover the cell size (its surface), we can look it up by its key:
mydict['surface']
120
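Beyond reading a value, a few other dictionary operations come up often. The sketch below is illustrative only: the keys perimeter and volume are hypothetical and not part of the example in the text.

```python
cell = {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'}

cell['surface'] = 125         # update the value of an existing key
cell['perimeter'] = 42        # hypothetical new key: assignment creates it

has_type = 'type' in cell     # test whether a key exists
volume = cell.get('volume')   # get() returns None for a missing key

print(cell)
print(has_type, volume)
```

Accessing a missing key with brackets, as in cell['volume'], raises a KeyError, which is why get() can be convenient.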
Again who’s who?¶
Dictionaries behave in the same way as lists: a simple assignment does not create a true copy:
mydict2 = mydict
mydict['surface'] = 5000
mydict2
{'location_col': 23, 'location_row': 10, 'surface': 5000, 'type': 'embryonic'}
mydict
{'location_col': 23, 'location_row': 10, 'surface': 5000, 'type': 'embryonic'}
mydict2
{'location_col': 23, 'location_row': 10, 'surface': 5000, 'type': 'embryonic'}
Also, if you add a new key through one variable, it appears through the other one as well, since both refer to the same dictionary:
mydict2['test'] = 30
mydict
{'location_col': 23,
'location_row': 10,
'surface': 5000,
'test': 30,
'type': 'embryonic'}
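As with lists, dictionaries have a copy() method that creates an independent object. A short sketch with made-up variable names:

```python
cell = {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'}

alias = cell               # same dictionary, second name
independent = cell.copy()  # separate dictionary with the same pairs

cell['surface'] = 5000
alias['test'] = 30

print(cell)         # contains surface 5000 and the new key 'test'
print(independent)  # unchanged: surface 120, no 'test' key
```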
Grouping dictionaries¶
If we imagine that we have three detected cells, we can then group all the information within a list:
mydict = {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'}
mydict2 = {'location_row': 32, 'location_col': 18, 'surface': 130, 'type': 'embryonic'}
mydict3 = {'location_row': 23, 'location_col': 5, 'surface': 90, 'type': 'embryonic'}
all_cells = [mydict, mydict2, mydict3]
all_cells
[{'location_col': 23, 'location_row': 10, 'surface': 120, 'type': 'embryonic'},
{'location_col': 18, 'location_row': 32, 'surface': 130, 'type': 'embryonic'},
{'location_col': 5, 'location_row': 23, 'surface': 90, 'type': 'embryonic'}]
You can recover all the “words” (keys) defined in a dictionary using the keys() method:
mydict.keys()
dict_keys(['location_row', 'location_col', 'surface', 'type'])
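Once the cells are grouped in a list, list indexing and dictionary keys can be combined. The following sketch rebuilds the list from above and collects one property for all cells. The variable names second_surface and surfaces are invented for this example.

```python
all_cells = [
    {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'},
    {'location_row': 32, 'location_col': 18, 'surface': 130, 'type': 'embryonic'},
    {'location_row': 23, 'location_col': 5, 'surface': 90, 'type': 'embryonic'},
]

# First pick a cell by its list index, then a property by its key
second_surface = all_cells[1]['surface']

# Collect the surface of every cell in a new list
surfaces = []
for cell in all_cells:
    surfaces.append(cell['surface'])

print(second_surface)  # 130
print(surfaces)        # [120, 130, 90]
```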
Dataframes¶
What native Python is lacking is a data format that simplifies handling tabular data and doing statistics on them. By this I mean something like an Excel sheet where you have multiple columns and can, for example, take an average for each column. This type of tabular data is provided by the Pandas package and its DataFrame structure. We will see here only a tiny fraction of the possibilities offered by Pandas, so read more about it if you think it might help you. Let’s import Pandas:
import pandas as pd
Creating a Dataframe¶
Dataframes can be created from scratch and filled with data. However, what will happen most of the time is that we get some result in native Python format and transform it into a dataframe. We can do this immediately with our list of dictionaries. For that we just use the DataFrame() function:
mydataframe = pd.DataFrame(all_cells)
mydataframe
| | location_row | location_col | surface | type |
|---|---|---|---|---|
| 0 | 10 | 23 | 120 | embryonic |
| 1 | 32 | 18 | 130 | embryonic |
| 2 | 23 | 5 | 90 | embryonic |
We see that the dataframe is shown in a nicely formatted way. Also, since we used a list of dictionaries, Pandas was able to infer the layout of the table for us: each dictionary becomes a row and each key becomes a column.
We can also create a dataframe from a 2D list:
all_cells_list = [[10,23,120,'embryonic'],[32,18,130,'embryonic'],[23,5,90,'embryonic']]
print(all_cells_list)
[[10, 23, 120, 'embryonic'], [32, 18, 130, 'embryonic'], [23, 5, 90, 'embryonic']]
If no column names are specified, simple numbers are used as column headers, just as they are for the row indices:
pd.DataFrame(all_cells_list)
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 10 | 23 | 120 | embryonic |
| 1 | 32 | 18 | 130 | embryonic |
| 2 | 23 | 5 | 90 | embryonic |
We can pass a second parameter called columns to specify the headers:
pd.DataFrame(all_cells_list, columns=['x','y', 'surf', 'type'])
| | x | y | surf | type |
|---|---|---|---|---|
| 0 | 10 | 23 | 120 | embryonic |
| 1 | 32 | 18 | 130 | embryonic |
| 2 | 23 | 5 | 90 | embryonic |
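A third common layout, not used elsewhere in this section, is a dictionary of lists where each key is a column name and each list holds the values of that column. A minimal sketch, assuming Pandas is installed (the names columns and cells_df are invented for this example):

```python
import pandas as pd

# Hypothetical column-wise layout of the same three cells
columns = {
    'x': [10, 32, 23],
    'y': [23, 18, 5],
    'surf': [120, 130, 90],
    'type': ['embryonic', 'embryonic', 'embryonic'],
}

cells_df = pd.DataFrame(columns)

print(cells_df.shape)          # (3, 4): three rows, four columns
print(list(cells_df.columns))  # ['x', 'y', 'surf', 'type']
```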
Accessing data¶
Let’s remember what’s in mydataframe:
mydataframe
| | location_row | location_col | surface | type |
|---|---|---|---|---|
| 0 | 10 | 23 | 120 | embryonic |
| 1 | 32 | 18 | 130 | embryonic |
| 2 | 23 | 5 | 90 | embryonic |
Using dataframes, we can recover entire columns very easily. For example if we want to recover the surface parameter for all cells (rows), we have two choices:
mydataframe.surface
0 120
1 130
2 90
Name: surface, dtype: int64
mydataframe['surface']
0 120
1 130
2 90
Name: surface, dtype: int64
If we want to recover the data of a specific cell, for example the second row, we have to use loc[], which selects rows by their index label (here the label of the second row is 1). Note that loc[] uses square brackets and not parentheses:
mydataframe.loc[1]
location_row 32
location_col 18
surface 130
type embryonic
Name: 1, dtype: object
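The sketch below shows loc[] next to two closely related forms: loc[] with a row label and a column name to get a single value, and iloc[], which selects by position rather than by label. It assumes Pandas is installed and rebuilds the dataframe under the made-up name df.

```python
import pandas as pd

df = pd.DataFrame([
    {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'},
    {'location_row': 32, 'location_col': 18, 'surface': 130, 'type': 'embryonic'},
    {'location_row': 23, 'location_col': 5, 'surface': 90, 'type': 'embryonic'},
])

row = df.loc[1]               # row whose index label is 1
value = df.loc[1, 'surface']  # single value: row label 1, column 'surface'
by_position = df.iloc[1]      # second row by position

print(row)
print(value)  # 130
```

With the default index, labels and positions coincide, so loc[1] and iloc[1] return the same row here. They differ as soon as the index is not 0, 1, 2, ...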
Doing statistics¶
Pandas provides extensive tools to analyze the data contained in a dataframe. We are going to do only very simple operations to illustrate the power of this approach. For example we can easily calculate the mean of each numeric column:
mydataframe.mean(numeric_only=True)
location_row 21.666667
location_col 15.333333
surface 113.333333
dtype: float64
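The work is done only on the columns that contain numbers. Older Pandas versions select them automatically; since Pandas 2.0, a text column such as type raises an error unless numeric_only=True is passed. The same applies to the median, standard deviation etc.:
mydataframe.std(numeric_only=True)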
location_row 11.060440
location_col 9.291573
surface 20.816660
dtype: float64
mydataframe.median(numeric_only=True)
location_row 23.0
location_col 18.0
surface 120.0
dtype: float64
If we don’t want to do the calculation for the entire table, we can also do it for just one variable:
mydataframe.surface.mean()
113.33333333333333
or
mydataframe['surface'].mean()
113.33333333333333
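Statistics can also be computed for a chosen subset of columns, or summarized all at once with describe(). This is a minimal sketch, assuming Pandas is installed, with the dataframe rebuilt under the made-up name df:

```python
import pandas as pd

df = pd.DataFrame([
    {'location_row': 10, 'location_col': 23, 'surface': 120, 'type': 'embryonic'},
    {'location_row': 32, 'location_col': 18, 'surface': 130, 'type': 'embryonic'},
    {'location_row': 23, 'location_col': 5, 'surface': 90, 'type': 'embryonic'},
])

# Mean of two selected columns (note the double brackets: a list of names)
subset_means = df[['location_row', 'surface']].mean()

# Count, mean, std, min, quartiles and max for each numeric column
summary = df.describe()

print(subset_means)
print(summary)
```

For a dataframe that mixes numeric and text columns, describe() reports only the numeric columns by default, so type does not appear in the summary.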
Plotting with Pandas¶
A last nice feature of Pandas is the possibility to directly plot data without using the classical Matplotlib commands (Matplotlib must still be installed, since Pandas uses it behind the scenes):
mydataframe.surface.plot();
