File handling#
Data are often stored in multiple files or folders that need to be read. Python has several modules and functions that help with handling files, and especially file paths, which will save you a lot of manual work. We review some of these functionalities here.
Pathlib#
Whenever you want to read a file, you need to specify its location on your computer. You can usually do that in a relative or absolute manner:
relative: indicate the location of the file relative to your current location (usually the location of the notebook)
absolute: indicate the full path of your file on your system
You can often specify a path using a simple string, but this can be tedious, as you will for example have to construct the paths of subfolders “manually”. We highly recommend using the pathlib module, which provides a lot of very useful tools to handle path names, extensions etc.
First we use the Path object to define the path of the folder containing data. For example we might want to specify that the location is “right here” using a dot:
from pathlib import Path
folder = Path('.')
folder is not just a string containing ‘.’ but an actual object that is much more useful. For example we can ask for the absolute path:
folder = folder.absolute()
or we can ask if the defined path is a folder:
folder.is_dir()
True
folder.is_file()
False
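The difference between relative and absolute paths described above can also be checked directly on a Path object. The following is a minimal sketch using the is_absolute and exists methods of pathlib; the file name file_that_does_not_exist.txt is a hypothetical name chosen only for illustration.

```python
from pathlib import Path

# '.' is a relative path: it only makes sense with respect to the current location
folder = Path('.')
print(folder.is_absolute())           # False

# absolute() prepends the current working directory
absolute_folder = folder.absolute()
print(absolute_folder.is_absolute())  # True

# exists() tells us whether a path points to anything at all on disk
# (hypothetical file name, assumed not to exist in the current folder)
missing = folder / 'file_that_does_not_exist.txt'
print(missing.exists())
```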
There are also many useful functions to handle the path itself. For example if you have a folder and a file name, you can simply join them with:
folder.joinpath('environment.yml').is_file()
True
This spares you the hassle of adding slashes and makes sure that your code will also work on another OS.
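As an alternative to joinpath, pathlib also lets you join path components with the / operator. The sketch below compares the two; the folder names data and raw and the file measurements.csv are hypothetical and do not need to exist, since joining paths does not touch the file system.

```python
from pathlib import Path

folder = Path('.')

# joinpath and the / operator build the same path
config_a = folder.joinpath('environment.yml')
config_b = folder / 'environment.yml'
print(config_a == config_b)  # True

# Several components can be chained (hypothetical folder and file names)
nested = folder / 'data' / 'raw' / 'measurements.csv'
print(nested.parts)  # ('data', 'raw', 'measurements.csv')
```

Which form to use is mostly a matter of taste: the operator is more compact, while joinpath reads more explicitly.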
Listing files#
Now we can use methods attached to this path object to explore its contents. For example we can check the folder contents with iterdir:
files_in_folder = folder.iterdir()
The returned object is a generator, a type of object that we haven’t seen yet. It behaves somewhat like a list whose contents can only be queried one after the other, for example using the next function:
next(files_in_folder)
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/22-Seaborn_distributions_relations.ipynb')
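One consequence of this one-after-the-other behavior is that each element can only be retrieved once. The following sketch illustrates this on a temporary folder created with the tempfile module; the folder and the file names a.txt, b.txt and c.txt are assumptions made only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    # Create three empty files (hypothetical names)
    for name in ['a.txt', 'b.txt', 'c.txt']:
        (demo_folder / name).touch()

    entries = demo_folder.iterdir()
    first = next(entries)      # retrieves one entry and moves on
    remaining = list(entries)  # only the entries not retrieved yet
    leftover = list(entries)   # nothing is left at this point

print(len(remaining))  # 2
print(leftover)        # []
```

If you need to go through the contents a second time, you therefore have to call iterdir again.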
For the moment we just transform this generator into a regular list:
files_in_folder = folder.iterdir()
files_in_folder = list(files_in_folder)
files_in_folder
[PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/22-Seaborn_distributions_relations.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/12-Minimal_plotting.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/30-AI_assistants.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/19-Matplotlib_content.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/environment.yml'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/26-Alternatives.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/14-Back_to_Pandas.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/09-Numpy_arrays.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/images'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/16-DataFrame_indexing.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/Solutions'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/21-Seaborn_concept.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/17-Pandas_combine.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/29-Image_processing.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/05-Data_structures.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/02-Notebooks.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/06-File_handling.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/Readme.md'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/_toc.yml'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/27-scipy_statsmodels.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/13-Images_as_arrays.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/.gitignore'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/24-Matplotlib_statistics.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/23-Seaborn_regression.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/_config.yml'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/08-Classes.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/.github'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/book'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/01-Introduction.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/11-Numpy_indexing.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/10-Numpy_maths.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/28-scikit-learn.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/03-Variables.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/18-Real_world.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/25-Matplotlib_annotations.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/.git'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/15-Operate_on_DataFrames.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/04-Functions_packages.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/07-Flow_control.ipynb'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/plots'),
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy/20-Matplotlib_non_data_elements.ipynb')]
Investigating files#
Our goal will be to analyze all the notebook files in that folder. However we will need to do some clean-up first as some of the files should be discarded.
Again, each of the elements of files_in_folder is a Path object and we can get multiple features such as:
the folder the file belongs to:
files_in_folder[0].parent
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy')
the name of the file:
files_in_folder[0].name
'22-Seaborn_distributions_relations.ipynb'
the two parts of the file name: stem and extension:
files_in_folder[0].stem
'22-Seaborn_distributions_relations'
files_in_folder[0].suffix
'.ipynb'
While all these elements could be recovered from a path represented as a simple string, the Path object makes this much easier, so we definitely recommend using it!
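These attributes can be combined to do the kind of clean-up mentioned above, for example keeping only the notebook files. The sketch below is one possible way to do it, run on a temporary folder with hypothetical contents so that it does not depend on your own files.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    # Hypothetical contents: two notebooks, one other file, one sub-folder
    for name in ['01-Introduction.ipynb', '02-Notebooks.ipynb', 'environment.yml']:
        (demo_folder / name).touch()
    (demo_folder / 'images').mkdir()

    # Keep only regular files whose extension is .ipynb, sorted by path
    notebooks = sorted(
        p for p in demo_folder.iterdir()
        if p.is_file() and p.suffix == '.ipynb'
    )
    notebook_names = [p.name for p in notebooks]

print(notebook_names)  # ['01-Introduction.ipynb', '02-Notebooks.ipynb']
```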
Other functions and modules#
A few other functionalities are useful to know. For example you can directly find files whose names match a certain pattern using the glob method, where * stands for any sequence of characters:
folder.glob('*.ipynb')
<generator object Path.glob at 0x106195cf0>
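Like iterdir, glob returns a generator, so the matches only become visible once you iterate over it or convert it to a list. Here is a minimal sketch, again on a temporary folder with hypothetical file names.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    # Hypothetical contents
    for name in ['01-Introduction.ipynb', '02-Notebooks.ipynb', 'Readme.md']:
        (demo_folder / name).touch()

    # All files ending in .ipynb, sorted for a reproducible order
    match_names = sorted(p.name for p in demo_folder.glob('*.ipynb'))

    # A more specific pattern: names starting with '02-'
    second_only = sorted(p.name for p in demo_folder.glob('02-*'))

print(match_names)  # ['01-Introduction.ipynb', '02-Notebooks.ipynb']
print(second_only)  # ['02-Notebooks.ipynb']
```

Sorting is used here because the order in which files are returned is not guaranteed.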
The os module can also be very useful. It gives you a lot of information about your system, including your current location. For example:
import os
os.getcwd()
'/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy'
Here we can see that the current location is the folder where this notebook is located. Naturally, we can transform the returned path into a Path object to further manipulate it:
Path(os.getcwd())
PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/DAVPy')
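pathlib also offers a shortcut for this: the class method Path.cwd returns the current working directory directly as a Path object. The short sketch below checks that both approaches give the same result.

```python
import os
from pathlib import Path

# Two ways to obtain the current working directory as a Path object
cwd_from_os = Path(os.getcwd())
cwd_from_pathlib = Path.cwd()

print(cwd_from_os == cwd_from_pathlib)  # True
print(cwd_from_pathlib.is_absolute())   # True
```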
We can also create new directories using the mkdir method:
newfolder = folder.joinpath('newfolder')
newfolder.mkdir(parents=True, exist_ok=True)
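The two options used above deserve a short explanation: parents=True also creates any missing intermediate folders, and exist_ok=True avoids an error if the folder already exists. The sketch below illustrates both in a temporary folder; the names results and run1 are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # 'results' does not exist yet, so parents=True is needed to create it too
    nested = base / 'results' / 'run1'
    nested.mkdir(parents=True, exist_ok=True)

    # Thanks to exist_ok=True, calling it a second time is harmless
    nested.mkdir(parents=True, exist_ok=True)
    created = nested.is_dir()

    # Without exist_ok=True, an existing folder raises FileExistsError
    try:
        nested.mkdir()
        raised = False
    except FileExistsError:
        raised = True

print(created)  # True
print(raised)   # True
```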
Exercise#
Create a Path object that points to the main course repository on your computer (the one containing the Day1, Day2 etc. folders).
Create a list of the contents of that directory.
For a few of the files, check if they are files or directories.
Create a new folder in that directory called myfolder.