What makes a research project reproducible is not a simple question… Nonetheless, I do believe that one should be able to reproduce one’s own analysis without pain, even in the future. This may all sound obvious, but it is not so easy to achieve in practice!

Here are some tips for students, based on my iterations toward more reproducible practices:

  • Start from the beginning: although it may sometimes feel like a waste of time, putting yourself in a position to readily redo your analysis will make your life as a researcher easier (and at the very least help you rerun chunks from earlier work).

  • For each usual task (data analysis, plotting, citing references), you should master one tool down to its dirty details.

  • Relying on text-based formats (e.g. Markdown, LaTeX, CSV) is critical in order to use version control to maintain your code, write manuscripts, etc. This may well guide your choice of tools. A good starting point to learn version control with git is the Software Carpentry tutorial. GitHub’s documentation provides help on more advanced topics.

handling data

  • For data analysis and plotting, I very much enjoy (most of the time!) working with R’s tidyverse, in particular dplyr and ggplot2. If you are new to R or to data analysis from the command line, the companion book R for Data Science (available online) is the best introduction you can dream of. My advice: study sections 2 to 8 thoroughly; the later ones are useful for going deeper into specific topics based on your needs.
    Hint: if you need to speed up your analysis with dplyr have a look at its parallelized counterpart multidplyr.

  • Don’t overlook RStudio’s cheatsheets!

  • Follow simple guidelines when recording data in spreadsheets.

  • Use regular expressions whenever you can. Regexes are great, regexes are tough, and regexes are poorly taught (if at all!) unless you have a computer science background. Luckily, Damian Conway’s presentations are eye-opening (e.g. this 50-minute video) and there is a great cheatsheet for R.

  • To share large datasets, Zenodo is a great (free) service. If you use another one, make sure that your dataset gets a DOI.
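
To give a flavour of what regular expressions buy you, here is a minimal sketch that pulls metadata out of file names. It is written in Python only to keep it self-contained; the idea carries over to R. The file names, the naming scheme (date_strain_repN.csv) and the function name are all made up for illustration.

```python
import re

# Hypothetical file names; the naming scheme is an assumption for illustration.
FILENAMES = [
    "2017-03-14_strainA_rep1.csv",
    "2017-03-14_strainA_rep2.csv",
    "2017-03-15_strainB_rep1.csv",
    "notes.txt",
]

# Named groups make the pattern self-documenting:
# a date (YYYY-MM-DD), a strain label, and a replicate number.
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_"
    r"(?P<strain>[A-Za-z0-9]+)_"
    r"rep(?P<replicate>\d+)\.csv$"
)


def parse_filename(name):
    """Return a dict of metadata, or None if the name does not match."""
    match = PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["replicate"] = int(fields["replicate"])  # store as a number
    return fields


for name in FILENAMES:
    print(name, "->", parse_filename(name))
```

Files that do not follow the scheme (here notes.txt) come back as None, so they can be skipped or flagged rather than silently misread.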

handling text

  • Despite LaTeX’s popularity in quantitative fields, I believe that the time is ripe to leave it to advanced editing where microtypography matters… Simpler syntaxes (in particular Markdown) are sufficient for literate data analysis (e.g. with Rmarkdown or R notebooks) and even for more advanced tasks like writing a dissertation or an article. Whatever format you choose to rely on, don’t miss that pandoc is an incredibly powerful conversion tool between most formats (.md, .tex, .rtf, .docx, etc).
    Hint: the Markdown converter used by RStudio (pandoc-citeproc) is able to handle citations just like BibTeX would (and in fact more simply!).

  • For storing and citing articles, Zotero is the most versatile open-source software.

handling DNA sequences

  • Benchling is the 21st-century sequence editor to design and keep track of your molecular biology experiments: design primers, align sequencing chromatograms, test your next cloning in silico. It even has an integrated lab notebook!
    Big drawback: your data must be hosted on their servers…

handling microscopy data

academia survival kit