class: center, middle, inverse, title-slide

# Best Practices for Writing Reproducible Code

### Barbara Vreede, Utrecht University Library

---

These slides are at: [tinyurl.com/UED2020BP](https://bvreede.github.io/presentations/presentations/2020-09-17_BestPractices-lecture.html#1)

---

# Reproduction is vital

.pull-left[
]

--

.pull-right[
Reproduction is a vital contribution to science:

- A reproducible project reduces research waste
- Reproducing projects identifies the giants on whom to stand.
]

.footnote[
Source: [1,500 scientists lift the lid on reproducibility](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970), Nature (2016)
]

---

# The Four Facets of Reproducibility

.pull-left-medium[
#### Documentation
What do you need to execute this project? Where do you start?

#### Organization
Demonstrate a trustworthy workflow.

#### Automation
Automated analyses trace your steps, and prevent human error (or at the very least: document it).

#### Dissemination
Share your data, release your code, publish your findings.
]

--

.pull-right-medium[
]

---

# Getting started: project management

.pull-left[
- Contain your project in a single recognizable folder
- Distinguish folder _types_, name them accordingly:
  - **Read-only**: data, metadata
  - **Human-generated**: code, paper, documentation
  - **Project-generated**: clean data, figures, models...
- Initialize a **README** file, document your project
- Choose a **license** (more on that later)
]

.pull-right[
[Wilson _et al._ (2017)](https://doi.org/10.1371/journal.pcbi.1005510)
]

---

# A note on paths

- A project, contained in a single folder, should be transportable between computers.
- For this reason, you should use **relative paths** only: compare
  - `/Users/barbara/Dropbox/proteindomains/data/zincfinger.json`
  - `./data/zincfinger.json`
- `./` means: in this folder; `../` means: one folder up
- In case you cannot use relative paths, make sure your **absolute** folder location is retrieved once:
  - in a config file, with clear instructions to edit
  - OR pulled automatically from the system when the project is launched: e.g.:

```python
import os
projectdir = os.getcwd()
```

---

# Aspects of good quality code

.pull-left[
- Readable
- Reusable
- Robust
]

.footnote[
Source: [xkcd](https://xkcd.com/1513/)
]

.pull-right[
]

---

# Code readability

- Code is for computers, comments are for humans.

---

# Code readability

- ~~Code is for computers, comments are for humans.~~
- use descriptive names for functions and variables
  - start functions with a verb
  - make variable names _just_ long enough to be meaningful

--

#### Compare:

```python
for i in my_shopping_basket:
    if test(i) > 10:
        purch(i)
    else:
        disc(i)
```

```python
for item in basket:
    if testNecessity(item) > 10:
        purchase(item)
    else:
        discard(item)
```

---

# Code readability

- ~~Code is for computers, comments are for humans.~~
- use descriptive names for functions and variables
  - start functions with a verb
  - make variable names _just_ long enough to be meaningful
- use a consistent style
  - consistency will make your code easier to understand and maintain
  - consult a [style guide](https://www.python.org/dev/peps/pep-0008/) for your language (keep conventions, and don't reinvent the wheel)

--

#### Compare:

```r
myVar=original_variable+MOD(new.var)
```

```r
my_var = original_var + Modified(new_var)
```

---

# Code reusability: some guidelines

- Separate code and data: data is specific, code need not be
  - consider using a config file for project-specific (meta)data
  - but DO hard-code unchanging variables, e.g. `gravity = 9.80665`, **once**.
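
A minimal sketch of this split, assuming a hypothetical `config.json` with a `data_dir` key (both names are illustrative, not part of any real project): the project-specific location lives in the config file, while the unchanging constant is hard-coded once.

```python
import json
import tempfile
from pathlib import Path

# Unchanging value: hard-coded exactly once, then reused everywhere.
GRAVITY = 9.80665  # standard gravity in m/s^2


def load_config(config_path):
    """Read project-specific settings from a JSON config file."""
    with open(config_path) as handle:
        return json.load(handle)


def weight_in_newton(mass_kg):
    """Weight of a mass, using the single GRAVITY definition."""
    return mass_kg * GRAVITY


# Demo with a throwaway config file (hypothetical contents),
# so the sketch runs on any computer.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "config.json"
    config_file.write_text(json.dumps({"data_dir": "./data"}))
    config = load_config(config_file)

# The data location comes from the config, not from the code.
data_file = Path(config["data_dir"]) / "zincfinger.json"
print(data_file, weight_in_newton(2.0))
```

Moving the project to another computer then means editing the config file only; the code stays untouched.
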
--

- Do One Thing (and do it well)
  - One function for one purpose
  - One class for one purpose
  - One script for one purpose (no copy-pasting to recycle it!)

--

- Don't Repeat Yourself: use functions
  - Write routines in functions, i.e., code you reuse often
  - Identify potential functions by action: functions perform tasks (e.g. sorting, plotting, saving a file, transforming data...)

---

# Code reusability through functions

Functions are smaller code units responsible for one task.

- Functions are meant to be reused
- Functions accept arguments (though they may also be empty!)
- What arguments a function accepts is defined by its parameters

Functions do not necessarily make code shorter (at first)! Compare:

```python
indexATG = [n for n,i in enumerate(myList) if i == 'ATG']
indexAAG = [n for n,i in enumerate(myList) if i == 'AAG']
```

```python
def indexString(inputList,z):
    zIndex = [n for n,i in enumerate(inputList) if i == z]
    return zIndex

indexATG = indexString(myList,'ATG')
indexAAG = indexString(myList,'AAG')
```

---

# Think in building blocks!

.pull-left[
Small, cohesive units are much better than...
]

--

.pull-right[
... a customized behemoth!
]

---

# Code robustness

#### Protect the user* with error management:

- Make assumptions and expectations explicit.
  - check values before processing them
  - identify and manage exceptions
- Produce errors when expectations are not met.
- Consider error options, and perform error management:
  - redirect the program
  - log or report the error, to allow the user (or developer) to troubleshoot
  - if necessary: abort the run

--

#### Protect the developer* with tests:

- Use errors and error management to test expected behavior
- Capture unexpected errors to identify further options for error management

--

.footnote[
*) both of these are, first of all, you ;-)
]

---

# Throwing an error

```python
def read_vector_value(index=0, my_vector=[10,5,4,12,25]):
    if index > len(my_vector) - 1:
        raise IndexError('Index higher than vector length.')
    return my_vector[index]

read_vector_value(index=6)
```

```
## Traceback (most recent call last):
##   File "<string>", line 6, in <module>
##   File "<string>", line 3, in read_vector_value
## IndexError: Index higher than vector length.
```

--

#### Why not simply adjust the function output?

```python
def read_vector_value(index=0, my_vector=[10,5,4,12,25]):
    if index > len(my_vector) - 1:
        return None
    return my_vector[index]

print(read_vector_value(index=6))
```

```
## None
```

--

_Because it is unclear if `None` is expected behavior or indicative of a problem._

---

# Redirecting with exceptions

If you do not want to interrupt your script when an error is raised: use try/except.

NB: Python allows you to distinguish by error type!

```python
def read_vector_value(index=0, my_vector=[10,5,4,12,25]):
    if index > len(my_vector) - 1:
        raise IndexError('Index higher than vector length.')
    return my_vector[index]

try:
    read_vector_value(6)
except IndexError:
    print("This is an exception")
```

```
## This is an exception
```

---

# Unit tests

- Several methods available in Python
- One library that works well: `unittest`

```python
import unittest

def addNumber(x,y):
    z = x + y
    return(z)

class TestDummy(unittest.TestCase):
    def test_addNumber(self):
        """Test if the addition works properly"""
        self.assertEqual(addNumber(3,4),7)

unittest.main()
```

```
## .
## ----------------------------------------------------------------------
## Ran 1 test in 0.000s
##
## OK
```

---

# Project documentation

#### Why do you need documentation?

- You want yourself to understand how code written some time ago works
- You want others to understand how to (re-)use your code, and/or reproduce your project

--

#### For this you need to

- Explain what to install and how to get started in your __readme__
- Explain (parts of) your code with __comments__
- Explain in-depth use of your code in a __notebook__

---

# README

The README page is the first thing your user will see! The contents typically include one or more of the following:

> - __Configuration instructions__
> - __Installation instructions__
> - __Operating instructions__
> - A file manifest (list of files included)
> - Copyright and licensing information
> - Contact information for the distributor or programmer
> - Known bugs
> - Troubleshooting
> - Credits and acknowledgments

*From Wikipedia's [Readme](https://en.wikipedia.org/w/index.php?title=README&oldid=923233067) page*

---

# An example

---

# README examples and templates

Some examples from projects with high quality documentation:

Bigger community software projects:

- Python's [Pandas](https://github.com/pandas-dev/pandas)
- [Scikit-learn](https://github.com/scikit-learn/scikit-learn)
- Tidyverse's [Dplyr](https://github.com/tidyverse/dplyr)

Research software:

- e-science center's ['McFly' tool](https://github.com/NLeSC/mcfly)
- Utrecht University's [Automatic Systematic Review](https://github.com/msdslab/automated-systematic-review)
- Utrecht University's [MICE](https://github.com/stefvanbuuren/mice)

Templates and ideas:

- https://gist.github.com/PurpleBooth/109311bb0361f32d87a2
- https://github.com/matiassingers/awesome-readme

---

# Comments

#### Comments are annotations you write directly in the code source.
.pull-left[
They:

- are written for users who deal with your source code
- explain parts that are not intuitive from the code itself
- **do not** replace readable or structured code
- (in a specific structure) can be used to directly generate documentation for users.

.footnote[Comic source: [Geek & Poke](https://geekandpoke.typepad.com/geekandpoke/2011/06/code-commenting-made-easy.html)]
]

.pull-right[
]

---

# When *not* to use comments

- ...to repeat in natural language what is written in your code

```python
# Now we check if the age of a patient is greater than 18
if agePatient > 18:
```

--

- ...to turn old code into zombie code (fine for troubleshooting, but do not leave it in!)

```python
# Do not run this!!
# connectionType = optimizeMulticoreDeepLearning(myProteins)
# if connectionType == 1444:
#     connection = connectToHPC(currentUser, password)
```

--

- ...to replace version control, like git

```python
# removed on August 5
# if connectionType == 1444: ...
# Now, it connects to the API with o-auth2, updated 05/05/2016
```

---

# Comment lines: WHY over HOW

Comment lines are used to explain the **purpose** of some piece of code.

```python
# Bug fix GH 20601
# If the data frame is too big, the number of unique index combination
# will cause int32 overflow on windows environments.
# We want to check and raise an error before this happens
num_rows = np.max([index_level.size for index_level in self.new_index_levels])
num_columns = self.removed_level.
```

*From [Pandas reshape.py documentation](https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/reshape.py)*

---

# Docstrings

- Structured comments, associated with *segments* (rather than lines) of code, can be used to generate documentation for users* of your project.
- These comments are called *docstrings*.
- Docstrings are parsed as the first statement of a module, function, class, or method.
- Docstrings allow you to provide documentation for a function that is relevant to the user of that function.
- Writing docstrings makes you generate your documentation as you are generating the code: efficiently, comprehensively!

.footnote[*) Remember? That's probably you!]

---

# Generating docstrings

In Python, docstrings are string literals placed directly after a function declaration:

```python
def multiply(x,y):
    """ Multiply two numbers

    This function takes two input numbers and multiplies them.
    It returns the multiplied result.

    Keyword arguments:
    x -- the first value
    y -- the second value
    """
    return x*y
```

NB: a triple single quote (`'''`) works, but PEP 257 prefers triple double quotes (`"""`) for docstrings.

---

# Take home

#### Good code is Readable, Reusable, and Robust

- Invest in code readability
- Think in building blocks, and use functions
- Make your assumptions explicit

#### Documentation is an integral part of your project

- Readme introduces the user to your project
- Comments explain the code and its use
- Notebooks can be used for in-depth exploration

#### Use good project management to your advantage

- Project is located in a single folder
- Sub-folders are either __human-writable__, __read-only__, or __project-generated__

---

# Additional things to look out for

- Use version control
- Choose a license
- Beware of dependencies

---

# Use version control (like Git!)

.pull-left[
- It efficiently manages all versions of ~~your code~~ most of your files.
- It is easy to trace changes from one version to the next.
- It allows you to trace back your steps: if something breaks, you can figure out what happened.
- Branches facilitate experimentation.
- Using a service like GitLab, it allows you to collaborate and share your code.
  - you can integrate and automate code tests, saving yourself (a lot of) time!
]

.pull-right[
]

---

# Some Git resources, for reference

- A [Software Carpentry course on git](https://swcarpentry.github.io/git-novice/)
- A [version control + git tutorial](https://www.atlassian.com/git/tutorials/what-is-version-control) on Atlassian
- A [git cheatsheet](https://www.atlassian.com/git/tutorials/atlassian-git-cheatsheet) from Atlassian
- A book with [all the ins and outs of git](https://www.git-scm.com/book/en/v2) from the git website

---

# Choosing a license

- Copyright is implicit; others cannot use your code without your permission.
- Licensing gives that permission, and its boundaries and conditions.
- Choosing a license early on means being aware of your license as the project proceeds (and not creating conflicts).
- There are over [80 OSI-approved licenses](https://opensource.org/licenses/alphabetical) (and [many](http://dbad-license.org), [many](http://www.wtfpl.net) others) to choose from.

This is one I like to use:

What is important to you? What does your lab use?

[Choose your own license!](http://choosealicense.com)

---

# Dependencies

Dependencies and versions can stop your users/readers from being able to run your code.

For example: this code written in Python 2.7:

```python
print "Hello world!"
```

```
## Hello world!
```

--

No longer works in Python 3!

```python
print "Hello world!"
```

```
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Hello world!")?
```

--

Instead, we write:

```python
print("Hello world!")
```

```
## Hello world!
```

---

# The reproducibility trade-off

How far do you go towards reproducibility?

- **due diligence** starts at declaring dependencies: declare how your project works **for you**, in a requirements file or README.
  - What language, what version?
  - What packages/libraries do you load?
  - What OS do you use? (Does it work on your collaborator's system?)
- packaging dependencies with e.g. [pipenv](https://packaging.python.org/tutorials/managing-dependencies/).
- containers are awesome. Consider container tools like [Docker](https://www.docker.com/) and [Singularity](https://sylabs.io/).
- But! Even more user-friendly options exist: like [CodeOcean](https://codeocean.com) (`$$$`) and [binder](https://mybinder.org/) (free!).

---

# Do we have some time?

Take this time to practice your newfound skills! Choose one of these exercises:

#### Project Structure

- Design with your group a folder structure for your project.
- Determine which subfolders are (or will be) read-only, human-editable, or project-generated.

#### Readme

- Look at the Readme page of your project (do you have one?!)
- Take a look at the Readme examples in the slides
- Determine subjects you want to address (or already address them, if there is time!)

#### Bonus: fork the repository and use pull requests and merge in GitLab to work together!
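
---

# Project structure: a starting sketch

A minimal sketch for the project-structure exercise, assuming hypothetical folder names (`data`, `code`, `docs`, `results`) that are not prescribed anywhere in these slides; adapt them to your own project. It creates the folder skeleton plus a README, and the demo runs in a temporary folder so it is safe to try.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: one entry per folder, labeled with its folder type.
FOLDERS = {
    "data": "read-only",
    "code": "human-generated",
    "docs": "human-generated",
    "results": "project-generated",
}


def create_project(root):
    """Create the folder skeleton and a README under `root`."""
    root = Path(root)
    for name in FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text("# Project title\n")
    return sorted(path.name for path in root.iterdir())


# Demo in a throwaway folder; pass a relative path such as
# "./my_project" to create a real project folder instead.
with tempfile.TemporaryDirectory() as tmp:
    created = create_project(tmp)
print(created)
```
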