#Reproducibility
- Avoid transforming the data manually, or using shell commands with regular expressions, “quick and dirty” awk or sed commands, and piped expressions.
- Each step in data collection and transformation has to be implemented as a software script, e.g. python or R scripts with their inputs and outputs.
- Run the entire process as a pipeline (or workflow)
#Data first, algorithm second
Spend most of our effort and time on getting more data of wide variety and high quality, instead of trying to squeeze the maximum out of a learning algorithm.