To divide your data into training and holdout sets, the partitioning has to satisfy several conditions:
- The split is applied to the raw data. Do the split before everything else to avoid data leakage.
- The data is randomized before the split.
- The validation and test sets follow the same distribution.
- Leakage during the split is avoided.
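The first two conditions can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper name `shuffled_split`, the 80/10/10 fractions, and the toy data are hypothetical, not something the notes prescribe. The point is that the raw data is shuffled and split first, and any statistic used for preprocessing (here a mean) is computed from the training set only.

```python
import random

def shuffled_split(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle raw examples, then cut them into train/validation/test lists.

    Hypothetical helper; the fractions are illustrative defaults.
    """
    rng = random.Random(seed)
    shuffled = list(examples)          # copy so the caller's data is untouched
    rng.shuffle(shuffled)              # randomize before splitting
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train, val, test

# Toy raw data (an assumption for the example).
raw = [float(x) for x in range(100)]
train, val, test = shuffled_split(raw)

# Preprocessing happens after the split: the mean comes from the training
# set only and is then reused for the validation and test sets.
train_mean = sum(train) / len(train)
train_centered = [x - train_mean for x in train]
val_centered = [x - train_mean for x in val]
test_centered = [x - train_mean for x in test]
```

Because validation and test sets are cut from the same shuffled list, they are drawn from the same distribution. In practice a library routine such as scikit-learn's `train_test_split` is a common alternative to a hand-written helper.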
A small dataset of fewer than a thousand examples does best with about 90% of the data used for training. In this case, we might decide not to keep a distinct validation set and instead simulate one with cross-validation.
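As a rough sketch of that idea, the code below holds out 10% of a small dataset as a test set and runs k-fold cross-validation over the remaining 90%, so every remaining example serves as validation data exactly once. The function name `k_fold_indices`, the choice of `k=5`, and the dataset size of 50 are assumptions made for illustration.

```python
import random

def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles the indices once, deals them into k folds,
    and uses each fold in turn as the validation set.
    """
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Small dataset (size is an assumption): hold out 10% for testing first.
n_examples = 50
all_idx = list(range(n_examples))
random.Random(0).shuffle(all_idx)
n_test = n_examples // 10
test_idx = all_idx[:n_test]
dev_idx = all_idx[n_test:]                      # the 90% used for training

# Cross-validate on the 90%; positions are mapped back to dataset indices.
for train_pos, val_pos in k_fold_indices(len(dev_idx), k=5):
    train_idx = [dev_idx[p] for p in train_pos]
    val_idx = [dev_idx[p] for p in val_pos]
    # Fit on train_idx, evaluate on val_idx, then average the k scores.
```

The test indices never enter the cross-validation loop, so they remain usable for a final, unbiased evaluation.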
# Leakage During Partitioning
Group leakage may occur during partitioning.
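For example, if you have brain scans of patients, with multiple scans per patient, scans of the same patient might end up in both the training and holdout sets. The model can then learn the specifics of individual patients instead of patterns that generalize, and the holdout score looks better than it really is.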
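The solution to group leakage is group partitioning: keep all examples that belong to the same group (here, all scans of one patient) together in the same set.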
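A minimal sketch of group partitioning, assuming each example carries a group label such as a patient ID. The function `group_split`, the `holdout_frac` parameter, and the sample IDs are hypothetical. Note that the fraction is applied to the number of groups, not the number of examples, so the resulting set sizes are only approximate.

```python
import random

def group_split(group_ids, holdout_frac=0.2, seed=0):
    """Return (train_idx, holdout_idx) so that no group appears on both sides.

    Hypothetical helper: whole groups, not individual examples, are shuffled
    and assigned to the holdout set.
    """
    groups = sorted(set(group_ids))     # deterministic order before shuffling
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_holdout = max(1, int(len(groups) * holdout_frac))
    holdout_groups = set(groups[:n_holdout])
    train_idx = [i for i, g in enumerate(group_ids) if g not in holdout_groups]
    holdout_idx = [i for i, g in enumerate(group_ids) if g in holdout_groups]
    return train_idx, holdout_idx

# One patient ID per scan (made-up IDs); several patients have multiple scans.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
train_idx, holdout_idx = group_split(patient_ids)

train_patients = {patient_ids[i] for i in train_idx}
holdout_patients = {patient_ids[i] for i in holdout_idx}
```

Here `train_patients` and `holdout_patients` never overlap, which is the property group partitioning is meant to guarantee. Libraries offer ready-made versions of this idea, for example scikit-learn's `GroupShuffleSplit` and `GroupKFold`.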