#Accessible?
- Does the data already exist?
- Is it accessible (physically, contractually, ethically, or from a cost perspective)?
- Do you have copyright and other permissions to use it?
- Is the data sensitive (private customer information, government data, etc.)?
- Do you need to anonymize the data, e.g. remove personally identifiable information (PII)?
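As an illustration of the anonymization point above, here is a minimal sketch that masks two common kinds of PII with regular expressions. The `notes` column and the patterns are hypothetical; real anonymization usually needs a dedicated tool and a review of every attribute.

```python
import re
import pandas as pd

# Hypothetical patterns for two common kinds of PII (emails, phone numbers).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def anonymize_text(value: str) -> str:
    """Replace email addresses and phone numbers with placeholders."""
    value = EMAIL_RE.sub("<EMAIL>", value)
    value = PHONE_RE.sub("<PHONE>", value)
    return value

# Hypothetical free-text column that may contain PII.
df = pd.DataFrame({"notes": ["Call Jane at +1 555 123 4567",
                             "Email: jane.doe@example.com"]})
df["notes"] = df["notes"].astype(str).map(anonymize_text)
print(df)
```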
#Sizeable?
- Usually you can only estimate how much data you need, based on experience.
- How frequently does new data get generated?
- Can you gather the necessary size of data within the time frame of the project?
- To find out how much data you might need, plot model performance as a function of training-set size (a learning curve) and see where it plateaus; see the learning-curve sketch after this list.
- If you cannot gather enough data in time, data augmentation can synthetically expand what you have; see the augmentation sketch after this list.
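A minimal sketch of the learning-curve idea, assuming scikit-learn and a stand-in dataset; with your own data you would substitute your feature matrix, labels, and model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)  # stand-in dataset

# Train on increasing fractions of the data and cross-validate each size.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, train_scores.mean(axis=1), marker="o", label="training score")
plt.plot(sizes, val_scores.mean(axis=1), marker="o", label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()  # a flattening validation curve suggests more data won't help much
```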
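And a minimal sketch of data augmentation for images using plain NumPy flips and shifts; a real pipeline would more likely use a library such as torchvision or albumentations, and the 28x28 batch here is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> np.ndarray:
    """Return a randomly flipped / slightly shifted copy of an (H, W, C) image."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                               # horizontal flip
    dy, dx = rng.integers(-2, 3, size=2)
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))      # small translation
    return out

# Hypothetical 28x28 grayscale batch: triple the dataset with augmented copies.
batch = rng.random((100, 28, 28, 1))
augmented = np.stack([augment(img) for img in batch for _ in range(2)])
dataset = np.concatenate([batch, augmented])
print(dataset.shape)  # (300, 28, 28, 1)
```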
#Two reasons for a plateau:
- The features are not informative enough for your algorithm to build a more performant model.
- The learning algorithm you used is incapable of training a complex enough model using the data you have.
#Some rules of thumb:
- 10 times the number of features (often an overestimate)
- 100 or 1000 times the number of classes (often an underestimate)
- 10 times the number of trainable parameters (usually applied to neural networks)
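Applied to a hypothetical problem with 40 features, 5 classes, and a network with roughly 20,000 trainable parameters, the rules of thumb give:

```python
# Back-of-the-envelope sizes for a hypothetical problem:
# 40 features, 5 classes, a network with ~20,000 trainable parameters.
n_features, n_classes, n_params = 40, 5, 20_000

print(10 * n_features)                    # 400      (10x features)
print(100 * n_classes, 1000 * n_classes)  # 500 5000 (100-1000x classes)
print(10 * n_params)                      # 200000   (10x trainable parameters)
```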
Just because you have big data doesn’t mean you should use all of it. A smaller sample can sometimes give better performance. It’s important to ensure that the sample is representative of the whole dataset, using sampling strategies such as stratified or systematic sampling.
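A minimal sketch of both strategies, assuming scikit-learn and a made-up DataFrame with a label column `y`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical big dataset with a class label column "y".
rng = np.random.default_rng(0)
big = pd.DataFrame({"x": rng.normal(size=100_000),
                    "y": rng.choice(["a", "b", "c"], size=100_000, p=[0.7, 0.2, 0.1])})

# Stratified sampling: keep class proportions in a 5% sample.
sample, _ = train_test_split(big, train_size=0.05, stratify=big["y"], random_state=0)

# Systematic sampling: take every k-th row after a random start.
k = 20
start = rng.integers(k)
systematic = big.iloc[start::k]

print(big["y"].value_counts(normalize=True))      # class proportions in full data
print(sample["y"].value_counts(normalize=True))   # preserved in the stratified sample
```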
#Useable?
- Data must be tidy: each row is one example and each column is one attribute.
- We have to fill in missing values through data imputation techniques (see the sketch after this list).
- We have to handle duplicates.
- Data can be expired or out of date.
- Data can be incomplete or unrepresentative of the phenomenon. e.g. photos of animals only during daylight.
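A minimal sketch of the imputation and duplicate-handling steps with pandas and scikit-learn; the table and column names are hypothetical, and median imputation is only one of several strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw table with missing values and a duplicate row.
df = pd.DataFrame({"age":    [34, np.nan, 29, 29],
                   "income": [52_000, 61_000, np.nan, np.nan],
                   "city":   ["Oslo", "Oslo", "Bergen", "Bergen"]})

df = df.drop_duplicates()  # handle exact duplicates

# Impute numeric columns with the median (one of many imputation strategies).
num_cols = ["age", "income"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
print(df)
```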
#Understandable?
- It’s important to know where each data attribute came from, both to avoid data leakage and to make sure you are tackling the correct business problem.
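A small, hypothetical illustration of provenance-driven leakage: an attribute recorded only after the outcome (here a made-up `cancellation_reason` column) must be excluded from the feature set.

```python
import pandas as pd

# Hypothetical churn table. "cancellation_reason" is only recorded *after*
# a customer churns, so it leaks the target and must not be used as a feature.
df = pd.DataFrame({
    "tenure_months":       [3, 26, 14],
    "monthly_fee":         [29.0, 49.0, 19.0],
    "cancellation_reason": ["price", None, "moved"],  # filled in after churn
    "churned":             [1, 0, 1],                 # target
})

# Keep only attributes that would actually be known at prediction time.
features_known_at_prediction_time = ["tenure_months", "monthly_fee"]
X = df[features_known_at_prediction_time]
y = df["churned"]
```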
#Reliable?
- Can you trust the data labels?
- Are the labels delayed or only indirectly observed?
- Was the data obtained through a feedback loop, where the model’s earlier outputs influence what gets collected or labeled?