- Schema file that describes the features’ expected properties.
- Machine-readable (e.g., JSON, CSV, or YAML), versioned, and updated whenever the features change.
- Note: we can automate the creation and ingestion of the schema file.
- Creation: loop over the features (e.g., the columns of a DataFrame) and extract, for each one, the feature name, data type, missing values (% or count), unique values, range (min/max), mean, variance, allowed values (for categorical features), transformation steps (feature engineering), popularity, etc.
- Deep dive, e.g., popularity:
- It depends on how feature popularity is defined, i.e., which metric we use. Candidates:
- Missing values: the fewer the missing values in a feature, the more popular that feature is.
- Importance score in a model: score the impact of a feature on a model's predictions (e.g., as in scikit-learn's "Feature importances with a forest of trees" example).
- Unique value distribution: the more unique values a feature has relative to the number of rows, the less useful it tends to be (e.g., an ID column).
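
The creation step above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it uses only the Python standard library and treats a plain dict of columns as a stand-in for a DataFrame, with `None` marking a missing value. The function name `build_schema`, the field names (`missing_fraction`, `unique_ratio`, etc.), the version string, and the toy data are all assumptions made for the example. It covers the two data-driven popularity signals (missing values and unique values); a model-based importance score would have to be computed separately and merged in.

```python
import json
import statistics


def build_schema(columns, version="0.1.0"):
    """Build a schema dict from a mapping of feature name -> list of values.

    Assumption: missing values are represented as None.
    """
    schema = {"version": version, "features": {}}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        unique = set(present)
        entry = {
            # Type of the first non-missing value, as a rough dtype label.
            "dtype": type(present[0]).__name__ if present else "unknown",
            "missing_fraction": 1 - len(present) / len(values) if values else 0.0,
            "unique_count": len(unique),
            # A ratio close to 1.0 suggests an ID-like feature.
            "unique_ratio": len(unique) / len(present) if present else 0.0,
        }
        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            entry["min"] = min(present)
            entry["max"] = max(present)
            entry["mean"] = statistics.fmean(present)
            entry["variance"] = statistics.pvariance(present)
        else:
            # Treat non-numeric features as categorical.
            entry["allowed_values"] = sorted(unique, key=str)
        schema["features"][name] = entry
    return schema


# Hypothetical toy data, for illustration only.
columns = {
    "user_id": [101, 102, 103, 104],
    "age": [25, None, 35, 40],
    "plan": ["free", "pro", "free", None],
}

schema = build_schema(columns)
print(json.dumps(schema, indent=2))
```

In this toy example, `user_id` gets a `unique_ratio` of 1.0, which flags it as ID-like, while `age` and `plan` each have a `missing_fraction` of 0.25. The resulting dict serializes directly to JSON, so it can be written to a versioned schema file and reloaded at ingestion time.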