Storing and Documenting Features - Yousef's Notes
Storing and Documenting Features

Storing and Documenting Features

  • Schema file that describes the features’ expected properties.
  • Machine-readable (e.g., JSON, CSV, or YAML), versioned, and updated when we update our features.
  • Note: we can automate the creation and ingestion of the schema file.
  • Creation: Loop over the features (e.g., the column in a DataFrame) and extract the feature name, data type, missing values (% or count), unique values, range, min/max, mean, variance, allowed values (categorical), transformation steps (feature engineering), popularity, etc.
  • Deep dive. E.g., popularity:
  • Depends on the definition of feature popularity: which metric?
  • Missing values: the fewer the missing values in a feature, the more popular that feature is.
  • Importance score in a model: score the impact of a feature in a model (e.g., Feature importances with a forest of trees)
  • Unique value distribution: the more unique values a feature has, the less useful it can be (e.g., ID).