Machine Learning (ML) Feature Store System

A Machine Learning (ML) Feature Store System is a specialized database system (for ML features) that can support feature store tasks.

Context:
- It can (often) be a part of an ML Modeling System.
- It can range from being an Offline Feature Store to being an Online Feature Store.
- It can be based on an ML Feature Repository Platform, such as AWS' SageMaker Feature Store.
- It can make use of an ML Feature Engineering Framework.
- It can support Feature Store Tasks, such as:
  - ML Feature Generation.
  - ML Feature Discovery.
  - ML Feature Computation, ML Feature Backfilling, and ML Feature Logging.
  - ML Feature Sharing and ML Feature Reuse.
  - Storing, querying, and managing ML Feature Data (ML Feature Offline Data, ML Feature Online Data).
  - Storing and managing ML Feature Versions, ML Feature Lineage, and ML Feature Metadata.
  - Ensuring consistency between ML Feature Training Data and ML Feature Serving Data.
  - Monitoring ML Feature Data Quality (data pipeline health monitoring).
  - Supporting ML Feature Generation Data Pipelines (data pipeline development, data pipeline automation, data pipeline monitoring, data pipeline alerting).
- …
Example(s):
- Airbnb's Feature Store (based on Zipline).
- Uber's Palette Feature Store (within Uber's Michelangelo Platform).
- Intuit's Feature Store (based on ...).
- PlayStation's PSN Engineering's Feature Store (based on ...).
- … one based on AWS' SageMaker Feature Store (SMFS) Platform.
- … one based on GCP Vertex AI Feature Store.
- …
Counter-Example(s):
- ML Feature Selection Platform, such as [1]
- ML Model Training Platform.
See: ML Feature Generation Best Practice, ML Platform, Materialized View, Customer-Centric Feature Space.
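
Illustration: the metadata-related tasks listed under Context (ML Feature Discovery, and managing ML Feature Versions, ML Feature Lineage, and ML Feature Metadata) can be pictured with a small in-memory registry. This is only a sketch under stated assumptions: the class names, field names, and sample features below are hypothetical and are not taken from any of the systems named on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    """Metadata record for one feature version (all field names are hypothetical)."""
    name: str
    version: int
    owner: str
    description: str
    source: str  # upstream dataset the feature is derived from (a minimal form of lineage)


class FeatureRegistry:
    """Toy registry: stores definitions by (name, version) and supports keyword search."""

    def __init__(self):
        self._definitions = {}  # (name, version) -> FeatureDefinition

    def register(self, definition):
        key = (definition.name, definition.version)
        if key in self._definitions:
            raise ValueError(f"{definition.name} v{definition.version} is already registered")
        self._definitions[key] = definition

    def latest(self, name):
        """Return the highest registered version of a feature."""
        versions = [d for (n, _), d in self._definitions.items() if n == name]
        if not versions:
            raise KeyError(name)
        return max(versions, key=lambda d: d.version)

    def search(self, keyword):
        """Discovery: for each feature whose name or description mentions the
        keyword, return its highest matching version, sorted by name."""
        keyword = keyword.lower()
        best = {}
        for d in self._definitions.values():
            if keyword in d.name.lower() or keyword in d.description.lower():
                if d.name not in best or d.version > best[d.name].version:
                    best[d.name] = d
        return sorted(best.values(), key=lambda d: d.name)


# Hypothetical sample features, for illustration only.
registry = FeatureRegistry()
registry.register(FeatureDefinition(
    "listing_availability_180d", 1, "pricing-team",
    "Share of the next 180 days a listing is available", "listing_calendar"))
registry.register(FeatureDefinition(
    "listing_availability_180d", 2, "pricing-team",
    "Share of the next 180 days a listing is available (blocked days excluded)",
    "listing_calendar"))
registry.register(FeatureDefinition(
    "host_response_rate", 1, "trust-team",
    "Fraction of guest messages a host answers", "message_log"))

print(registry.latest("listing_availability_180d").version)
print([d.name for d in registry.search("listing")])
```

A production system would typically persist these records and attach richer lineage; the sketch only shows why a version number and a searchable description are useful for sharing and reuse.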

References

2020

https://www.tecton.ai/blog/what-is-a-feature-store/
- QUOTE: Feature stores make it easy to:
  - Productionize new features without extensive engineering support
  - Automate feature computation, backfills, and logging
  - Share and reuse feature pipelines across teams
  - Track feature versions, lineage, and metadata
  - Achieve consistency between training and serving data
  - Monitor the health of feature pipelines in production
- Feature stores aim to solve the full set of data management problems encountered when building and operating operational ML applications.
- A feature store is an ML-specific data system that:
  - Runs data pipelines that transform raw data into feature values
  - Stores and manages the feature data itself, and
  - Serves feature data consistently for training and inference purposes
- …
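
Illustration: the three responsibilities in the quote above (run a pipeline that turns raw data into feature values, store the values, and serve them consistently for training and inference) can be sketched as follows. Everything here is an assumption made for the example: the feature `trips_last_7_days`, the event layout, the integer day numbers, and the `InMemoryFeatureStore` class are hypothetical, and the store assumes that days are materialized in increasing order.

```python
from collections import defaultdict


def trips_last_7_days(raw_events, as_of_day):
    """Pipeline step: count each rider's events in the window (as_of_day - 7, as_of_day]."""
    counts = defaultdict(int)
    for event in raw_events:
        if as_of_day - 7 < event["day"] <= as_of_day:
            counts[event["rider_id"]] += 1
    return dict(counts)


class InMemoryFeatureStore:
    """Toy store with an offline history and an online latest-value view."""

    def __init__(self):
        self.offline = {}  # (feature, entity) -> list of (day, value)
        self.online = {}   # (feature, entity) -> most recently materialized value

    def materialize(self, feature, values, day):
        """Write one pipeline run to both views (assumes increasing `day`)."""
        for entity, value in values.items():
            self.offline.setdefault((feature, entity), []).append((day, value))
            self.online[(feature, entity)] = value

    def get_online(self, feature, entity):
        """Inference path: latest value, or None if never materialized."""
        return self.online.get((feature, entity))

    def get_historical(self, feature, entity, as_of_day):
        """Training path: the value computed on or before as_of_day, so a
        training row never sees a value from its own future."""
        history = self.offline.get((feature, entity), [])
        candidates = [(d, v) for d, v in history if d <= as_of_day]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]


# Hypothetical raw events; days are plain integers for simplicity.
events = [
    {"rider_id": "r1", "day": 1},
    {"rider_id": "r1", "day": 3},
    {"rider_id": "r2", "day": 5},
    {"rider_id": "r1", "day": 9},
]

store = InMemoryFeatureStore()
for run_day in (7, 10):
    store.materialize("trips_last_7_days", trips_last_7_days(events, run_day), run_day)

print(store.get_online("trips_last_7_days", "r1"))         # value from the day-10 run
print(store.get_historical("trips_last_7_days", "r1", 8))  # value from the day-7 run
```

Because both read paths are fed by the same `materialize` call, training and inference see values produced by one transformation, which is the consistency property the quote describes.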

2017b

"Meet Michelangelo: Uber’s Machine Learning Platform." 2017-07-05
- QUOTE: We found great value in building a centralized Feature Store in which teams around Uber can create and manage canonical features to be used by their teams and shared with others. At a high level, it accomplishes two things:
  1. It allows users to easily add features they have built into a shared feature store, requiring only a small amount of extra metadata (owner, description, SLA, etc.) on top of what would be required for a feature generated for private, project-specific usage.
  2. Once features are in the Feature Store, they are very easy to consume, both online and offline, by referencing a feature’s simple canonical name in the model configuration. Equipped with this information, the system handles joining in the correct HDFS data sets for model training or batch prediction and fetching the right value from Cassandra for online predictions.
- At the moment, we have approximately 10,000 features in Feature Store that are used to accelerate machine learning projects, and teams across the company are adding new ones all the time. Features in the Feature Store are automatically calculated and updated daily. In the future, we intend to explore the possibility of building an automated system to search through Feature Store and identify the most useful and important features for solving a given prediction problem.
- Domain specific language for feature selection and transformation.
  Often the features generated by data pipelines or sent from a client service are not in the proper format for the model, and they may be missing values that need to be filled. Moreover, the model may only need a subset of features provided. In some cases, it may be more useful for the model to transform a timestamp into an hour-of-day or day-of-week to better capture seasonal patterns. In other cases, feature values may need to be normalized (e.g., subtract the mean and divide by standard deviation).
  To address these issues, we created a DSL (domain specific language) that modelers use to select, transform, and combine the features that are sent to the model at training and prediction times. The DSL is implemented as sub-set of Scala.
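
Illustration: the transformations mentioned in the quote above (filling missing values, deriving hour-of-day or day-of-week from a timestamp, and normalizing by mean and standard deviation) can be sketched in a few lines. This is not Uber's DSL, which the quote says is a subset of Scala; plain Python is swapped in here, the function names are hypothetical, timestamps are assumed to be UTC epoch seconds, and the population standard deviation is an assumption because the quote does not say which variant is used.

```python
from datetime import datetime, timezone
from statistics import mean, pstdev


def hour_of_day(epoch_seconds):
    """Hour (0-23) of a UTC epoch timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).hour


def day_of_week(epoch_seconds):
    """Day of week of a UTC epoch timestamp, Monday=0 through Sunday=6."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).weekday()


def fill_missing(values, default):
    """Replace None entries with a default value."""
    return [default if v is None else v for v in values]


def z_score(values):
    """Subtract the mean and divide by the population standard deviation.
    A constant column maps to all zeros to avoid division by zero."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical inputs, for illustration only.
ts = datetime(2017, 7, 5, 14, 30, tzinfo=timezone.utc).timestamp()
print(hour_of_day(ts), day_of_week(ts))

raw_fares = [2.0, None, 6.0]
filled = fill_missing(raw_fares, 4.0)
print(z_score(filled))
```

Applying the same functions at training time and at prediction time is what keeps the model's inputs consistent across the two settings.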

2017a

"Using Machine Learning to Predict Value of Homes On Airbnb." 2017-07-17
- QUOTE: One of the first steps of any supervised machine learning project is to define relevant features that are correlated with the chosen outcome variable, a process called feature engineering. For example, in predicting LTV, one might compute the percentage of the next 180 calendar dates that a listing is available or a listing’s price relative to comparable listings in the same market.
  At Airbnb, feature engineering often means writing Hive queries to create features from scratch. However, this work is tedious and time consuming as it requires specific domain knowledge and business logic, which means the feature pipelines are often not easily sharable or even reusable. To make this work more scalable, we developed Zipline — a training feature repository that provides features at different levels of granularity, such as at the host, guest, listing, or market level.
  The crowdsourced nature of this internal tool allows data scientists to use a wide variety of high quality, vetted features that others have prepared for past projects. If a desired feature is not available, a user can create her own feature with a feature configuration file like the following:

2015

http://feature-forge.readthedocs.io/en/latest/feature_definition.html
- QUOTE: ...