The 5 Key Steps To Building Out An ML Feature Store
Whether it’s ease of access, auditability or cost benefits, it’s clear that an ML feature store is the right fit for your business.
The next step, then, is to actually build that feature store out.
[If you haven’t made a decision, use our free eBook to help guide you!]
- But how does one go about doing it?
- What are the necessary steps to take?
- And in what order should you take them?
Let’s First Start With A Few Considerations
Do You Have Any Existing Feature Stores?
One of the first considerations to make before building an ML feature store is to identify whether a candidate feature store already exists, or, at a minimum, whether you have data that is centralised and readily available for use.
What Custom Features Might Exist In Your Dev Environments?
Models might exist in local “dev” environments, such as notebooks on Data Scientists’ individual machines.
This is important because those models often have custom features feeding into them, which might be useful for other models to consume.
5 Steps To Building An ML Feature Store
Step 1: Catalogue Data
The first step is to catalogue, and create easy access to, the datasets that you plan to work with.
This will help you:
- Plan for data orchestration.
- Identify any access gaps to data.
- Find potential gaps in platform architecture.
What to do:
- Catalogue datasets that you might require for features, as well as their location.
- Ensure that the ML feature store environment has access/connectivity to your source data.
- Catalogue fields/information that will, or might, be required to build features.
- Best case scenario (wishful thinking): ingest all of your source data into a centralised location, for ease of access downstream. Bear in mind that this approach replicates data, so it is not the ideal solution for large (enterprise) datasets.
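As a minimal sketch of what such a catalogue could look like in practice (the dataset names, fields and schema here are illustrative assumptions, not prescriptive), a simple structured registry is enough to surface access gaps early:

```python
from dataclasses import dataclass

@dataclass
class DatasetEntry:
    """One catalogued source dataset (illustrative schema)."""
    name: str
    location: str                   # e.g. a table name, bucket path, or JDBC URI
    fields: list                    # fields that will, or might, feed features
    access_verified: bool = False   # can the feature store environment reach it?

catalogue = [
    DatasetEntry("transactions", "warehouse.sales.transactions",
                 ["customer_id", "amount", "timestamp"], access_verified=True),
    DatasetEntry("customers", "warehouse.crm.customers",
                 ["customer_id", "signup_date", "region"]),
]

# Surface connectivity gaps before any feature work begins.
gaps = [d.name for d in catalogue if not d.access_verified]
print("Datasets with access gaps:", gaps)
```

In a real platform this registry would live in a data catalogue tool rather than in code, but even a lightweight version like this answers the questions above: what data exists, where it lives, and whether the feature store can reach it.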
Step 2: Scope Features
Your next step includes scoping and identifying any (and all) existing features. These might be features used by a model that currently exists, or features that you think would be useful to have.
This will help you:
- Understand what features are already available.
- Think about how to group features together (logical feature groups).
- Plan for how new features will join existing feature groups.
What to do:
- Take note of any features that can be of direct use from your existing candidate feature store.
- Make sure that you’re keeping track of any features that might already be used in development models that now need to be “productionised”.
- Explore the datasets that you have available and prototype new features that could be useful for machine learning.
Considerations to make:
- Do we actually need each of those features?
- Are there any duplicates?
- Can we build new models now that we potentially have access to more data than in our localised “development” environments?
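One concrete scoping task the considerations above point at is duplicate detection: the same feature often exists under slightly different names across notebooks and the candidate store. A hedged sketch (feature names and sources are made up for illustration):

```python
from collections import defaultdict

# Candidate features collected from dev notebooks and the existing
# candidate store (all names here are illustrative).
candidates = [
    {"name": "avg_spend_30d", "source": "existing_store"},
    {"name": "AvgSpend30d", "source": "alice_notebook"},
    {"name": "days_since_signup", "source": "bob_notebook"},
]

def normalise(name):
    """Fold case and separators so near-identical names collide."""
    return name.replace("_", "").replace("-", "").lower()

groups = defaultdict(list)
for feat in candidates:
    groups[normalise(feat["name"])].append(feat["source"])

# Any normalised name defined in more than one place is a candidate duplicate.
duplicates = {k: v for k, v in groups.items() if len(v) > 1}
print(duplicates)
```

Name matching only catches the obvious cases; features with different names but identical logic still need a human (or a comparison of computed values) to spot.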
Step 3: Build Feature Logic
Feature logic refers to the algorithms or code that will determine how a specific feature is computed from the data.
In this step, you will take the features that you defined in Step 2 and write the feature logic to compute them. You will also need to consider how the features will be stored, whether they need to be materialised or if they can simply be computed on demand.
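The materialised versus on-demand trade-off can be sketched in a few lines (the cache and feature logic here are illustrative stand-ins, not a real store implementation):

```python
feature_cache = {}  # stands in for a materialised feature store table

def compute_feature(customer_id):
    """Stand-in for an expensive feature computation (illustrative)."""
    return len(customer_id) * 10

def get_feature(customer_id, materialised=True):
    if materialised:
        # Materialised: computed once, persisted, then served at low latency.
        if customer_id not in feature_cache:
            feature_cache[customer_id] = compute_feature(customer_id)
        return feature_cache[customer_id]
    # On-demand: recomputed per request; always fresh, but adds latency
    # and compute cost to every lookup.
    return compute_feature(customer_id)
```

Roughly: materialise features that are expensive to compute and read often; compute on demand when freshness matters more than latency or when storage would be wasteful.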
What to do:
- Implement transformations to calculate features. We recommend using scalable big data frameworks, like Apache Spark or Apache Beam.
- Match infrastructure/service choices to business requirements: offline (training/batch) and online (realtime prediction/serving) modes. This determines the required speed of feature creation, and hence the technology selection.
- Save feature metadata on execution.
- Make sure to consider/integrate versioning.
- Consider parameters that influence cost.
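To make the list above concrete, here is a minimal pure-Python sketch of feature logic that also saves metadata (version, timestamp, input fingerprint) on execution. In practice this logic would typically live in a Spark or Beam job, and the feature name, version scheme and metadata fields are all illustrative assumptions:

```python
import hashlib
import json
from datetime import datetime, timezone

FEATURE_VERSION = "1.0.0"  # bump whenever the logic below changes

def avg_spend(rows):
    """Feature logic: average transaction amount per customer (illustrative)."""
    totals, counts = {}, {}
    for r in rows:
        cid = r["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + r["amount"]
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: totals[cid] / counts[cid] for cid in totals}

def run_with_metadata(rows):
    values = avg_spend(rows)
    # Save feature metadata on execution: version, timestamp, input fingerprint.
    metadata = {
        "feature": "avg_spend",
        "version": FEATURE_VERSION,
        "computed_at": datetime.now(timezone.utc).isoformat(),
        "input_hash": hashlib.sha256(
            json.dumps(rows, sort_keys=True).encode()).hexdigest(),
        "row_count": len(rows),
    }
    return values, metadata

rows = [{"customer_id": "c1", "amount": 10.0},
        {"customer_id": "c1", "amount": 20.0},
        {"customer_id": "c2", "amount": 5.0}]
values, metadata = run_with_metadata(rows)
print(values)  # {'c1': 15.0, 'c2': 5.0}
```

Recording an input fingerprint alongside the version is what later makes data drift detectable: if the same logic version starts producing very different value distributions, the inputs have shifted.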
Why It Matters:
- Makes features available for model consumption.
- Ensures the latency of feature availability matches business requirements
- Enables scalable solutions both in compute and development effort.
- Allows you to detect data drift, which could help prevent degradation of model performance and/or unforeseen biases.
Step 4: Orchestration
Orchestration is the process of managing and scheduling the feature creation process. This includes both the management of resources required to create the features, as well as the dependencies between different features.
This step can be done manually or by automation. Automation will likely require some customisation depending on your specific use case and platform architecture.
Why it matters:
- Automates the deployment of features.
- Optimises workflows for speed and efficiency.
- Builds in reliable execution and robustness.
How to do it:
- Orchestrate various stages of transformations to run in parallel, or sequentially, as required (particularly for batch processes).
- Set timing intervals or define trigger actions that would kick off batch jobs.
- Isolate and provision infrastructure for features that would need to run in realtime (online features).
- Build in checks and exception handling.
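The dependency-ordering part of orchestration can be sketched with the standard library alone (the stage names are illustrative; a real pipeline would use a scheduler such as Airflow or a cloud-native equivalent, with actual jobs where the placeholder sits):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Dependencies between transformation stages (names are illustrative):
# each stage runs only after every stage it depends on has completed.
dag = {
    "clean_transactions": {"raw_ingest"},
    "avg_spend": {"clean_transactions"},
    "days_since_signup": {"raw_ingest"},
    "write_feature_store": {"avg_spend", "days_since_signup"},
}

ts = TopologicalSorter(dag)
ts.prepare()
order = []
while ts.is_active():
    ready = ts.get_ready()   # all stages whose dependencies are met
    for stage in ready:      # these could be dispatched in parallel
        # Placeholder for the actual batch job; in a real pipeline, wrap
        # it in checks and exception handling so that a failed stage
        # stops its dependents cleanly rather than producing bad features.
        order.append(stage)
        ts.done(stage)
print(order)
```

Each `get_ready()` batch contains stages that could run in parallel, which is exactly the "parallel or sequential, as required" behaviour described above.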
Step 5: Model Integration And Testing
The final step is to integrate the features that you create into your machine learning models.
This will involve testing to ensure that the features are of high enough quality and that they do not introduce any bias into the model.
What to do:
- Perform sensitivity analysis on your inputs/features.
- Feature feedback: keep ones that work, remove those that do not.
- Feature exploration: generate more features (potentially automatically) or combine/consolidate existing ones.
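A simple form of the sensitivity analysis mentioned above is a permutation-style check: shuffle one feature's values across rows and see how much the model's predictions move. The stand-in model and feature names below are illustrative assumptions:

```python
import random

def model(features):
    """Stand-in model (illustrative): spend matters, noise does not."""
    return 2.0 * features["avg_spend"] + 0.0 * features["noise"]

def sensitivity(feature_rows, name, trials=100, seed=0):
    """Average prediction shift when one feature's values are shuffled."""
    rng = random.Random(seed)
    baseline = [model(r) for r in feature_rows]
    deltas = []
    for _ in range(trials):
        shuffled = [r[name] for r in feature_rows]
        rng.shuffle(shuffled)
        perturbed = [dict(r, **{name: v})
                     for r, v in zip(feature_rows, shuffled)]
        preds = [model(r) for r in perturbed]
        deltas.append(sum(abs(a - b) for a, b in zip(preds, baseline)))
    return sum(deltas) / trials

rows = [{"avg_spend": s, "noise": n}
        for s, n in zip([10, 20, 30, 40], [1, 2, 3, 4])]
# A feature the model ignores should show near-zero sensitivity.
print(sensitivity(rows, "avg_spend"), sensitivity(rows, "noise"))
```

Features with near-zero sensitivity are candidates for removal in the feature-feedback loop; libraries such as scikit-learn offer a production-grade version of this idea as permutation importance.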
Why it matters:
- Uses the model to learn about the problem and improve or generate more features (feedback loop).
- Optimises for model performance (predictability, efficiency, cost etc.).
- Tests full workflow from features through to model predictions.
Looking For More On ML?
We can help you with that!
Our blog has plenty of content that you can explore, and we have an awesome free eBook on Feature Stores.