For a data scientist or ML engineer, machine learning system design is one of the most essential skills to learn. It’s the bridge between building models and deploying solutions that drive actual business outcomes.
The ability to turn ML ideas into production systems that save money, boost revenue, and create measurable value determines your long-term career growth and your salary.
I’ve built machine learning systems that have saved companies over $1.5 million per year, and these same skills have helped me land job offers exceeding $100,000.
In this guide, I’ll break down how I think about ML system design so you can do the same.
General Framework
Below is my framework on how to approach designing a machine learning system:
Note: This is the most common design type for an applied machine learning system in an established tech company. There are other, more nuanced cases, like infrastructure design and AI/ML research experiment design.
If you want a PDF copy of this template, you can get access using this link:
https://framework.egorhowell.com
Let’s break down these steps in a bit more detail.
Business Problem
The goal of this step is to:
- Clarify objectives — What is the business or user problem you’re trying to solve, and how do you translate it into a machine learning problem?
- Define metrics — Which metrics are we targeting (accuracy, F1-score, ROC-AUC, precision/recall, RMSE, etc.), and how do they translate to business performance?
- Constraints and scope — How much compute is available? Do we want real-time predictions or batch inference? Do we even need machine learning?
- High-level design — What will the rough architecture look like from data to inference?
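One way to connect a model metric to business performance is to put a price on each cell of the confusion matrix. The sketch below assumes a hypothetical fraud-style classifier; the counts, the per-outcome values, and the function names are illustrative assumptions, not figures from a real system.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def precision(c: ConfusionCounts) -> float:
    """Share of positive predictions that were correct."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of actual positives that the model caught."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def business_value(c: ConfusionCounts, value_per_tp: float,
                   cost_per_fp: float, cost_per_fn: float) -> float:
    """Net value of the predictions under assumed per-outcome prices."""
    return c.tp * value_per_tp - c.fp * cost_per_fp - c.fn * cost_per_fn


# Hypothetical evaluation counts and prices (assumptions for illustration).
counts = ConfusionCounts(tp=80, fp=20, fn=40, tn=860)
net_value = business_value(counts, value_per_tp=50.0, cost_per_fp=5.0, cost_per_fn=50.0)
print(precision(counts), recall(counts), net_value)
```

Framing the metric this way makes the trade-off explicit: if a missed positive costs ten times more than a false alarm, recall probably deserves more weight than precision.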
Data
This is all about gathering and acquiring data:
- Identify data sources — Databases, APIs, logs, or user-generated data.
- Identify target variable — What is the target variable and how do we get it?
- Quality control — What state is the data in? Are there any legal issues with using the data?
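A first pass at quality control can be as simple as counting missing values and duplicate records. This is a minimal sketch in plain Python; the `quality_report` helper and the example rows are hypothetical, and a real pipeline would more likely use a dataframe library or a dedicated validation tool.

```python
def quality_report(rows, required_fields):
    """Summarise missing-value rates per field and count exact duplicate rows."""
    n = len(rows)
    missing_rate = {
        field: sum(1 for r in rows if r.get(field) in (None, "")) / n
        for field in required_fields
    }
    seen = set()
    duplicates = 0
    for r in rows:
        key = tuple(sorted(r.items()))  # order-independent fingerprint of the row
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": n, "missing_rate": missing_rate, "duplicates": duplicates}


# Hypothetical raw records with one missing label, missing ages, and a duplicate.
rows = [
    {"user_id": 1, "age": 34, "label": 0},
    {"user_id": 2, "age": None, "label": 1},
    {"user_id": 2, "age": None, "label": 1},
    {"user_id": 3, "age": 51, "label": None},
]
report = quality_report(rows, required_fields=["user_id", "age", "label"])
print(report)
```

A missing target variable is usually more serious than a missing feature, so it is worth reporting the two separately before deciding how to clean the data.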
Feature Engineering
Create novel features from the data to tackle the specific problem:
- Feature importance — Understanding what features are likely to drive the target variable.
- Data cleaning — Handle missing values, outliers, and inconsistent entries.
- Feature representation — One-hot encoding, target encoding, embeddings, and scaling the data.
- Sampling and splits — Account for imbalanced datasets, guard against data leakage, and split correctly into training and testing datasets.
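A common source of data leakage is computing preprocessing statistics on the full dataset before splitting. The sketch below splits first and fits a standard scaler on the training portion only; the helper names and the toy feature are assumptions for illustration.

```python
import random
from statistics import mean, pstdev


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    """Learn the mean and standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for a constant feature
    return mu, sigma


def transform(values, mu, sigma):
    """Standardise values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


feature = [float(x) for x in range(100)]  # toy feature column
train, test = train_test_split(feature, test_fraction=0.2, seed=42)

mu, sigma = fit_scaler(train)             # statistics come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)  # test reuses the train statistics
```

A random shuffle is only appropriate when rows are independent. For time-dependent data, splitting by date so that the test set lies strictly after the training set is the usual safer choice.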
Model Design & Selection
This is where you showcase your theoretical knowledge of machine learning models:
- Benchmark — Start with a simple “stupid” model or heuristic and then slowly build complexity.
- Training — Cross-validation, hyperparameter tuning, early stopping.
- Tradeoffs — Consider tradeoffs like training speed, inference speed, latency, and interpretability.
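The simplest benchmark for a classifier is to always predict the most common class. Any real model has to beat this number to justify its complexity. A minimal sketch, with made-up labels as the only data:

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced training labels and a small held-out set.
train_labels = [0] * 90 + [1] * 10
test_features = [[0.1], [0.9], [0.4], [0.7], [0.2]]
test_labels = [0, 1, 0, 0, 0]

baseline = majority_class_baseline(train_labels)
baseline_preds = [baseline(x) for x in test_features]
baseline_accuracy = accuracy(test_labels, baseline_preds)
print(baseline_accuracy)  # 0.8
```

Note that this baseline reaches 80% accuracy while never detecting the positive class, which is why accuracy alone is a weak metric on imbalanced data.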
Service & Deployment
This step is about understanding the best way to serve and deploy the model in production:
- Infrastructure — Choose cloud/on-prem, set up CI/CD pipelines, and ensure scalability.
- Service — API endpoint, edge model, batch predictions vs online predictions.
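The difference between batch and online predictions is mostly about when the scoring happens, not how. The sketch below reuses one scoring function in both modes; the logistic model, its weights, and the user IDs are hypothetical placeholders rather than a real serving stack.

```python
import math


def score(features, weights, bias):
    """Hypothetical logistic model: probability for a single feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


WEIGHTS, BIAS = [0.5, -0.25], 0.0  # assumed, already-trained parameters


def batch_predict(user_features):
    """Batch: score every known user ahead of time and store the results."""
    return {uid: score(f, WEIGHTS, BIAS) for uid, f in user_features.items()}


def online_predict(features):
    """Online: score a single request at call time, e.g. behind an API endpoint."""
    return score(features, WEIGHTS, BIAS)


users = {"u1": [0.0, 0.0], "u2": [2.0, 0.0]}
precomputed = batch_predict(users)           # e.g. run nightly by a scheduler
fresh = online_predict([2.0, 0.0])           # e.g. computed per request
print(precomputed["u1"], precomputed["u2"], fresh)
```

Batch predictions trade freshness for cheap lookups, while online predictions handle unseen inputs at the cost of a latency budget on every request.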
Evaluation & Monitoring
The last part is setting up systems and frameworks to track your model in the production environment.
- Metrics — What metrics to track with the “online” model vs “offline” model.
- Monitoring — Set up a dashboard, monitoring notebook, or Slack alerts.
- Experiment — Design an A/B experiment.
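For an A/B experiment on a conversion-style metric, one standard analysis is a two-proportion z-test. The sketch below is a minimal version using only the standard library; the conversion counts are invented for illustration, and a real experiment would also need a pre-computed sample size and a fixed stopping rule.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical results: control model (A) vs new model (B).
z, p_value = two_proportion_z_test(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The pooled standard error assumes both groups share one conversion rate under the null hypothesis, which is the usual setup when testing whether the new model made any difference at all.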
What To Learn?
Let me tell you a secret: machine learning system design is not an entry-level interview or skill set.
It is typically tested at mid-level and above.
By that time, you will have solid knowledge across machine learning and software engineering, and will likely be developing a specialism.
Nevertheless, if you want a broad, but by no means exhaustive, list, this is what you need to learn.
Machine Learning Theory
- Supervised learning — Classification (logistic regression, support vector machines, decision trees), regression (linear regression, decision trees, gradient boosted trees).
- Unsupervised learning — Clustering (k-means, DBSCAN), dimensionality reduction, latent semantic analysis.
- Deep learning — Neural networks, convolutional neural networks and recurrent neural networks.
- Loss functions and evaluation metrics — Accuracy, F1-score, NDCG, precision/recall, RMSE, etc.
- Feature selection and model tuning — How to identify essential features (correlation analysis, recursive feature elimination, regularisation), plus cross-validation and hyperparameter tuning.
- Statistics — Bayesian statistics, hypothesis testing and A/B tests.
- Specialisms — Time series, computer vision, operations research, recommendation systems, natural language processing, etc. You only need 1–2.
System Design & Engineering
- Cloud — The main one is AWS, and you should know S3, EC2, Lambda functions, and ECS. Most services are simply wrappers around storage and compute anyway.
- Containerization — Docker and Kubernetes.
- System design — Caching, networking, quantisation, APIs and storage.
- Version control, CI/CD and tracking — git, CircleCI, Jenkins, MLflow, Weights and Biases, Datadog.
- Deployment and orchestration frameworks — Argo, Metaflow, Databricks, Airflow and Kubeflow.
Resources
ML System Design Interviews
I plan to release a more detailed video on the machine learning system design interview process later, but for now, I’d like to provide you with a high-level overview along with some tips to help you prepare.
Machine learning system design interviews are typically aimed at mid-level and senior machine learning engineers. In these interviews, you’ll usually be presented with a broad, open-ended problem like designing a recommender system or a spam filter.
If your role involves a particular specialisation, such as computer vision, the interview question will often focus on that specific domain.
One of the biggest challenges with machine learning system design interviews is their lack of standardisation. Unlike software engineering interviews, which follow a relatively consistent format, ML design interviews vary widely in structure. There’s also a lot to cover: countless concepts, trade-offs, and potential solution paths.
That said, most hiring managers tend to evaluate candidates on a few key dimensions:
- Problem translation — Can you take a business problem and frame it as a machine learning solution?
- Decision-making — Do you recognise trade-offs and justify your design choices logically?
- Breadth and depth — Do you demonstrate a solid understanding of ML theory, a variety of models, and how to apply them effectively in real-world scenarios?
How To Prepare For Interviews
In terms of preparation, there is one key thing I recommend.
Work through past problems.
Here are some resources to find such problems:
I also recommend checking out large tech companies’ blog posts to learn more about how machine learning algorithms are deployed at scale:
Earlier, I discussed how system design interviews test more than just your modelling skills.
But what are the underlying fundamentals they’re really testing for?
That’s precisely what I cover in one of my previous articles, which will walk you through everything you need to know, along with the best resources.
The Ultimate AI/ML Roadmap For Beginners
Another Thing!
I offer 1:1 coaching calls where we can chat about whatever you need — whether it’s projects, career advice, or just figuring out your next step. I’m here to help you move forward!
1:1 Mentoring Call with Egor Howell
Career guidance, job advice, project help, resume review (topmate.io)