The Full Machine Learning Pipeline
From raw bytes to deployed model, every ML system follows the same ten-stage pipeline. Master it and you can read any architecture paper.
Lesson map
1. Ten Stages, One Pipeline
2. The ML pipeline
3. Preprocessing
4. Training
Section 1
Ten Stages, One Pipeline
Nearly every production ML system, from spam filters to LLMs, flows through the same skeleton. Knowing the skeleton lets you place any new paper or product into context quickly.
The stages
1. Data collection
2. Data cleaning and labeling
3. Feature engineering or tokenization
4. Train/validation/test split
5. Model architecture selection
6. Training with a loss function and optimizer
7. Evaluation on held-out data
8. Fine-tuning or post-training alignment
9. Deployment to inference infrastructure
10. Monitoring, feedback collection, and retraining
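The first seven stages can be run end to end in a few lines. The following toy sketch uses scikit-learn and synthetic data, neither of which appears in the lesson's legal example; it exists only to make the skeleton concrete.

```python
# Toy run of pipeline stages 4-7. The dataset and model choices here are
# illustrative assumptions, not the lesson's actual use case.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stages 1-3 stand-in: synthetic, already-clean features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Stage 4: split off held-out data (train/test only, for brevity).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stages 5-6: pick an architecture and train it; the loss and optimizer
# are hidden inside .fit() for this model family.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Stage 7: evaluate on data the model never saw.
print(f"test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```

Every later stage (fine-tuning, deployment, monitoring) wraps around this same core loop.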
Where time actually goes
Newcomers assume training is the main event. In practice, data prep and evaluation consume 70 to 90 percent of engineering time. Training is usually a well-defined, scriptable step. Cleaning messy real-world data is not.
| Stage | Typical time share |
|---|---|
| Data collection and cleaning | 30-50% |
| Feature/token engineering | 10-20% |
| Model training | 5-15% |
| Evaluation and iteration | 20-30% |
| Deployment and monitoring | 10-20% |
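The cleaning work that dominates that table is mundane but unavoidable: deduplication, missing values, inconsistent labels. A hypothetical sketch with pandas (the toy rows and labels are invented for illustration):

```python
# Hypothetical taste of the data-cleaning stage: drop missing text,
# normalize inconsistent label casing, and remove exact duplicates.
import pandas as pd

raw = pd.DataFrame({
    "text": ["spam offer!!", "hi team", "spam offer!!", None],
    "label": ["SPAM", "ham", "spam", "ham"],
})

clean = (
    raw.dropna(subset=["text"])                           # missing text
       .assign(label=lambda d: d["label"].str.lower())    # "SPAM" -> "spam"
       .drop_duplicates(subset=["text", "label"])         # exact dupes
)
print(clean)
```

Real corpora add encoding errors, near-duplicates, and ambiguous labels on top of this, which is why the stage eats so much of the schedule.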
Concrete example: fine-tuning for a legal use case
A minimal fine-tuning loop using the Hugging Face stack.
```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Stages 1-2 stand-in: an already-collected, already-labeled corpus.
dataset = load_dataset("legal_memos", split="train")

tok = AutoTokenizer.from_pretrained("base-model")
model = AutoModelForCausalLM.from_pretrained("base-model")

# Stage 3: tokenize the raw text.
def preprocess(ex):
    return tok(ex["text"], truncation=True, max_length=2048)

ds = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="./legal-tuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-5,
)

# The collator copies input ids into labels, which causal-LM training needs.
collator = DataCollatorForLanguageModeling(tok, mlm=False)

trainer = Trainer(
    model=model, args=args, train_dataset=ds, data_collator=collator
)
trainer.train()
```

Train vs. inference infrastructure
- Training: massive parallel GPUs, weeks of runtime, measured in FLOPs
- Inference: latency-sensitive, often needs quantization or distillation
- Fine-tuning: smaller GPUs, hours to days, often with LoRA adapters
- Monitoring: logging, drift detection, A/B tests in production
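The quantization mentioned above can be sketched without any ML framework: store weights as int8 plus one scale factor, and reconstruct approximate floats at serving time. A minimal symmetric per-tensor version, with invented numbers, assuming only NumPy:

```python
# Sketch of post-training weight quantization: float32 -> int8 + scale,
# trading a small reconstruction error for ~4x less memory at inference.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in model weights

scale = np.abs(w).max() / 127.0               # one scale for the whole tensor
w_int8 = np.round(w / scale).astype(np.int8)  # the stored/served form
w_deq = w_int8.astype(np.float32) * scale     # reconstructed at inference

err = np.abs(w - w_deq).max()
print(f"max reconstruction error: {err:.4f} (scale={scale:.4f})")
```

Production systems refine this with per-channel scales, activation quantization, and calibration data, but the memory-for-precision trade is the same.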
Common pipeline failures
- Train/serve skew: features built differently in training vs. production
- Data leakage: test data accidentally appears in training
- Distribution drift: real inputs change over time, model decays
- Unmonitored bias: groups see worse outcomes and nobody notices
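The leakage failure above is easy to reproduce. In this sketch (synthetic data, scikit-learn assumed), a scaler fitted on the full dataset absorbs test-set statistics; fitting on the train split alone avoids it:

```python
# Data leakage in miniature: preprocessing fitted on all rows has "seen"
# the test set. Fit transforms on the train split, then apply to test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=5.0, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Wrong: test rows leak into the normalization statistics.
leaky = StandardScaler().fit(X)

# Right: statistics come from the training split alone...
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)  # ...then get applied to test data.

print("means differ:", not np.allclose(leaky.mean_, scaler.mean_))
```

The same rule covers every fitted preprocessing step: vocabulary building, imputation, feature selection, all of it belongs inside the training split.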
“Production ML is 5 percent machine learning and 95 percent engineering.”
The big idea: the ML pipeline is the real substrate of AI products. Papers describe stages 5 and 6. Careers are built on stages 1, 2, 7, 9, and 10.
