The Full Machine Learning Pipeline
From raw bytes to deployed model, every ML system follows the same ten-stage pipeline. Master it and you can read any architecture paper.
Lesson map
1. Ten Stages, One Pipeline
2. The ML pipeline
3. Preprocessing
4. Training
Section 1
Ten Stages, One Pipeline
Nearly every production ML system, from spam filters to LLMs, flows through the same skeleton. Knowing the skeleton lets you place any new paper or product into context quickly.
The stages
1. Data collection
2. Data cleaning and labeling
3. Feature engineering or tokenization
4. Train/validation/test split
5. Model architecture selection
6. Training with a loss function and optimizer
7. Evaluation on held-out data
8. Fine-tuning or post-training alignment
9. Deployment to inference infrastructure
10. Monitoring, feedback collection, and retraining
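The first seven stages can be run end to end in a few lines. The following toy sketch uses scikit-learn and synthetic data, neither of which appears in the lesson's legal example; it exists only to make the skeleton concrete.

```python
# Toy run of pipeline stages 4-7. The dataset and model choices here are
# illustrative assumptions, not the lesson's actual use case.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stages 1-3 stand-in: synthetic, already-clean features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Stage 4: split off held-out data (train/test only, for brevity).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stages 5-6: pick an architecture and train it; the loss and optimizer
# are hidden inside .fit() for this model family.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Stage 7: evaluate on data the model never saw.
print(f"test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```

Every later stage (fine-tuning, deployment, monitoring) wraps around this same core loop.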
Where time actually goes
Newcomers assume training is the main event. In practice, data prep and evaluation consume 70 to 90 percent of engineering time. Training is usually a well-defined, scriptable step. Cleaning messy real-world data is not.
| Stage | Typical time share |
|---|---|
| Data collection and cleaning | 30-50% |
| Feature/token engineering | 10-20% |
| Model training | 5-15% |
| Evaluation and iteration | 20-30% |
| Deployment and monitoring | 10-20% |
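The cleaning work that dominates that table is mundane but unavoidable: deduplication, missing values, inconsistent labels. A hypothetical sketch with pandas (the toy rows and labels are invented for illustration):

```python
# Hypothetical taste of the data-cleaning stage: drop missing text,
# normalize inconsistent label casing, and remove exact duplicates.
import pandas as pd

raw = pd.DataFrame({
    "text": ["spam offer!!", "hi team", "spam offer!!", None],
    "label": ["SPAM", "ham", "spam", "ham"],
})

clean = (
    raw.dropna(subset=["text"])                           # missing text
       .assign(label=lambda d: d["label"].str.lower())    # "SPAM" -> "spam"
       .drop_duplicates(subset=["text", "label"])         # exact dupes
)
print(clean)
```

Real corpora add encoding errors, near-duplicates, and ambiguous labels on top of this, which is why the stage eats so much of the schedule.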
Concrete example: fine-tuning for a legal use case
A minimal fine-tuning loop using the Hugging Face stack.
```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Stages 1-2 stand-in: an already-collected, already-labeled corpus.
dataset = load_dataset("legal_memos", split="train")

tok = AutoTokenizer.from_pretrained("base-model")
model = AutoModelForCausalLM.from_pretrained("base-model")

# Stage 3: tokenize the raw text.
def preprocess(ex):
    return tok(ex["text"], truncation=True, max_length=2048)

ds = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="./legal-tuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-5,
)

# The collator copies input ids into labels, which causal-LM training needs.
collator = DataCollatorForLanguageModeling(tok, mlm=False)

trainer = Trainer(
    model=model, args=args, train_dataset=ds, data_collator=collator
)
trainer.train()
```

Train vs. inference infrastructure
- Training: massive parallel GPUs, weeks of runtime, measured in FLOPs
- Inference: latency-sensitive, often needs quantization or distillation
- Fine-tuning: smaller GPUs, hours to days, often with LoRA adapters
- Monitoring: logging, drift detection, A/B tests in production
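The quantization mentioned above can be sketched without any ML framework: store weights as int8 plus one scale factor, and reconstruct approximate floats at serving time. A minimal symmetric per-tensor version, with invented numbers, assuming only NumPy:

```python
# Sketch of post-training weight quantization: float32 -> int8 + scale,
# trading a small reconstruction error for ~4x less memory at inference.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in model weights

scale = np.abs(w).max() / 127.0               # one scale for the whole tensor
w_int8 = np.round(w / scale).astype(np.int8)  # the stored/served form
w_deq = w_int8.astype(np.float32) * scale     # reconstructed at inference

err = np.abs(w - w_deq).max()
print(f"max reconstruction error: {err:.4f} (scale={scale:.4f})")
```

Production systems refine this with per-channel scales, activation quantization, and calibration data, but the memory-for-precision trade is the same.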
Common pipeline failures
- Train/serve skew: features built differently in training vs. production
- Data leakage: test data accidentally appears in training
- Distribution drift: real inputs change over time, model decays
- Unmonitored bias: groups see worse outcomes and nobody notices
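The leakage failure above is easy to reproduce. In this sketch (synthetic data, scikit-learn assumed), a scaler fitted on the full dataset absorbs test-set statistics; fitting on the train split alone avoids it:

```python
# Data leakage in miniature: preprocessing fitted on all rows has "seen"
# the test set. Fit transforms on the train split, then apply to test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=5.0, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Wrong: test rows leak into the normalization statistics.
leaky = StandardScaler().fit(X)

# Right: statistics come from the training split alone...
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)  # ...then get applied to test data.

print("means differ:", not np.allclose(leaky.mean_, scaler.mean_))
```

The same rule covers every fitted preprocessing step: vocabulary building, imputation, feature selection, all of it belongs inside the training split.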
“Production ML is 5 percent machine learning and 95 percent engineering.”
The big idea: the ML pipeline is the real substrate of AI products. Papers describe stages 5 and 6. Careers are built on stages 1, 2, 7, 9, and 10.
