Small Language Models on Device: Phi, Gemma, Llama 3.2 in Production

When a 3B-7B model on-device wins over an API call to a frontier model.


Section 1

The premise

Small models run on-device with no per-request API fee, low latency, and no network dependency, but they are only enough for narrow, well-scoped tasks.

What small models do well here

Run private text classification offline on user devices
Provide instant autocomplete with no network round-trip
Cut per-request API cost to zero for high-volume, low-stakes tasks (the device still pays in compute, memory, and battery)
Help meet strict data-residency requirements by keeping data on the device

Workload triage

Route by task: classification/extraction → SLM. Reasoning/generation → frontier API. Measure quality on your eval set in both lanes before committing.
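
The triage rule above can be sketched as a small static router. This is a minimal illustration, not a production design: the task labels, the lane names `slm` and `frontier`, and the `handlers` mapping are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical task labels; replace them with the categories in your workload.
SLM_TASKS = {"classification", "extraction"}
FRONTIER_TASKS = {"reasoning", "generation"}


@dataclass(frozen=True)
class Route:
    lane: str    # "slm" or "frontier"
    reason: str  # short explanation, useful in logs


def route(task_type: str) -> Route:
    """Pick a lane from the task type alone (static triage)."""
    if task_type in SLM_TASKS:
        return Route("slm", f"{task_type} is narrow and well-scoped")
    if task_type in FRONTIER_TASKS:
        return Route("frontier", f"{task_type} needs open-ended capability")
    # Unknown task types go to the stronger model until they are measured.
    return Route("frontier", "unclassified task, defaulting to frontier")


def dispatch(task_type: str, prompt: str,
             handlers: Dict[str, Callable[[str], str]]) -> str:
    """Send the prompt to whichever handler owns the chosen lane."""
    return handlers[route(task_type).lane](prompt)
```

Defaulting unknown task types to the frontier lane is a design choice for this sketch: it trades cost for safety until the eval set shows the small model is good enough.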

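What small models cannot do

Compete with frontier models on open-ended reasoning
Handle long context reliably: on-device deployments are often limited to roughly 8-32K tokens in practice, even when a model advertises a larger window
Stay current: they don't learn from new data without re-training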
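
One reason long context is hard on-device is memory: the attention key/value cache grows linearly with the number of tokens. The sketch below estimates that cache size using the standard formula (keys plus values, per layer, per KV head). The layer count, head count, and head dimension are illustrative assumptions for a small model, not figures from this lesson or any specific model's published configuration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The factor of 2 covers keys and values; bytes_per_value=2 assumes
    16-bit cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative shape for a small model (assumed values, for the arithmetic only).
ILLUSTRATIVE = dict(n_layers=28, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for seq_len in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(seq_len=seq_len, **ILLUSTRATIVE) / 2**30
        print(f"{seq_len:>7} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumed dimensions, a 32,768-token context needs about 3.5 GiB for the cache alone, on top of the model weights, which is a large share of memory on a phone or thin laptop.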

Hybrid is the answer

Most production systems end up hybrid — SLM for fast cheap paths, frontier for the hard ones. Plan the routing layer from day one.
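
A common shape for the hybrid layer is "try the small model first, escalate when unsure". The sketch below assumes the on-device model can return some confidence score alongside its answer; that score, the `slm` and `frontier` callables, and the 0.8 threshold are all hypothetical placeholders to be replaced with your own clients and a threshold tuned on your eval set.

```python
from typing import Callable, Tuple

# Hypothetical stand-ins for real model clients.
SlmFn = Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])
FrontierFn = Callable[[str], str]


def answer_hybrid(prompt: str, slm: SlmFn, frontier: FrontierFn,
                  min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (answer, lane). Use the SLM when it is confident enough."""
    try:
        answer, confidence = slm(prompt)
    except Exception:
        # Any on-device failure (out of memory, timeout, ...) escalates.
        return frontier(prompt), "frontier"
    if confidence >= min_confidence:
        return answer, "slm"
    return frontier(prompt), "frontier"
```

Returning the lane with the answer makes it easy to log how often requests escalate, which is the number that drives the cost of a hybrid system.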

Key terms in this lesson

SLM
on-device
Phi
Gemma
Llama

Benchmark before committing

Run your actual task samples against candidate models before choosing. Leaderboard rankings don't predict task-specific performance reliably.
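
A task-specific benchmark can start very small. The sketch below scores each candidate on your own labeled samples with exact-match accuracy; the candidate callables and the sample format are assumptions for illustration, and exact match only suits tasks with short, fixed answers such as classification.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (input text, expected label)


def accuracy(model: Callable[[str], str], samples: List[Sample]) -> float:
    """Exact-match accuracy, ignoring case and surrounding whitespace."""
    if not samples:
        raise ValueError("need at least one sample")
    correct = sum(
        1 for text, expected in samples
        if model(text).strip().lower() == expected.strip().lower()
    )
    return correct / len(samples)


def compare(candidates: Dict[str, Callable[[str], str]],
            samples: List[Sample]) -> Dict[str, float]:
    """Score every candidate model on the same samples."""
    return {name: accuracy(fn, samples) for name, fn in candidates.items()}
```

Running every candidate on the same samples keeps the comparison fair; swap in a different scoring function for extraction or generation tasks where exact match is too strict.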
