Mixture-of-Experts Models: Mixtral, DeepSeek, Qwen MoE
How MoE models work and when they're the right choice for your stack.
Lesson map
What this lesson covers
Learning path
The main moves in order
- 1. The premise
- 2. AI and mixture-of-experts cost implications
- 3. The premise
- 4. AI Mixture of Experts: Why Some Models Are Faster Than Their Size
Section 1
The premise
MoE models trade memory for compute: the total parameter count is high, but only a small fraction of those parameters is active for any given token, so compute per token stays low.
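To make the tradeoff concrete, here is a minimal back-of-the-envelope sketch in Python. The parameter counts are invented, and the rule of thumb (decode compute of roughly 2 FLOPs per active parameter per token) is a rough approximation, not a precise cost model.

```python
# Back-of-the-envelope sketch (illustrative numbers only): compare per-token
# decode compute for a dense model and an MoE model using the common
# approximation FLOPs/token ~= 2 * active_parameters.

def flops_per_token(active_params: float) -> float:
    """Rough decode-time FLOPs per generated token."""
    return 2 * active_params

dense_params = 70e9                # hypothetical 70B dense model
moe_total_params = 600e9           # hypothetical MoE: 600B total weights
moe_active_params = 35e9           # ...but only ~35B touched per token

print(f"dense 70B : {flops_per_token(dense_params):.2e} FLOPs/token")
print(f"MoE 600B  : {flops_per_token(moe_active_params):.2e} FLOPs/token")
# Compute per token tracks ACTIVE parameters, but every one of the 600B
# weights must still sit in accelerator memory to be routable.
```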
What AI does well here
- Deliver large-model quality at small-model latency per token.
- Scale capacity without proportional compute increase.
- Handle diverse tasks via expert routing.
What AI cannot do
- Run cheaply on memory-constrained hardware.
- Always beat dense models on reasoning.
Section 2
AI and mixture-of-experts cost implications
Section 3
The premise
MoE marketing focuses on active parameters. Your bill, GPU memory, and tail latency depend on the full footprint and routing behavior.
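A rough sketch of why the full footprint matters for serving: weight memory scales with total parameters, because every expert must be resident even when only a few are active per token. All numbers below (parameter counts, 80 GB per GPU) are assumptions for illustration, and real deployments also need headroom for KV cache, activations, and runtime overhead.

```python
# Rough serving-footprint sketch (assumed, illustrative numbers): GPU memory
# is driven by TOTAL parameters, not active parameters.
import math

def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    return total_params * bytes_per_param / 1e9

total_params = 600e9        # hypothetical MoE total parameter count
active_params = 35e9        # hypothetical active parameters per token
gpu_memory_gb = 80          # e.g. one 80 GB accelerator

for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    weights = weight_memory_gb(total_params, bytes_per_param)
    # Minimum GPUs just to hold the weights; KV cache and activations
    # need additional headroom on top of this.
    min_gpus = math.ceil(weights / gpu_memory_gb)
    print(f"{precision}: ~{weights:.0f} GB of weights -> at least {min_gpus} GPUs")

# A naive estimate from active parameters alone looks like a single-GPU model:
print(f"active-only (fp16): ~{weight_memory_gb(active_params, 2):.0f} GB")
```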
What AI does well here
- Distinguish active vs total parameters.
- Estimate memory and latency profile.
- Suggest tests for routing instability (see the sketch after these lists).
What AI cannot do
- Predict per-query cost without testing.
- Avoid memory headroom needs.
- Promise stable routing across versions.
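One way to probe routing behavior is to log which expert each token is routed to on a fixed prompt set and measure how skewed the distribution is, since hot experts become the latency bottleneck. The sketch below is a hypothetical check, not a standard tool; the function name, the toy assignments, and the interpretation of the ratio are assumptions.

```python
# Minimal sketch of a routing-imbalance check: count tokens per expert and
# flag heavy skew. (Hypothetical helper, not part of any MoE library.)
from collections import Counter

def expert_load_report(expert_assignments: list[int], num_experts: int) -> dict:
    """expert_assignments: the expert index chosen for each routed token."""
    counts = Counter(expert_assignments)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = sum(loads) / num_experts
    return {
        "loads": loads,
        "max_over_mean": max(loads) / mean_load if mean_load else float("inf"),
        "unused_experts": sum(1 for load in loads if load == 0),
    }

# Toy example: 8 experts, 1000 tokens, traffic concentrated on a few experts.
assignments = [0] * 400 + [1] * 200 + [2] * 150 + [3] * 100 + [4] * 150
print(expert_load_report(assignments, num_experts=8))
# max_over_mean well above 1.0 and several unused experts signal skewed routing.
```

Rerunning the same check across model versions or prompt mixes gives a cheap signal of whether routing, and therefore latency, has shifted.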
Section 4
AI Mixture of Experts: Why Some Models Are Faster Than Their Size
Section 5
The premise
Mixture-of-experts architectures route each token to a small subset of specialized 'experts,' so a model with roughly 600B total parameters can generate tokens about as cheaply, in compute per token, as a roughly 30B dense model.
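Here is a minimal sketch of the routing idea in NumPy: a learned router scores each token against every expert, keeps the top-k, and mixes only those experts' outputs. The shapes and numbers are invented, and real MoE layers add load-balancing losses, capacity limits, and fused kernels on top of this.

```python
# Minimal top-k expert routing sketch (pure NumPy, invented shapes).
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d_model, num_tokens = 8, 2, 16, 4

router_w = rng.normal(size=(d_model, num_experts))            # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
tokens = rng.normal(size=(num_tokens, d_model))

for t, x in enumerate(tokens):
    logits = x @ router_w                                      # score experts
    top = np.argsort(logits)[-top_k:]                          # pick top-k
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalize
    y = sum(w * (x @ experts[e]) for w, e in zip(weights, top))
    print(f"token {t}: experts {top.tolist()}, output norm {np.linalg.norm(y):.2f}")

# Only `top_k` of the `num_experts` expert matrices are multiplied per token,
# which is why compute tracks active parameters rather than total parameters.
```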
What AI does well here
- Explain why some 'huge' models are cheap to serve
- Understand cost-per-token differences across vendors
- Compare apparent vs active parameter counts
- Inform architecture choices when self-hosting
What AI cannot do
- Make MoE strictly better than dense — there are tradeoffs
- Guarantee consistent latency under uneven expert load
- Replace good evals with architecture trivia
- Tell you which experts activate for your prompt
Related lessons
Keep going
Creators · 20 min
Mixtral and MoE: Many Experts, Fewer Active Weights
Mixtral-style mixture-of-experts models teach an important local-model idea: total parameters and active parameters are not the same thing.
Creators · 40 min
Mixture-of-Experts: Why MoE Models Behave Differently
Mixture-of-experts architectures route tokens through specialized sub-networks — and the routing creates eval and serving behaviors single-dense models do not have.
Creators · 20 min
Text Generation Inference: Production Serving Concepts
Hugging Face Text Generation Inference is a useful teaching example for production model serving: router, model server, streaming, and operational controls.
