If you must self-host, pick a serving stack by throughput, model fit, and ops effort — not by GitHub stars.
11 min · Reviewed 2026
The premise
Self-hosting LLMs trades cost-per-token for ops complexity. The serving framework is a major lever on both.
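To make that trade concrete, here is a back-of-the-envelope cost comparison. Every figure in it (GPU rental rate, sustained throughput, utilization, API price) is a hypothetical placeholder rather than a benchmark; the point is the shape of the arithmetic, not the numbers.

```python
# Back-of-the-envelope: self-hosted cost per million output tokens vs. an API.
# Every number below is a hypothetical placeholder -- substitute your own quotes
# and your own measured throughput before drawing any conclusion.

gpu_hourly_cost = 2.50          # USD/hour for one rented GPU (assumed)
tokens_per_second = 1_500       # sustained output tokens/s at your batch sizes (assumed)
utilization = 0.40              # fraction of the day the GPU is actually serving (assumed)

tokens_per_hour = tokens_per_second * 3600 * utilization
self_hosted_cost_per_m = gpu_hourly_cost / (tokens_per_hour / 1_000_000)

api_cost_per_m = 10.00          # USD per million output tokens from an API (assumed)

print(f"self-hosted: ${self_hosted_cost_per_m:.2f} per 1M tokens")
print(f"API:         ${api_cost_per_m:.2f} per 1M tokens")
# The ops side of the trade (on-call, upgrades, capacity planning) never
# shows up in this arithmetic -- which is exactly the lesson's point.
```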
What AI does well here
Compare serving stacks on throughput, model coverage, and batching support.
Map to your traffic shape.
Identify GPU memory ceilings (a rough sizing sketch follows this list).
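A quick way to sanity-check a memory ceiling before touching hardware is to estimate weights plus KV cache from the model's published dimensions. A minimal sketch, assuming a dense decoder-only model served in fp16; the dimensions in the example call are illustrative placeholders, not any specific model's config.

```python
# Rough GPU-memory estimate for a dense decoder-only model served in fp16.
# All model dimensions below are illustrative placeholders; read the real
# values from the model's config before trusting the result.

def estimate_vram_gb(params_billion: float,
                     n_layers: int,
                     n_kv_heads: int,
                     head_dim: int,
                     max_tokens_in_flight: int,
                     bytes_per_param: int = 2) -> float:
    """Weights + KV cache, ignoring activations and framework overhead."""
    weights_bytes = params_billion * 1e9 * bytes_per_param
    # KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * bytes.
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_param
    kv_bytes = kv_bytes_per_token * max_tokens_in_flight
    return (weights_bytes + kv_bytes) / 1e9

# Hypothetical 7B-class model: 32 layers, 8 KV heads, head_dim 128,
# with 32k tokens resident across all concurrent requests.
print(f"{estimate_vram_gb(7, 32, 8, 128, 32_000):.1f} GB needed")
```

Treat the result as a floor, not a budget: real servers also reserve headroom for activations, scheduling, and memory fragmentation.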
What AI cannot do
Replace a load test (a minimal one is sketched after this list).
Predict price/perf after a hardware swap.
Substitute for an SRE on call.
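To make the load-test point concrete: even a crude script that fires your real concurrency level at the endpoint tells you things no comparison table can. A minimal sketch using Python's asyncio and httpx against an OpenAI-compatible completions endpoint (such as the one vLLM's server exposes); the URL, model name, prompt, and concurrency value are placeholders for your own.

```python
# Minimal concurrency probe against an OpenAI-compatible completions endpoint.
# URL, model name, prompt, and concurrency are placeholders -- use your own.
import asyncio
import time

import httpx  # pip install httpx

URL = "http://localhost:8000/v1/completions"   # e.g. a local serving endpoint
CONCURRENCY = 32                               # set this to YOUR real traffic level
PAYLOAD = {"model": "your-model", "prompt": "Hello", "max_tokens": 128}

async def one_request(client: httpx.AsyncClient) -> float:
    """Send one completion request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    resp = await client.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

async def main() -> None:
    async with httpx.AsyncClient() as client:
        t0 = time.perf_counter()
        latencies = await asyncio.gather(
            *(one_request(client) for _ in range(CONCURRENCY))
        )
        wall = time.perf_counter() - t0
    latencies.sort()
    p95 = latencies[int(0.95 * len(latencies))]
    print(f"{CONCURRENCY} requests in {wall:.1f}s "
          f"({CONCURRENCY / wall:.1f} req/s), p95 latency {p95:.1f}s")

asyncio.run(main())
```

Run it from the network location your real clients will use; a localhost-only test hides the latency your users will actually see.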
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-tools-AI-and-self-hosted-LLM-deployment-tools-r9a1-creators
What is the primary trade-off when choosing to self-host an LLM instead of using an API service?
Self-hosting eliminates the need for GPU hardware
Self-hosting automatically improves model accuracy
Self-hosting trades lower cost-per-token for higher operational complexity
Self-hosting reduces latency but increases storage costs
Which of the following is explicitly listed as a criterion for comparing different serving stacks in the material?
Batching support
Community forum size
License cost
GitHub star count
A serving stack's 'model coverage' refers to what?
The number of concurrent users supported
The number of API endpoints available
Which model architectures and sizes the framework can run
The geographic distribution of deployment regions
Why might a single-GPU demonstration of a serving stack be misleading for production decisions?
Multi-GPU and multi-node setups introduce additional complexity and failure modes
Single-GPU demos are too fast to measure accurately
GPU count does not affect KV-cache behavior
Single-GPU demos show the maximum potential performance
Why can AI tools not fully replace actual load testing when evaluating serving stacks?
AI cannot simulate the actual concurrent requests, memory pressure, and hardware behavior of real traffic
Load testing is illegal in most jurisdictions
Load testing provides no useful data
AI has already tested all serving stacks exhaustively
What does 'throughput' measure in the context of LLM serving?
The latency of a single token generation
The total memory used by the model
The number of requests completed per second
The time to generate the first token
What does a 'GPU memory ceiling' refer to?
The largest model size that can fit in GPU memory for inference
The maximum number of GPUs you can purchase
The highest price you should pay for GPU rental
The maximum electricity consumption of a GPU
Why would an AI be unable to accurately predict performance after swapping GPU hardware?
Predictions are always accurate
GPU hardware cannot be swapped
AI systems do not understand hardware
Hardware changes affect memory bandwidth, interconnect speed, and CUDA kernel optimization in ways that require empirical testing
What role does the KV-cache play in LLM inference?
It encrypts the input prompts
It balances load across multiple servers
It caches previously computed key-value pairs to avoid redundant computation
It stores the compiled model weights
What operational role can an AI not substitute when self-hosting LLMs?
A marketing specialist
A data scientist
An SRE (Site Reliability Engineer) on call
A software developer
Batching in LLM serving is used to:
Split a single request into multiple smaller requests
Process multiple requests together to improve GPU utilization
Encrypt requests for security
Reduce the model size for faster loading
Why might selecting a serving stack based on GitHub stars be problematic?
Stars cause increased latency
GitHub stars measure popularity, not throughput, model coverage, or operational fit for your specific use case
GitHub stars are fake
GitHub stars indicate nothing useful
What load should you test at before committing to a serving stack for production?
Your actual concurrency level
Only the demo environment
The license agreement
The marketing materials
vLLM and TGI are examples of what?
LLM models themselves
LLM serving frameworks
GPU hardware vendors
Cloud API providers
A bursty traffic pattern with sudden spikes would most likely stress which aspect of your serving infrastructure?