Few-Shot Example Curation: Quality, Rotation, and Counter-Examples, Part 1
Chain-of-thought prompts show real performance gains on reasoning tasks — and zero benefit on tasks that don't need reasoning. Here's how to tell which is which.
40 min · Reviewed 2026
The premise
Chain-of-thought is not a universal upgrade; it helps on reasoning-bound tasks and is overhead everywhere else.
What AI does well here
Use CoT on tasks requiring multi-step reasoning (math, complex logic, multi-constraint problems)
Use few-shot CoT examples on reasoning tasks where the structure of reasoning matters
Hide CoT from end-user output when the reasoning adds no user-facing value
Evaluate with and without CoT to confirm benefit on YOUR task (see the eval sketch after these lists)
What AI cannot do
Make non-reasoning tasks better with CoT (it just adds tokens)
Make CoT a substitute for fine-tuning on hard reasoning tasks
Trust the reasoning trace as ground truth (models can produce plausible-but-wrong reasoning)
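A minimal sketch of that with/without comparison, assuming a hypothetical call_model stub and a one-item test set; the prompts, the test question, and the 'Answer:' marker are illustrative placeholders, not any specific provider's API.

```python
import time

def call_model(prompt: str) -> str:
    # Placeholder -- replace with a real LLM client call.
    return "Answer: 0"

TEST_SET = [
    # (question, known correct final answer); keep it representative of YOUR task
    ("Ana has 3 boxes of 12 pens and gives away 7. How many pens remain?", "29"),
]

BASE_PROMPT = "Answer the question. Reply with the final answer only.\n\nQ: {q}\nA:"
COT_PROMPT = (
    "Answer the question. Reason step by step, then give the final answer "
    "on a line starting with 'Answer:'.\n\nQ: {q}\n"
)

def extract_answer(text: str) -> str:
    # Validate the final answer, not the trace: take what follows the last marker.
    marker = "Answer:"
    return text.rsplit(marker, 1)[-1].strip() if marker in text else text.strip()

def evaluate(template: str) -> dict:
    correct, latencies = 0, []
    for question, expected in TEST_SET:
        start = time.perf_counter()
        output = call_model(template.format(q=question))
        latencies.append(time.perf_counter() - start)
        correct += int(extract_answer(output) == expected)
    return {
        "accuracy": correct / len(TEST_SET),
        "avg_latency_s": round(sum(latencies) / len(latencies), 4),
    }

if __name__ == "__main__":
    # Compare accuracy AND latency: CoT adds tokens even when it adds accuracy.
    print("without CoT:", evaluate(BASE_PROMPT))
    print("with CoT:   ", evaluate(COT_PROMPT))
```

The structure is the point: the same labeled test set, the same answer check, and both prompt variants, so accuracy and latency move together in one report.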
Curating Few-Shot Examples for an LLM Prompt — Quality vs. Quantity
The premise
Few-shot examples teach the model your edge cases and your style — pick them like you're picking a teaching set, not filler.
What AI does well here
Cover the three most common input shapes you actually see
Include one tricky edge case your last release got wrong
Match the exact output format you expect, character for character (see the prompt-assembly sketch after these lists)
Surface the reasoning step if you want the model to externalize one
What AI cannot do
Generalize to a regime far outside the chosen examples
Replace a clear instruction — examples and instructions reinforce each other
Stay relevant when your data distribution shifts
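A minimal sketch of how such a teaching set might be assembled into a prompt; the extraction task, field names, and example contents are invented for illustration.

```python
import json

# Hand-curated teaching set: the three most common input shapes you actually see,
# plus one edge case a previous release got wrong. Contents are placeholders.
EXAMPLES = [
    {"input": "Invoice #4411 from Acme, due 2024-05-01, $1,200",
     "output": {"vendor": "Acme", "due": "2024-05-01", "amount_usd": 1200.0}},
    {"input": "Acme invoice 4412 -- $980 payable May 3rd 2024",
     "output": {"vendor": "Acme", "due": "2024-05-03", "amount_usd": 980.0}},
    {"input": "Reminder: pay Globex $2,050 by 06/15/2024 (inv. 77)",
     "output": {"vendor": "Globex", "due": "2024-06-15", "amount_usd": 2050.0}},
    # Edge case: no amount present -- the field must be null, not guessed.
    {"input": "Initech invoice 9001, amount to follow, due 2024-07-01",
     "output": {"vendor": "Initech", "due": "2024-07-01", "amount_usd": None}},
]

INSTRUCTION = (
    "Extract vendor, due date (ISO 8601), and amount in USD from the text. "
    "Reply with one JSON object exactly matching the format of the examples."
)

def build_prompt(new_input: str) -> str:
    # The expected output format is shown character for character via json.dumps.
    shots = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {json.dumps(ex['output'])}" for ex in EXAMPLES
    )
    return f"{INSTRUCTION}\n\n{shots}\n\nInput: {new_input}\nOutput:"

if __name__ == "__main__":
    print(build_prompt("Pay Acme $310 for invoice 4413 by 2024-08-09"))
```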
Using Negative Examples in LLM Prompts — When 'Don't Do This' Helps
The premise
Showing the model an explicit wrong answer alongside the right one prevents specific failure modes — but can also seed them if done sloppily.
What AI does well here
Pair every negative example with the corrected version
Label them clearly: 'BAD' and 'GOOD' tagged blocks (see the sketch after these lists)
Use sparingly — one or two negatives, not ten
Reserve them for failures you have actually observed
What AI cannot do
Substitute for clear positive examples
Stop the model from repeating the bad pattern in adjacent contexts
Generalize 'don't do X' beyond the literal example
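A minimal sketch of a paired BAD/GOOD block inside a prompt; the extraction task, the tag wording, and the example text are placeholders chosen for illustration.

```python
# One observed failure paired with its correction, clearly tagged.
# Task, wording, and tags are illustrative; keep negatives to one or two.
NEGATIVE_PAIR = """\
BAD (invents a date that is not in the source):
Input: "Meeting moved, new time TBD."
Output: {"meeting_date": "2024-06-01"}

GOOD (corrected version of the same case):
Input: "Meeting moved, new time TBD."
Output: {"meeting_date": null}
"""

INSTRUCTION = (
    "Extract the meeting date as ISO 8601, or null if no date is stated. "
    "Never invent a date."
)

def build_prompt(positive_examples: str, new_input: str) -> str:
    # Positive examples carry the teaching load; the BAD/GOOD pair only
    # pins down a failure that was actually observed.
    return (
        f"{INSTRUCTION}\n\n{positive_examples}\n\n{NEGATIVE_PAIR}\n"
        f'Input: "{new_input}"\nOutput:'
    )

if __name__ == "__main__":
    positives = 'Input: "Sync on 2024-05-09 at 10am."\nOutput: {"meeting_date": "2024-05-09"}'
    print(build_prompt(positives, "Let's find a slot soon."))
```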
Rotating Few-Shot Examples to Prevent Overfitting
The premise
Maintain an example pool larger than what fits in the prompt, and sample N examples per call with a stable hash for reproducibility (a minimal sampler sketch follows the lists below).
What AI does well here
Reduce mimicry of one phrasing
Surface examples evenly over time
Detect example-set bugs faster
What AI cannot do
Replace evaluation against held-out cases
Compensate for biased pool composition
Guarantee any single output's quality
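A minimal sketch of that rotation, assuming a hypothetical per-request key; the pool contents and the key format are placeholders.

```python
import hashlib
import random

# Pool larger than what fits in one prompt; contents are placeholders.
EXAMPLE_POOL = [f"example_{i}" for i in range(40)]

def select_examples(request_key: str, n: int = 5) -> list[str]:
    """Pick n examples deterministically for a given request key.

    A stable hash (not Python's salted built-in hash()) means the same key
    always gets the same examples, so failures are reproducible, while
    different keys rotate through the whole pool over time.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # local RNG; global random state is untouched
    return rng.sample(EXAMPLE_POOL, n)

if __name__ == "__main__":
    print(select_examples("user-123:ticket-9"))
    print(select_examples("user-123:ticket-9"))  # identical to the line above
    print(select_examples("user-456:ticket-2"))  # a different draw from the pool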
Prompting AI: few-shot examples that actually transfer
The premise
Few-shot examples teach the model your output shape and your edge-case handling. Examples that all look alike teach only the easy case; the model fails on anything off-distribution (see the coverage sketch after the lists below).
What AI does well here
Match the format of provided examples in new outputs
Generalize patterns shown across diverse examples
Handle cases similar to ones you demonstrated
What AI cannot do
Generalize from examples that all look the same
Recover gracefully from a case unlike any example
Tell you when an example was a poor choice
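A rough sketch of checking whether the example set covers the input shapes production actually sends, rather than clustering on the easy case; the shape heuristic, bucket names, and sample strings are invented for illustration.

```python
from collections import Counter

def input_shape(text: str) -> str:
    # Crude shape heuristic -- derive real buckets from the variation you actually see.
    if "?" in text:
        return "question"
    if any(ch.isdigit() for ch in text):
        return "contains_numbers"
    if len(text.split()) > 60:
        return "long_form"
    return "short_statement"

def coverage_report(examples: list[str], production_inputs: list[str]) -> None:
    have = Counter(input_shape(e) for e in examples)
    seen = Counter(input_shape(p) for p in production_inputs)
    for shape, count in seen.most_common():
        n_examples = have.get(shape, 0)
        flag = "  <-- no example covers this shape" if n_examples == 0 else ""
        print(f"{shape:18} production={count:4}  examples={n_examples}{flag}")

if __name__ == "__main__":
    examples = ["Summarize: the quarterly numbers rose 4%.",
                "Summarize: revenue grew 7% in Q2."]
    production = ["Summarize: the quarterly numbers rose 4%.",
                  "What changed since last quarter?",
                  "Summarize this 2,000-word postmortem ..."]
    coverage_report(examples, production)
```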
AI Prompting: Choose Few-Shot vs Fine-Tune Without Burning a Quarter
The premise
Teams over-invest in fine-tuning when 5-10 strong few-shot examples would solve the task; they also avoid fine-tuning when the cost arithmetic actually favors it.
What AI does well here
Score whether the task is a style problem or a knowledge problem
Estimate prompt-token cost with examples included
Compare against fine-tune training and inference cost (see the arithmetic sketch after these lists)
Recommend evals to compare both
What AI cannot do
Account for hidden ops cost of maintaining a fine-tune
Predict whether the model provider will release a better base model
Replace a real eval comparison
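A back-of-the-envelope sketch of that comparison; every price, token count, and traffic figure below is an assumed placeholder, not any provider's actual rate.

```python
# Back-of-the-envelope comparison; all numbers are placeholders --
# substitute your provider's real prices and your real traffic.
CALLS_PER_MONTH = 500_000
FEW_SHOT_EXAMPLE_TOKENS = 1_800              # extra prompt tokens the examples add per call
INPUT_PRICE_PER_MTOK = 3.00                  # $ per million input tokens (assumed)
FINE_TUNE_TRAINING_COST = 400.00             # one-off training cost (assumed)
FINE_TUNED_SURCHARGE_PER_MTOK = 1.50         # extra per-token cost of serving a fine-tune (assumed)
BASE_PROMPT_TOKENS = 600                     # tokens per call excluding the examples

def monthly_few_shot_cost() -> float:
    extra_tokens = CALLS_PER_MONTH * FEW_SHOT_EXAMPLE_TOKENS
    return extra_tokens / 1e6 * INPUT_PRICE_PER_MTOK

def monthly_fine_tune_cost(amortize_months: int = 12) -> float:
    serving_tokens = CALLS_PER_MONTH * BASE_PROMPT_TOKENS
    serving = serving_tokens / 1e6 * FINE_TUNED_SURCHARGE_PER_MTOK
    return serving + FINE_TUNE_TRAINING_COST / amortize_months

if __name__ == "__main__":
    print(f"few-shot examples:  ${monthly_few_shot_cost():,.0f}/month")
    print(f"fine-tune route:    ${monthly_fine_tune_cost():,.0f}/month")
    # The arithmetic ignores the ops cost of maintaining the fine-tune,
    # which is exactly the part listed above as something AI cannot account for.
```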
AI and few-shot example selection
The premise
Few-shot examples teach the model the shape of the answer. Choosing diverse, edge-leaning examples beats stacking similar ones.
What AI does well here
Suggest covering edge cases in examples
Help format examples consistently
Spot when examples contradict each other
What AI cannot do
Know which examples your model will weight most
Replace systematic eval
Guarantee a behavior change from one swap
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-prompting-chain-of-thought-creators
A developer tests chain-of-thought prompting on three different tasks: (1) extracting names from emails, (2) solving logic puzzles, and (3) generating a welcome message. On which task would CoT be MOST likely to improve the AI's output quality?
Extracting names from emails, because it requires careful parsing
Solving logic puzzles, because it requires multi-step reasoning
Generating a welcome message, because it requires creativity
None of these tasks would benefit from CoT
A student adds step-by-step reasoning to a prompt asking an AI to rewrite a paragraph in simpler language. What is the MOST likely effect?
The output will be more concise
The AI will be able to explain its rewriting choices better
The cost and latency will increase without improving quality
The rewritten paragraph will be more accurate
An AI produces a chain-of-thought trace showing each step of its calculation, then gives a final answer. Why should you NOT treat the reasoning trace as ground truth?
The model can generate plausible-sounding but incorrect reasoning
The trace is never displayed to users
The reasoning uses too much memory
The trace is generated after the answer, not before
When should you HIDE the chain-of-thought reasoning from the end user?
When the user didn't ask for explanation
When the reasoning adds no user-facing value and would just add noise
When the reasoning contains mathematical symbols
When the reasoning is longer than the final answer
A team is evaluating whether chain-of-thought helps their task. They create two prompt versions and measure accuracy on both. What other metrics should they measure?
How confident the model sounds
Only accuracy, since that's what matters most
The length of the reasoning trace
Latency and cost, because CoT adds tokens and processing time
A developer wants to use few-shot chain-of-thought examples for a reasoning task. What is the PRIMARY purpose of including these examples?
To demonstrate the structure and format of reasoning the task requires
To show the model what the correct answer is
To make the prompt easier for users to read
To reduce the number of tokens in the prompt
An AI fails a complex math problem even when using chain-of-thought prompting. What should the developer try NEXT?
Fine-tune the model on math problems
Add more CoT examples to the prompt
Use a different approach entirely since CoT failed
Ask the AI to try again with different wording
What does it mean to 'validate the answer, not the trace'?
The answer should match what the model previously said
The final answer should be checked against ground truth regardless of how the reasoning looks
The reasoning trace should be saved for later analysis
The trace should be hidden to prevent bias
A product team implements CoT for their AI assistant. During evaluation, they discover CoT improves accuracy on their test set but doubles the response time. What should they conclude?
The trade-off should be analyzed against their latency requirements
CoT is beneficial and should be used
CoT always helps on reasoning tasks
CoT introduces new failure modes
Which scenario represents 'theater' (using CoT where it provides no real benefit)?
Adding CoT to a prompt that just asks for a yes/no answer
Using CoT for math word problems
Using CoT to debug code errors
Using CoT for a complex scheduling problem with many constraints
When setting up an evaluation to test if CoT helps your specific task, what is required for the test set?
A representative sample with known correct answers
At least 1,000 examples for statistical significance
Examples that the AI has never seen before
Only difficult examples to test limits
What failure mode should you specifically look for when evaluating whether CoT introduces new problems?
CoT causes the model to refuse more requests
The reasoning trace is too long to read
The model uses too many tokens
CoT produces convincing but incorrect reasoning that leads to wrong answers
An AI assistant uses CoT internally to solve a user's problem but only shows the final answer. Why might this be the RIGHT approach?
Chain-of-thought is always hidden to prevent confusion
The reasoning is internal work that adds no user value
The model shouldn't show its work
The user doesn't care about reasoning
What distinguishes a reasoning-bound task from a non-reasoning task?
Reasoning tasks require multiple steps or logic to reach an answer
Reasoning tasks use more tokens
Reasoning tasks are only about math
Reasoning tasks are always longer
A developer tests CoT and finds it improves accuracy but the reasoning traces sometimes contain logical errors. What should they do?
Ignore the errors since accuracy improved
Stop using CoT entirely since it's unreliable
Remove the reasoning traces from the output
Continue using CoT but validate the final answers against ground truth