Few-Shot Example Curation: Quality, Rotation, and Counter-Examples, Part 1
Chain-of-thought prompts show real performance gains on reasoning tasks — and zero benefit on tasks that don't need reasoning. Here's how to tell which is which.
40 min · Reviewed 2026
The premise
Chain-of-thought is not a universal upgrade; it helps on reasoning-bound tasks and is overhead everywhere else.
What AI does well here
Use CoT on tasks requiring multi-step reasoning (math, complex logic, multi-constraint problems)
Use few-shot CoT examples on reasoning tasks where the structure of reasoning matters
Hide CoT from end-user output when the reasoning adds no user-facing value
Evaluate with and without CoT to confirm benefit on YOUR task (see the eval sketch after these lists)
What AI cannot do
Make non-reasoning tasks better with CoT (it just adds tokens)
Make CoT a substitute for fine-tuning on hard reasoning tasks
Trust the reasoning trace as ground truth (models can produce plausible-but-wrong reasoning)
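A minimal sketch of that with/without comparison, assuming a hypothetical call_model stub and a one-item test set; the prompts, the test question, and the 'Answer:' marker are illustrative placeholders, not any specific provider's API.

```python
import time

def call_model(prompt: str) -> str:
    # Placeholder -- replace with a real LLM client call.
    return "Answer: 0"

TEST_SET = [
    # (question, known correct final answer); keep it representative of YOUR task
    ("Ana has 3 boxes of 12 pens and gives away 7. How many pens remain?", "29"),
]

BASE_PROMPT = "Answer the question. Reply with the final answer only.\n\nQ: {q}\nA:"
COT_PROMPT = (
    "Answer the question. Reason step by step, then give the final answer "
    "on a line starting with 'Answer:'.\n\nQ: {q}\n"
)

def extract_answer(text: str) -> str:
    # Validate the final answer, not the trace: take what follows the last marker.
    marker = "Answer:"
    return text.rsplit(marker, 1)[-1].strip() if marker in text else text.strip()

def evaluate(template: str) -> dict:
    correct, latencies = 0, []
    for question, expected in TEST_SET:
        start = time.perf_counter()
        output = call_model(template.format(q=question))
        latencies.append(time.perf_counter() - start)
        correct += int(extract_answer(output) == expected)
    return {
        "accuracy": correct / len(TEST_SET),
        "avg_latency_s": round(sum(latencies) / len(latencies), 4),
    }

if __name__ == "__main__":
    # Compare accuracy AND latency: CoT adds tokens even when it adds accuracy.
    print("without CoT:", evaluate(BASE_PROMPT))
    print("with CoT:   ", evaluate(COT_PROMPT))
```

The structure is the point: the same labeled test set, the same answer check, and both prompt variants, so accuracy and latency move together in one report.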
Curating Few-Shot Examples for an LLM Prompt — Quality vs. Quantity
The premise
Few-shot examples teach the model your edge cases and your style — pick them like you're picking a teaching set, not filler.
What AI does well here
Cover the three most common input shapes you actually see
Include one tricky edge case your last release got wrong
Match the exact output format you expect, character for character (see the prompt-assembly sketch after these lists)
Surface the reasoning step if you want the model to externalize one
What AI cannot do
Generalize to a regime far outside the chosen examples
Replace a clear instruction — examples and instructions reinforce each other
Stay relevant when your data distribution shifts
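A minimal sketch of how such a teaching set might be assembled into a prompt; the extraction task, field names, and example contents are invented for illustration.

```python
import json

# Hand-curated teaching set: the three most common input shapes you actually see,
# plus one edge case a previous release got wrong. Contents are placeholders.
EXAMPLES = [
    {"input": "Invoice #4411 from Acme, due 2024-05-01, $1,200",
     "output": {"vendor": "Acme", "due": "2024-05-01", "amount_usd": 1200.0}},
    {"input": "Acme invoice 4412 -- $980 payable May 3rd 2024",
     "output": {"vendor": "Acme", "due": "2024-05-03", "amount_usd": 980.0}},
    {"input": "Reminder: pay Globex $2,050 by 06/15/2024 (inv. 77)",
     "output": {"vendor": "Globex", "due": "2024-06-15", "amount_usd": 2050.0}},
    # Edge case: no amount present -- the field must be null, not guessed.
    {"input": "Initech invoice 9001, amount to follow, due 2024-07-01",
     "output": {"vendor": "Initech", "due": "2024-07-01", "amount_usd": None}},
]

INSTRUCTION = (
    "Extract vendor, due date (ISO 8601), and amount in USD from the text. "
    "Reply with one JSON object exactly matching the format of the examples."
)

def build_prompt(new_input: str) -> str:
    # The expected output format is shown character for character via json.dumps.
    shots = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {json.dumps(ex['output'])}" for ex in EXAMPLES
    )
    return f"{INSTRUCTION}\n\n{shots}\n\nInput: {new_input}\nOutput:"

if __name__ == "__main__":
    print(build_prompt("Pay Acme $310 for invoice 4413 by 2024-08-09"))
```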
Using Negative Examples in LLM Prompts — When 'Don't Do This' Helps
The premise
Showing the model an explicit wrong answer alongside the right one prevents specific failure modes — but can also seed them if done sloppily.
What AI does well here
Pair every negative example with the corrected version
Label them clearly: 'BAD' and 'GOOD' tagged blocks (see the sketch after these lists)
Use sparingly — one or two negatives, not ten
Reserve them for failures you have actually observed
What AI cannot do
Substitute for clear positive examples
Stop the model from repeating the bad pattern in adjacent contexts
Generalize 'don't do X' beyond the literal example
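A minimal sketch of a paired BAD/GOOD block inside a prompt; the extraction task, the tag wording, and the example text are placeholders chosen for illustration.

```python
# One observed failure paired with its correction, clearly tagged.
# Task, wording, and tags are illustrative; keep negatives to one or two.
NEGATIVE_PAIR = """\
BAD (invents a date that is not in the source):
Input: "Meeting moved, new time TBD."
Output: {"meeting_date": "2024-06-01"}

GOOD (corrected version of the same case):
Input: "Meeting moved, new time TBD."
Output: {"meeting_date": null}
"""

INSTRUCTION = (
    "Extract the meeting date as ISO 8601, or null if no date is stated. "
    "Never invent a date."
)

def build_prompt(positive_examples: str, new_input: str) -> str:
    # Positive examples carry the teaching load; the BAD/GOOD pair only
    # pins down a failure that was actually observed.
    return (
        f"{INSTRUCTION}\n\n{positive_examples}\n\n{NEGATIVE_PAIR}\n"
        f'Input: "{new_input}"\nOutput:'
    )

if __name__ == "__main__":
    positives = 'Input: "Sync on 2024-05-09 at 10am."\nOutput: {"meeting_date": "2024-05-09"}'
    print(build_prompt(positives, "Let's find a slot soon."))
```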
Rotating Few-Shot Examples to Prevent Overfitting
The premise
Maintain an example pool larger than what fits in the prompt, and sample N examples per call with a stable hash for reproducibility (a minimal sampler sketch follows the lists below).
What AI does well here
Reduce mimicry of one phrasing
Surface examples evenly over time
Detect example-set bugs faster
What AI cannot do
Replace evaluation against held-out cases
Compensate for biased pool composition
Guarantee any single output's quality
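A minimal sketch of that rotation, assuming a hypothetical per-request key; the pool contents and the key format are placeholders.

```python
import hashlib
import random

# Pool larger than what fits in one prompt; contents are placeholders.
EXAMPLE_POOL = [f"example_{i}" for i in range(40)]

def select_examples(request_key: str, n: int = 5) -> list[str]:
    """Pick n examples deterministically for a given request key.

    A stable hash (not Python's salted built-in hash()) means the same key
    always gets the same examples, so failures are reproducible, while
    different keys rotate through the whole pool over time.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # local RNG; global random state is untouched
    return rng.sample(EXAMPLE_POOL, n)

if __name__ == "__main__":
    print(select_examples("user-123:ticket-9"))
    print(select_examples("user-123:ticket-9"))  # identical to the line above
    print(select_examples("user-456:ticket-2"))  # a different draw from the pool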
Prompting AI: few-shot examples that actually transfer
The premise
Few-shot examples teach the model your output shape and your edge-case handling. Examples that all look alike teach only the easy case; the model fails on anything off-distribution (see the coverage sketch after the lists below).
What AI does well here
Match the format of provided examples in new outputs
Generalize patterns shown across diverse examples
Handle cases similar to ones you demonstrated
What AI cannot do
Generalize from examples that all look the same
Recover gracefully from a case unlike any example
Tell you when an example was a poor choice
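A rough sketch of checking whether the example set covers the input shapes production actually sends, rather than clustering on the easy case; the shape heuristic, bucket names, and sample strings are invented for illustration.

```python
from collections import Counter

def input_shape(text: str) -> str:
    # Crude shape heuristic -- derive real buckets from the variation you actually see.
    if "?" in text:
        return "question"
    if any(ch.isdigit() for ch in text):
        return "contains_numbers"
    if len(text.split()) > 60:
        return "long_form"
    return "short_statement"

def coverage_report(examples: list[str], production_inputs: list[str]) -> None:
    have = Counter(input_shape(e) for e in examples)
    seen = Counter(input_shape(p) for p in production_inputs)
    for shape, count in seen.most_common():
        n_examples = have.get(shape, 0)
        flag = "  <-- no example covers this shape" if n_examples == 0 else ""
        print(f"{shape:18} production={count:4}  examples={n_examples}{flag}")

if __name__ == "__main__":
    examples = ["Summarize: the quarterly numbers rose 4%.",
                "Summarize: revenue grew 7% in Q2."]
    production = ["Summarize: the quarterly numbers rose 4%.",
                  "What changed since last quarter?",
                  "Summarize this 2,000-word postmortem ..."]
    coverage_report(examples, production)
```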
AI Prompting: Choose Few-Shot vs Fine-Tune Without Burning a Quarter
The premise
Teams over-invest in fine-tuning when 5-10 strong few-shot examples would solve the task; they also avoid fine-tuning when the cost arithmetic actually favors it.
What AI does well here
Score whether the task is a style problem or a knowledge problem
Estimate prompt-token cost with examples included
Compare against fine-tune training and inference cost (see the arithmetic sketch after these lists)
Recommend evals to compare both
What AI cannot do
Account for hidden ops cost of maintaining a fine-tune
Predict whether the model provider will release a better base model
Replace a real eval comparison
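A back-of-the-envelope sketch of that comparison; every price, token count, and traffic figure below is an assumed placeholder, not any provider's actual rate.

```python
# Back-of-the-envelope comparison; all numbers are placeholders --
# substitute your provider's real prices and your real traffic.
CALLS_PER_MONTH = 500_000
FEW_SHOT_EXAMPLE_TOKENS = 1_800              # extra prompt tokens the examples add per call
INPUT_PRICE_PER_MTOK = 3.00                  # $ per million input tokens (assumed)
FINE_TUNE_TRAINING_COST = 400.00             # one-off training cost (assumed)
FINE_TUNED_SURCHARGE_PER_MTOK = 1.50         # extra per-token cost of serving a fine-tune (assumed)
BASE_PROMPT_TOKENS = 600                     # tokens per call excluding the examples

def monthly_few_shot_cost() -> float:
    extra_tokens = CALLS_PER_MONTH * FEW_SHOT_EXAMPLE_TOKENS
    return extra_tokens / 1e6 * INPUT_PRICE_PER_MTOK

def monthly_fine_tune_cost(amortize_months: int = 12) -> float:
    serving_tokens = CALLS_PER_MONTH * BASE_PROMPT_TOKENS
    serving = serving_tokens / 1e6 * FINE_TUNED_SURCHARGE_PER_MTOK
    return serving + FINE_TUNE_TRAINING_COST / amortize_months

if __name__ == "__main__":
    print(f"few-shot examples:  ${monthly_few_shot_cost():,.0f}/month")
    print(f"fine-tune route:    ${monthly_fine_tune_cost():,.0f}/month")
    # The arithmetic ignores the ops cost of maintaining the fine-tune,
    # which is exactly the part listed above as something AI cannot account for.
```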
AI and few-shot example selection
The premise
Few-shot examples teach the model the shape of the answer. Choosing diverse, edge-leaning examples beats stacking similar ones.
What AI does well here
Suggest covering edge cases in examples
Help format examples consistently
Spot when examples contradict each other
What AI cannot do
Know which examples your model will weight most
Replace systematic eval
Guarantee a behavior change from one swap
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-prompting-chain-of-thought-creators
A developer tests chain-of-thought prompting on three different tasks: (1) extracting names from emails, (2) solving logic puzzles, and (3) generating a welcome message. On which task would CoT be MOST likely to improve the AI's output quality?
Extracting names from emails, because it requires careful parsing
Solving logic puzzles, because it requires multi-step reasoning
Generating a welcome message, because it requires creativity
None of these tasks would benefit from CoT
A student adds step-by-step reasoning to a prompt asking an AI to rewrite a paragraph in simpler language. What is the MOST likely effect?
The output will be more concise
The AI will be able to explain its rewriting choices better
The cost and latency will increase without improving quality
The rewritten paragraph will be more accurate
An AI produces a chain-of-thought trace showing each step of its calculation, then gives a final answer. Why should you NOT treat the reasoning trace as ground truth?
The model can generate plausible-sounding but incorrect reasoning
The trace is never displayed to users
The reasoning uses too much memory
The trace is generated after the answer, not before
When should you HIDE the chain-of-thought reasoning from the end user?
When the user didn't ask for explanation
When the reasoning adds no user-facing value and would just add noise
When the reasoning contains mathematical symbols
When the reasoning is longer than the final answer
A team is evaluating whether chain-of-thought helps their task. They create two prompt versions and measure accuracy on both. What other metrics should they measure?
How confident the model sounds
Only accuracy, since that's what matters most
The length of the reasoning trace
Latency and cost, because CoT adds tokens and processing time
A developer wants to use few-shot chain-of-thought examples for a reasoning task. What is the PRIMARY purpose of including these examples?
To demonstrate the structure and format of reasoning the task requires
To show the model what the correct answer is
To make the prompt easier for users to read
To reduce the number of tokens in the prompt
An AI fails a complex math problem even when using chain-of-thought prompting. What should the developer try NEXT?
Fine-tune the model on math problems
Add more CoT examples to the prompt
Use a different approach entirely since CoT failed
Ask the AI to try again with different wording
What does it mean to 'validate the answer, not the trace'?
The answer should match what the model previously said
The final answer should be checked against ground truth regardless of how the reasoning looks
The reasoning trace should be saved for later analysis
The trace should be hidden to prevent bias
A product team implements CoT for their AI assistant. During evaluation, they discover CoT improves accuracy on their test set but doubles the response time. What should they conclude?
The trade-off should be analyzed against their latency requirements
CoT is beneficial and should be used
CoT always helps on reasoning tasks
CoT introduces new failure modes
Which scenario represents 'theater' (using CoT where it provides no real benefit)?
Adding CoT to a prompt that just asks for a yes/no answer
Using CoT for math word problems
Using CoT to debug code errors
Using CoT for a complex scheduling problem with many constraints
When setting up an evaluation to test if CoT helps your specific task, what is required for the test set?
A representative sample with known correct answers
At least 1,000 examples for statistical significance
Examples that the AI has never seen before
Only difficult examples to test limits
What failure mode should you specifically look for when evaluating whether CoT introduces new problems?
CoT causes the model to refuse more requests
The reasoning trace is too long to read
The model uses too many tokens
CoT produces convincing but incorrect reasoning that leads to wrong answers
An AI assistant uses CoT internally to solve a user's problem but only shows the final answer. Why might this be the RIGHT approach?
Chain-of-thought is always hidden to prevent confusion
The reasoning is internal work that adds no user value
The model shouldn't show its work
The user doesn't care about reasoning
What distinguishes a reasoning-bound task from a non-reasoning task?
Reasoning tasks require multiple steps or logic to reach an answer
Reasoning tasks use more tokens
Reasoning tasks are only about math
Reasoning tasks are always longer
A developer tests CoT and finds it improves accuracy but the reasoning traces sometimes contain logical errors. What should they do?
Ignore the errors since accuracy improved
Stop using CoT entirely since it's unreliable
Remove the reasoning traces from the output
Continue using CoT but validate the final answers against ground truth