Tendril

In the wild · 115 practitioners · real posts

Here’s what’s actually happening on X.

Real posts · verified first

Real posts from people who actually shipped this. Verified entries first, with attribution.

0 confirmed links · 3 cited from news · 112 representative

Showing 2 of 115 posts · Every entry links back to X

Arvind Narayanan@random_walker

Professor, Princeton; co-author AI Snake Oil

Representative

Evaluating AI product claims · 11 months ago · Claim stress-testing

Representative of Arvind's threads on stress-testing AI vendor claims — asking for the specific benchmark, the baseline, the sample size, and whether the evaluation was done on private data.

“Every capability claim is three questions away from collapse. Ask the three questions.”

How to replicate

1. For any capability claim in a pitch or paper, write out the exact metric being reported.
2. Ask: what was the baseline? On what data? Sample size?
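To see why the sample-size question matters, here is a minimal sketch that puts a Wilson score interval around a reported accuracy. The counts are hypothetical, and comparing intervals for overlap is only a rough screen, not a formal significance test.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical claim: vendor model 88/100 correct, baseline 82/100 correct.
vendor = wilson_interval(88, 100)
baseline = wilson_interval(82, 100)
print(f"vendor:   {vendor[0]:.3f} to {vendor[1]:.3f}")
print(f"baseline: {baseline[0]:.3f} to {baseline[1]:.3f}")
print("intervals overlap:", intervals_overlap(vendor, baseline))
```

With only 100 examples per model, a six-point gap leaves the two intervals overlapping, which is the kind of detail the sample-size question is meant to surface.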

Prompt template

I have this AI product claim: <paste>. Draft a 5-question email to the vendor that isolates: (1) the exact metric, (2) the baseline comparison, (3) the evaluation dataset and whether it's public, (4) sample size and variance, (5) reproduction conditions. Keep the tone polite but non-negotiable. Do not accept marketing language in the reply.

Pitfall

Accepting 'our model outperforms GPT-4' without asking 'on what.' Models outperform each other on narrow slices all the time; the slice is the whole claim.

What you'll learn

• Which questions collapse most capability claims
• Why benchmark contamination is the default assumption until disproven
• How to read vendor PDFs like a reviewer
• When to trust a claim despite incomplete evidence

Glossary: benchmark, hallucination, rag

#claims #evals #hype

Jan Leike@janleike

Alignment researcher; Anthropic

Representative

Scalable oversight in practice · 1.2 years ago · Critic-model oversight

Representative of Jan's threads on how alignment teams use weaker models to critique the outputs of stronger models — catching errors a human reviewer would miss because of volume.

“If you can't oversee it, you can't align it. Oversight scales or it doesn't.”

How to replicate

1. Pick a task where the strong model's outputs are too numerous to human-review.
2. Define a rubric of failure modes (factual, unsafe, off-spec).

Prompt template

You are a critic model. Read the output below produced by a stronger model for this task: <task>. Score it against this rubric: <list of failure modes>. For each mode, return pass/fail and quote the exact span that triggered the fail. Do not judge style — only the listed modes. If in doubt, mark fail.
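The template above can be filled programmatically. The sketch below is one possible shape, not a reference implementation: `Verdict`, `build_critic_prompt`, and `needs_human_review` are hypothetical names, the model call itself is left out, and how the critic's reply is parsed into verdicts is assumed to happen elsewhere.

```python
from dataclasses import dataclass

# Failure modes taken from the rubric step above.
FAILURE_MODES = ["factual", "unsafe", "off-spec"]


@dataclass
class Verdict:
    mode: str
    passed: bool
    span: str = ""  # quoted span that triggered a fail; empty on pass


def build_critic_prompt(task: str, output: str, modes: list[str]) -> str:
    """Fill the critic template with a task, a rubric, and the output to score."""
    rubric = "\n".join(f"- {m}" for m in modes)
    return (
        "You are a critic model. Read the output below produced by a stronger "
        f"model for this task: {task}.\n"
        f"Score it against this rubric:\n{rubric}\n"
        "For each mode, return pass/fail and quote the exact span that "
        "triggered the fail. Do not judge style; only the listed modes. "
        "If in doubt, mark fail.\n\n"
        f"Output:\n{output}"
    )


def needs_human_review(verdicts: list[Verdict], modes: list[str]) -> bool:
    """Route to a human if any mode failed or any mode has no verdict at all."""
    seen = {v.mode for v in verdicts}
    missing = any(m not in seen for m in modes)
    return missing or any(not v.passed for v in verdicts)


prompt = build_critic_prompt("summarise the report", "draft text", FAILURE_MODES)
print(prompt)
```

Treating a missing verdict as a fail mirrors the template's "if in doubt, mark fail" rule: an incomplete critic reply is sent to a human rather than waved through.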

Pitfall

Treating the critic as the ground truth. The critic is a filter for human attention, not a replacement for it — measure the critic against humans, not the other way around.
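One way to measure the critic against humans is to have reviewers label a sample of outputs and treat those labels as the reference. The sketch below assumes boolean "flagged" labels and uses made-up data; `critic_quality` is a hypothetical helper.

```python
def critic_quality(critic_flags: list[bool], human_flags: list[bool]) -> dict[str, float]:
    """Precision and recall of critic flags, with human labels as the reference."""
    if len(critic_flags) != len(human_flags):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(critic_flags, human_flags))
    tp = sum(1 for c, h in pairs if c and h)        # critic and human both flag
    fp = sum(1 for c, h in pairs if c and not h)    # critic flags, human does not
    fn = sum(1 for c, h in pairs if h and not c)    # human flags, critic misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical labels for six outputs (True means "flagged as a failure").
critic = [True, True, False, False, True, False]
human = [True, False, False, True, True, False]
result = critic_quality(critic, human)
print(result)
```

Recall shows how many human-identified failures the critic catches; precision shows how much reviewer attention it spends on false alarms.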

What you'll learn

• Why oversight is the rate-limiter on deploying strong models
• How a weaker critic can provide useful signal on a stronger model
• How to measure critic quality instead of assuming it
• Where scalable oversight research is still open

Glossary: rlhf, hallucination, system prompt

#alignment #oversight #evals

About this page

Curated examples of practitioners using AI in the wild — builders, sellers, ops leads, teachers, artists. Every entry links to the original post. When we can’t verify a specific URL, we flag the entry as representative rather than inventing one.

Tendril keeps this list conservative on purpose. Entries marked representative describe patterns we’ve seen practitioners discuss publicly on X but whose exact URLs we haven’t individually verified. We’d rather flag an example as representative than fabricate a precise post-ID. If you see a wrong attribution or a broken link, tell us and we’ll fix it within the day.
