Statistical Significance and P-Values

The p-value is one of the most abused numbers in research. Here is what it actually says, and what it does not.

28 min · Reviewed 2026

A Number Everyone Quotes, Almost Nobody Understands

You see p < 0.05 in papers and headlines constantly. What does it actually mean? Precisely: if the null hypothesis were true, the probability of seeing a result this extreme or more extreme is less than 5 percent.
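
To make the definition concrete, here is a minimal sketch of an exact one-sided p-value. The scenario is hypothetical and not from the lesson: model B wins 60 of 100 head-to-head comparisons, and the null hypothesis says each comparison is a coin flip (win rate 0.5). The function name `binomial_p_value` is an illustrative choice.

```python
from math import comb


def binomial_p_value(wins: int, trials: int, null_rate: float = 0.5) -> float:
    """One-sided p-value: P(X >= wins) when X ~ Binomial(trials, null_rate).

    This is the probability, if the null were true, of a result at least
    as extreme as the one observed.
    """
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(wins, trials + 1)
    )


# Hypothetical data: 60 wins in 100 comparisons, null win rate of 0.5.
p = binomial_p_value(wins=60, trials=100)
print(f"p = {p:.4f}")
```

With these made-up numbers the p-value comes out just under 0.03. That is a statement about how surprising the data would be under the null, not the probability that the null is true.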

The null hypothesis

The null hypothesis is the boring story. 'Model B is no better than model A.' 'The new prompt does not change user satisfaction.' A low p-value means the boring story would rarely produce data that looks like what you saw.
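
One way to see the boring story in action is a permutation test, sketched below under stated assumptions. If 'Model B is no better than model A' were true, the labels A and B would be interchangeable, so shuffling them shows what differences the null produces on its own. The scores and the function name `permutation_p_value` are hypothetical, and this is one of several valid ways to test the difference.

```python
import random
from statistics import fmean


def permutation_p_value(scores_a, scores_b, n_shuffles=10_000, seed=0):
    """One-sided permutation p-value for 'B scores higher than A'.

    Under the null the A/B labels carry no information, so we shuffle
    the pooled scores and count how often a random relabeling gives a
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = fmean(scores_b) - fmean(scores_a)
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = fmean(pooled[n_a:]) - fmean(pooled[:n_a])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# Hypothetical per-example scores for two models (illustrative only).
model_a = [0.61, 0.58, 0.64, 0.60, 0.57, 0.63, 0.59, 0.62]
model_b = [0.66, 0.63, 0.65, 0.68, 0.61, 0.67, 0.64, 0.69]
print(f"p = {permutation_p_value(model_a, model_b):.4f}")
```

A small value here means the boring story rarely produces a gap this large by relabeling alone.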

Common abuses

P-hacking: running many tests and reporting the significant ones
Garden of forking paths: trying many analyses until something 'works'
Publication bias: significant results get published; non-significant ones do not
Confusing statistical and practical significance
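
The first two abuses above come down to arithmetic, which the sketch below illustrates. It assumes every null is true, the tests are independent, and each p-value is uniform on [0, 1] under the null (as it is for a well-calibrated continuous test). The function names are illustrative.

```python
import random


def family_wise_error(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent true-null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests


def simulated_family_wise_error(n_tests, alpha=0.05, n_studies=20_000, seed=0):
    """Same quantity by simulation: draw uniform p-values, keep the best one."""
    rng = random.Random(seed)
    studies_with_a_hit = 0
    for _ in range(n_studies):
        # Assumption: under a true null each p-value is uniform on [0, 1].
        p_values = [rng.random() for _ in range(n_tests)]
        if min(p_values) < alpha:
            studies_with_a_hit += 1
    return studies_with_a_hit / n_studies


for n in (1, 5, 20):
    print(n, round(family_wise_error(n), 3), round(simulated_family_wise_error(n), 3))
```

Under these assumptions, running 20 tests and reporting only the best one yields a 'significant' result roughly 64 percent of the time even when nothing is going on.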

Phrase heard	What it actually means
'Statistically significant'	P-value below threshold, under one analysis
'Not statistically significant'	Might mean no effect, or might mean not enough data
'Highly significant (p<0.001)'	Data this extreme would be very rare if the null were true, but that is still not proof of a real effect
'Effect size'	The number that actually matters
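
A short sketch of why effect size is the number that matters: the same p-value formula gives very different answers depending on sample size. The assumptions are simplifications, not part of the lesson: a two-sample z-test with a known standard deviation of 1 in both groups, equal group sizes, and an observed standardized difference `effect_size_d`. The function name `two_sided_p` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size_d: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means, known unit variance.

    The standard error of the difference is sqrt(1/n + 1/n), so the
    z statistic is d * sqrt(n / 2).
    """
    z = effect_size_d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A tiny effect with a huge sample versus a large effect with a tiny sample.
tiny_effect_big_sample = two_sided_p(effect_size_d=0.02, n_per_group=100_000)
large_effect_small_sample = two_sided_p(effect_size_d=0.8, n_per_group=5)

print(f"d = 0.02, n = 100000 per group: p = {tiny_effect_big_sample:.2g}")
print(f"d = 0.80, n = 5 per group:      p = {large_effect_small_sample:.2g}")
```

In this sketch the negligible effect clears p < 0.001 easily, while the large effect does not reach p < 0.05. The p-value mostly reflects how much data there is, so it should be reported alongside the effect size, not instead of it.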

The difference between 'significant' and 'not significant' is not itself statistically significant.
— Gelman and Stern (2006)
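
The quoted point can be checked with a few lines of arithmetic. The sketch below uses illustrative numbers (two estimates, 25 and 10, each with standard error 10) and assumes the estimates are independent and approximately normal. The helper name `z_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test(estimate: float, std_error: float) -> float:
    """Two-sided p-value for 'the true value is zero', normal approximation."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Illustrative numbers: two studies with the same standard error.
estimate_1, se_1 = 25.0, 10.0
estimate_2, se_2 = 10.0, 10.0

p_study_1 = z_test(estimate_1, se_1)  # 'significant' at 0.05
p_study_2 = z_test(estimate_2, se_2)  # 'not significant' at 0.05

# Comparing the two studies directly: independent errors add in quadrature.
difference = estimate_1 - estimate_2
se_difference = sqrt(se_1**2 + se_2**2)
p_difference = z_test(difference, se_difference)

print(f"study 1: p = {p_study_1:.3f}")
print(f"study 2: p = {p_study_2:.3f}")
print(f"difference between studies: p = {p_difference:.3f}")
```

One study lands on each side of the 0.05 line, yet the test of their difference is nowhere near significant. The two results are compatible with the same underlying effect.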

The big idea: p-values are one weak piece of evidence, often presented as if they were a verdict. Effect size and replication matter more.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-builders-statistical-significance

What is the core idea behind "Statistical Significance and P-Values"?
1. The p-value is one of the most abused numbers in research. Here is what it actually says, and what it does not.
2. Workshop papers are less selective but often more experimental
3. Training data over-represents English-speaking, internet-active people
4. Before LLMs-as-judges, researchers had hand-made metrics.
Which term best describes a foundational idea in "Statistical Significance and P-Values"?
1. null hypothesis
2. p-value
3. significance
4. effect size
A learner studying Statistical Significance and P-Values would need to understand which concept?
1. p-value
2. significance
3. null hypothesis
4. effect size
Which of these is directly relevant to Statistical Significance and P-Values?
1. p-value
2. null hypothesis
3. effect size
4. significance
Which of the following is a key point about Statistical Significance and P-Values?
1. P-hacking: running many tests and reporting the significant ones
2. Garden of forking paths: trying many analyses until something 'works'
3. Publication bias: significant results get published; non-significant ones do not
4. Confusing statistical and practical significance
Which of these does NOT belong in a discussion of Statistical Significance and P-Values?
1. Workshop papers are less selective but often more experimental
2. P-hacking: running many tests and reporting the significant ones
3. Publication bias: significant results get published; non-significant ones do not
4. Garden of forking paths: trying many analyses until something 'works'
What is the key insight about "What p-value is not" in the context of Statistical Significance and P-Values?
1. Workshop papers are less selective but often more experimental
2. Training data over-represents English-speaking, internet-active people
3. P-value is NOT the probability that the null is true. It is not the probability the effect is real. And p=0.
4. Before LLMs-as-judges, researchers had hand-made metrics.
What is the key insight about "Always report effect size" in the context of Statistical Significance and P-Values?
1. Workshop papers are less selective but often more experimental
2. Training data over-represents English-speaking, internet-active people
3. Before LLMs-as-judges, researchers had hand-made metrics.
4. A tiny effect can be p<0.001 with enough samples. A huge effect can be p=0.3 with 10 samples.
What is the recommended tip about "Build your mental model" in the context of Statistical Significance and P-Values?
1. AI isn't magic — it's pattern recognition at scale. The more you understand how it works, the more effectively you can u…
2. Workshop papers are less selective but often more experimental
3. Training data over-represents English-speaking, internet-active people
4. Before LLMs-as-judges, researchers had hand-made metrics.
Which statement accurately describes an aspect of Statistical Significance and P-Values?
1. Workshop papers are less selective but often more experimental
2. You see p < 0.05 in papers and headlines constantly. What does it actually mean? Precisely: if the null hypothesis were true, the probabilit…
3. Training data over-represents English-speaking, internet-active people
4. Before LLMs-as-judges, researchers had hand-made metrics.
What does working with Statistical Significance and P-Values typically involve?
1. Workshop papers are less selective but often more experimental
2. Training data over-represents English-speaking, internet-active people
3. The null hypothesis is the boring story. 'Model B is no better than model A.' 'The new prompt does not change user satisfaction.'
4. Before LLMs-as-judges, researchers had hand-made metrics.
Which of the following is true about Statistical Significance and P-Values?
1. Workshop papers are less selective but often more experimental
2. Training data over-represents English-speaking, internet-active people
3. Before LLMs-as-judges, researchers had hand-made metrics.
4. The big idea: p-values are one weak piece of evidence, often presented as if they were a verdict. Effect size and replication matter more.
Which best describes the scope of "Statistical Significance and P-Values"?
1. It focuses on the p-value, one of the most abused numbers in research: what it actually says and what it does not
2. It is unrelated to foundations workflows
3. It applies only to the opposite beginner tier
4. It was deprecated in 2024 and no longer relevant
Which section heading best belongs in a lesson about Statistical Significance and P-Values?
1. Workshop papers are less selective but often more experimental
2. The null hypothesis
3. Training data over-represents English-speaking, internet-active people
4. Before LLMs-as-judges, researchers had hand-made metrics.
Which section heading best belongs in a lesson about Statistical Significance and P-Values?
1. Workshop papers are less selective but often more experimental
2. Training data over-represents English-speaking, internet-active people
3. Common abuses
4. Before LLMs-as-judges, researchers had hand-made metrics.
