Sparse Autoencoders Explained
Neural networks mix many concepts into each neuron. Sparse autoencoders pull them apart into human-readable features. This is the workhorse of modern interpretability.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Superposition Problem

Concept cluster
Terms to connect while reading
- sparse autoencoder
- superposition
- dictionary learning
Section 1
The Superposition Problem
A neural network represents far more concepts than it has neurons. A single neuron in GPT-2 might fire for 'this is a legal document,' 'the Golden Gate Bridge,' and 'word ending in -ing' all at once. This is superposition: the network packs many features into shared directions in activation space. If you want to see the features, you need to unmix them.
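A toy numerical illustration of the idea (a hypothetical sketch, not taken from any real model): squeeze three feature directions into a two-dimensional activation space and a single read-out neuron ends up responding to all three.

```python
import numpy as np

# Hypothetical toy example: 3 feature directions packed into a 2-dimensional
# activation space. With more features than dimensions they cannot all be
# orthogonal, so one "neuron" (a read-out direction) overlaps with several.
features = np.array([
    [1.0, 0.0],       # feature A, e.g. "this is a legal document"
    [-0.5, 0.866],    # feature B, e.g. "the Golden Gate Bridge"
    [-0.5, -0.866],   # feature C, e.g. "word ending in -ing"
])

neuron = np.array([1.0, 0.2])
neuron /= np.linalg.norm(neuron)  # a single neuron's read-out direction

for name, f in zip("ABC", features):
    print(f"feature {name} -> neuron response {f @ neuron:+.2f}")
# All three responses are nonzero: the neuron is polysemantic, and the
# underlying features can only be recovered by unmixing the shared directions.
```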
How a sparse autoencoder works
1. Take activations from a specific layer of a trained model
2. Pass them through an encoder into a much larger hidden space (often 8-64x wider)
3. Force the hidden activations to be sparse: only a few fire per input
4. Decode back to the original activations and minimize reconstruction loss
5. Train over billions of tokens, then interpret the features the SAE learned (a minimal code sketch of these steps follows this list)
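A minimal sketch of that recipe in PyTorch. The class and variable names, the ReLU encoder, and the L1 sparsity penalty are illustrative assumptions; real SAEs differ in activation function, sparsity mechanism, and training scale.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct model activations through a wide, sparsely-active hidden layer."""
    def __init__(self, d_model: int, expansion: int = 16):
        super().__init__()
        d_hidden = d_model * expansion                 # step 2: much wider hidden space
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))         # feature activations (mostly zero once trained)
        recon = self.decoder(feats)                    # step 4: map back to the original activation space
        return recon, feats

# One illustrative training step on a batch of activations from one layer.
sae = SparseAutoencoder(d_model=768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                        # step 3: sparsity pressure via an assumed L1 penalty

acts = torch.randn(4096, 768)                          # stand-in for real layer activations (step 1)
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
opt.step()
```

After training, interpreting a feature (step 5) usually means collecting the inputs on which it activates most strongly and reading off what they have in common.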
What you can do with features
- Identify safety-relevant concepts like deception, manipulation, bias
- Steer model behavior by amplifying or suppressing specific features (see the sketch after this list)
- Debug failures by checking which features fired
- Detect distribution shift by watching feature activation patterns
- Audit for concepts a model 'should not' have learned
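For the steering item above, one common approach is to add or subtract a feature's decoder direction in the activations. This sketch continues the SparseAutoencoder example from earlier; the feature index and strength are arbitrary assumptions.

```python
def steer(acts: torch.Tensor, sae: SparseAutoencoder, feature_idx: int, scale: float) -> torch.Tensor:
    """Amplify (scale > 0) or suppress (scale < 0) one learned feature in a batch of activations."""
    direction = sae.decoder.weight[:, feature_idx]       # that feature's direction in activation space
    return acts + scale * direction

steered = steer(acts, sae, feature_idx=123, scale=4.0)   # hypothetical feature index and strength
```

Debugging which features fired is the same forward pass in reverse order of concern: run the activations through the SAE and inspect the nonzero entries of `feats` for the inputs of interest.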
“We have gone from not knowing what was inside these models to a crude map. The map is wrong in places, but it is a map.”
The big idea: SAEs turned interpretability from stamp-collecting into an engineering discipline. They are the main reason interpretability had a breakout year in 2024.
Related lessons
Keep going
Creators · 55 min
Mechanistic Interpretability: Reading the Model's Mind
Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
Creators · 50 min
AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
Creators · 40 min
Jailbreak Case Studies: What Actually Broke
Abstract jailbreak theory is less useful than real cases. Here are the techniques that worked on production models, what they taught us, and what is still unsolved.
