Feature Discovery in LLMs
A feature is a direction in activation space that corresponds to a concept. Finding them — naming them, ranking them, connecting them — is one of the central activities of interpretability research.
Lesson map
What this lesson covers:
1. What Counts as a Feature

Concept cluster
Terms to connect while reading: feature, activation, interpretability
Section 1
What Counts as a Feature
Informally, a feature is anything the network appears to 'represent' in its activations. More formally, it's a direction in activation space whose magnitude correlates with the presence of some input property — a word, a concept, a category, a sentiment.
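To make the definition concrete, here is a toy sketch in Python with NumPy (all names and sizes below are illustrative assumptions, not tied to any particular model): a feature is just a unit vector, and "how strongly it fires" on an activation is the projection onto that vector.

```python
import numpy as np

# Toy illustration of "a feature is a direction in activation space".
d_model = 8                                    # hypothetical hidden size
feature_direction = np.random.randn(d_model)   # stand-in for a discovered feature
feature_direction /= np.linalg.norm(feature_direction)  # unit-normalize the direction

activation = np.random.randn(d_model)          # one token's activation vector
feature_score = activation @ feature_direction # projection = how strongly the feature fires
# For a real feature, this score should be high on inputs that contain the
# concept and near zero on inputs that do not.
```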
Ways researchers find them
1. Activation maximization: find inputs that make a neuron or direction fire hardest
2. Sparse autoencoders: learn a dictionary of candidate features from activations (sketched in code after this list)
3. Probing: train a small classifier to detect a concept from activations
4. Difference in means: compare mean activations across two labeled sets (sketched in code after this list)
5. Steering vectors: find a direction that, when added to activations, changes output in a specific way (an intervention sketch appears after the quote below)
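A minimal sketch of the difference-in-means method: collect activations on inputs that do and do not show the concept, and take the difference of the class means as the candidate direction. The array names and shapes are assumptions for illustration, not any specific library's API.

```python
import numpy as np

def difference_in_means_direction(acts_with: np.ndarray, acts_without: np.ndarray) -> np.ndarray:
    """Candidate feature direction from two labeled activation sets.

    acts_with / acts_without: shape (n_examples, d_model), activations collected
    on inputs that do / do not exhibit the concept of interest.
    """
    direction = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit-normalize for comparability

# Usage sketch: score held-out activations by projecting onto the direction;
# a good candidate feature separates the two labeled sets cleanly.
# scores = held_out_acts @ direction
```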
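And a minimal sparse-autoencoder sketch in PyTorch. The sizes, the `activation_batches` iterator, and the L1 coefficient are placeholder assumptions; real SAE training pipelines are much larger, but the core objective is the same: reconstruct activations through a sparse, overcomplete bottleneck.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Learns an overcomplete dictionary of candidate feature directions."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))   # sparse codes: which features are active
        recon = self.decoder(codes)              # reconstruction from those features
        return recon, codes

def sae_loss(recon, acts, codes, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes the codes toward sparsity.
    return ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()

# Training sketch; `activation_batches` is a hypothetical iterator over
# (batch, d_model) activations collected from the model being studied.
# sae = SparseAutoencoder(d_model=4096, d_dict=32768)
# opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
# for acts in activation_batches:
#     recon, codes = sae(acts)
#     loss = sae_loss(recon, acts, codes)
#     opt.zero_grad(); loss.backward(); opt.step()
# Each decoder column sae.decoder.weight[:, i] is one candidate feature direction.
```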
Named features in the wild (as of ~2024-2025)
- Truthfulness and deception directions
- Refusal features that activate before a safety refusal
- Sycophancy features tied to agreeing with users
- Entity-tracking features for names and pronouns
- Code-vulnerability features
- Sandbagging-related features
“Every feature we name is a hypothesis. The test is whether intervening on it does what we expect.”
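The intervention test the quote describes can be run as a steering-vector edit: add a scaled copy of the candidate direction to a layer's activations and check whether behavior shifts the way the feature's name predicts. The sketch below uses a PyTorch forward hook; the layer index, scale, and model interface are assumptions for illustration, not a specific model's API.

```python
import torch

def add_steering_hook(module: torch.nn.Module, direction: torch.Tensor, scale: float):
    """Add `scale * direction` to a layer's output on every forward pass.

    Assumes the module's output is a (batch, seq, d_model) tensor, or a tuple
    whose first element is such a tensor (common in transformer blocks).
    """
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * direction,) + output[1:]
        return output + scale * direction
    return module.register_forward_hook(hook)

# Usage sketch with a hypothetical decoder-only model:
# handle = add_steering_hook(model.layers[15], refusal_direction, scale=4.0)
# output = model.generate(prompt)   # does the model now refuse more, or less?
# handle.remove()                   # always detach the hook afterwards
```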
The big idea: features are the unit of analysis in modern interpretability. Learning to read them is the closest thing the field has to learning to read code.
Related lessons
- Mechanistic Interpretability: Reading the Model's Mind (Creators · 55 min): Sparse autoencoders, features, circuits. How researchers try to see what a model actually thinks, and why it may be the most strategically important safety work.
- Circuits in Neural Networks (Builders · 28 min): A circuit is a small sub-network inside a big model that implements one specific behavior. Finding circuits is how researchers prove how a model does what it does.
- AI Alignment: The Actual Technical Problem (Creators · 50 min): Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
