Audit Methodology: How to Check a Dataset
A data audit is a structured process for finding bias, errors, and ethical issues before a model goes live. Every creator should know how to run one.
Adults & Professionals · AI Foundations · ~21 min read
Not Optional
Data audits have gone from a nice-to-have to a legal requirement in many jurisdictions. The EU AI Act (in force from 2024), New York City's AEDT law (2023), and various sectoral rules require documented audits for high-risk systems. Beyond legal compliance, audits save companies from shipping embarrassing failures.
A six-step audit process
1. Scope the audit: what question are we trying to answer?
2. Profile the data: summary statistics per column, per group
3. Test for bias: disaggregated metrics across protected attributes
4. Probe for edge cases: long-tail inputs, adversarial tests
5. Document findings: data card, audit report, known limitations
6. Plan remediation: what to fix, what to defer, what to communicate
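Step 3, the heart of the process, can be sketched in a few lines of plain Python: compute the same metric per group instead of one overall number, then compare. The group names, decisions, and the four-fifths (0.8) screening threshold below are illustrative assumptions, not part of any standard audit format.

```python
from collections import defaultdict

# Toy predictions: (group, model_decision) pairs -- hypothetical data
records = [
    ('group_a', 1), ('group_a', 1), ('group_a', 0), ('group_a', 1),
    ('group_b', 1), ('group_b', 0), ('group_b', 0), ('group_b', 0),
]

# Disaggregate the selection rate (share of positive decisions) per group
totals, positives = defaultdict(int), defaultdict(int)
for group, decision in records:
    totals[group] += 1
    positives[group] += decision

rates = {g: positives[g] / totals[g] for g in totals}
print(rates)  # {'group_a': 0.75, 'group_b': 0.25}

# A common screening heuristic (the "four-fifths rule") flags a disparity
# when the lowest group's rate is under 80% of the highest group's rate.
ratio = min(rates.values()) / max(rates.values())
print('disparate impact ratio:', ratio)  # 0.25 / 0.75 ~= 0.33 -> flag
```

The same pattern generalizes to any metric: swap the selection rate for accuracy or false-positive rate and the per-group comparison logic stays the same, which is exactly what Fairlearn's MetricFrame (shown below) automates.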
Tools of the trade
- pandas-profiling / ydata-profiling: automatic statistical summary
- Aequitas: bias audit tool from University of Chicago
- Fairlearn: Microsoft's fairness assessment library
- What-If Tool: Google's interactive fairness explorer
- Model Card Toolkit: structured reporting for models and data
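As a minimal sketch of what the profiling tools above automate, per-column and per-group summaries can also be produced directly with pandas. The column names and values here are made-up assumptions for illustration:

```python
import pandas as pd

# Hypothetical dataset with a sensitive attribute and a binary label
df = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'M'],
    'income': [40_000, 52_000, 61_000, 58_000, 47_000],
    'label':  [1, 0, 1, 1, 0],
})

# Per-column summary -- roughly what ydata-profiling expands into
# a full HTML report with distributions and missing-value counts
print(df.describe(include='all'))

# Per-group summary: base rates and counts by sensitive attribute
by_group = df.groupby('gender')['label'].agg(['mean', 'count'])
print(by_group)
```

Dedicated tools add a lot on top of this (correlations, drift checks, interaction plots), but the core move is the same: never report a statistic for the whole dataset without also reporting it per subgroup.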
A concrete audit snippet
Disaggregated metrics with Fairlearn
import pandas as pd
from fairlearn.metrics import MetricFrame, selection_rate, false_positive_rate
from sklearn.metrics import accuracy_score

# Required columns: y_true, y_pred, gender, race
df = pd.read_csv('loan_predictions.csv')

metrics = {
    'accuracy': accuracy_score,
    'selection_rate': selection_rate,
    'false_positive_rate': false_positive_rate,
}

mf = MetricFrame(
    metrics=metrics,
    y_true=df['y_true'],
    y_pred=df['y_pred'],
    sensitive_features=df[['gender', 'race']],
)

# Per-group table: one row per (gender, race) combination
print(mf.by_group)

# Largest between-group gap, reported per metric
print('Max between-group gap per metric:')
print(mf.difference(method='between_groups'))
What to include in an audit report
- Scope and limitations (what was NOT checked)
- Data provenance: how it was collected
- Summary statistics, including per-subgroup
- Bias test results: which tests, which metrics, which thresholds
- Identified harms: concrete scenarios that could go wrong
- Remediation status: fixed, deferred, or accepted risk
- Reviewer sign-offs
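One lightweight way to keep those sections consistent across audits is a structured template. This dict-based skeleton is a sketch under assumed field names, not a mandated format; the sample values are purely illustrative:

```python
# Minimal audit-report skeleton mirroring the sections above.
# All field names and sample values are hypothetical.
audit_report = {
    'scope': 'loan approval model, adult applicants, 2023 data',
    'not_checked': ['intersectional subgroups below n=50'],
    'provenance': 'exported from loan platform, 2024-01 snapshot',
    'summary_stats': {},          # filled in by the profiling step
    'bias_tests': [
        {'metric': 'selection_rate', 'threshold': 0.8, 'passed': False},
    ],
    'identified_harms': ['qualified applicants in one group denied'],
    'remediation': {'status': 'deferred', 'owner': 'ml-platform team'},
    'sign_offs': [],              # reviewers append names here
}

# Completeness check before the report ships: every required section
# must at least exist, even if its content is "nothing found".
required = {'scope', 'not_checked', 'provenance', 'summary_stats',
            'bias_tests', 'identified_harms', 'remediation', 'sign_offs'}
missing = required - audit_report.keys()
print('missing sections:', missing or 'none')
```

Tools like the Model Card Toolkit listed earlier serve the same purpose with a richer, standardized schema; the point is that the report structure is fixed before the audit starts, so "we didn't check that" is recorded rather than silently omitted.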
The big idea: audits make the invisible visible. They are the hygiene step between cleverness and responsibility. Any serious ML deployment should ship with one.
