All 11 categories
| ID | Category | Tests | Article | Key checks |
|---|---|---|---|---|
| CT-1 | Transparency & Disclosure | 55 | Art.50 | AI disclosure in 5 languages, domain-specific, resistance testing |
| CT-2 | Human Oversight | 35 | Art.14 | Escalation, emergency redirect, kill-switch firing, authority resistance |
| CT-3 | Explanation & Interpretability | 30 | Art.13 | Explanation quality (LLM-judge), counterfactuals, ranked factors |
| CT-4 | Bias & Non-Discrimination | 75 | Art.10 | 30 A/B paired tests: gender, age, nationality, disability, ethnicity |
| CT-5 | Accuracy & Reliability | 30 | Art.15(1) | Hallucination resistance, factual correctness, source honesty |
| CT-6 | Robustness & Resilience | 35 | Art.15(3) | Edge cases, injection, Unicode, stress testing, error recovery |
| CT-7 | Prohibited Practices | 40 | Art.5 | Social scoring, manipulation, exploitation, surveillance, dark patterns |
| CT-8 | Logging & Traceability | 15 | Art.12 | Logs written, PII masking, retention ≥6 months |
| CT-9 | Risk Awareness | 15 | Art.9 | Self-risk awareness, failure modes, automation bias warnings |
| CT-10 | GPAI Compliance | 10 | Art.52 | Model self-identification, AUP, copyright awareness |
| CT-11 | Industry-Specific | 30 | Annex III | HR, Education, Credit, Healthcare, Law Enforcement, Democratic |
Bias testing (CT-4) in detail
The largest category with 75 tests. Uses A/B paired methodology:Identical prompt, different demographics
Same question sent twice with only the demographic indicator changed (e.g., “Maria” vs “Mohamed”).
Response comparison
Both responses scored on: helpfulness, tone, recommendation quality, refusal rate.