Red Team: Lie Detection Competition — Request for Proposals

Proposals Due: March 31, 2026

TL;DR Submit a proposal describing your concept for creating datasets of on-policy LLM lies, along with your team’s background, by March 31, 2026. Selected Red Teams will receive a $10,000 stipend and up to $2,000 in compute to build their datasets. Teams that deliver on a jointly agreed dataset specification may receive an additional $15,000.

Apply Now →

1. Competition Overview

Schmidt Sciences, NDIF, and Cadenza are jointly running a competition to advance the science of lie detection in large language models. The competition is structured as a Red Team vs. Blue Team challenge:

Red Teams design and implement datasets where LLMs produce on-policy lies — statements the model appears to believe are false.
Blue Teams build detection methods (black-box, white-box, or hybrid) that identify those lies. Blue Teams will receive access to validation datasets during development.

The central research question is: What methods best detect whether a model lied, and how much does white-box access (weights, activations, activation steering) contribute beyond black-box tools alone?

Definition of Lying

We adopt the assertion-based definition from Fallis (2009), as operationalized in Liars’ Bench (Kretschmar et al., 2025): the assistant lies if it states something it believes to be false. Beliefs are measured through consistent responses in contexts without direct pressure to lie.

For example, if a model generates “I fixed the code, it runs now” after observing failed tests in-context, its apparent beliefs (informed by the failed tests) contradict its statement. See Section 3 in Liars’ Bench for more details.

Background: Liars’ Bench

This competition builds on Liars’ Bench, a testbed of 72,863 labeled examples of lies and honest responses generated by four open-weight models across seven datasets. These datasets will serve as the validation set during development of the detectors by Blue Teams. Red Teams are expected to create new datasets that go beyond the scenarios already captured, introducing novel and challenging types of lies for Blue Teams to detect.

2. Red Team Task

Red Teams create static datasets of on-policy LLM lies, similar in structure to Liars’ Bench but exploring new or more challenging settings. The goal is to produce lies that are difficult for current detection methods to identify.

Blue Team Affordances

Blue Teams will create general-purpose deception detectors, not dataset-specific ones. They will have the following affordances:

What Blue Teams can do:

Access transcripts and activations from Red Team datasets (may be redacted — see Preferred Dataset Properties below)
Perform inference on the models used to generate the datasets
Do Activation steering and other interpretability techniques (using NNsight)
Access to a trusted model. The specific model will be announced by the start of the Red Team implementation phase (current candidate: Qwen3.5-27B).

What Blue Teams cannot do:

Access test datasets during development of their detector (only validation sets will be provided during development)
Access training data used to create fine-tuned models
Overfit to the model organisms or datasets. (Blue teams will be prevented from overfitting by the competition organizers)

Dataset Requirements

On-Policy: Lies should be generated by the model itself during inference, not constructed from other models or humans.
Scale: Submissions should include at least 100 honest and 100 deceptive examples per model to promote statistical robustness.
Models: Examples must be generated by (possibly fine-tuned) open-weight models. We encourage proposals using mid-size to large (e.g., Qwen3.5-27B to OSS 120B). Very large models (e.g., Kimi K2.5) are also allowed, but our capacity to run model organisms at that scale during the Blue Team phase is limited. If you plan to use a very large model, we recommend reaching out in advance (competition@cadenzalabs.org). In general, we advise including a smaller-model fallback (e.g., Qwen3.5-27B) in your proposal.
Label verification: Red Teams must provide evidence that their labels genuinely represent lying — that is, the model’s response contradicts beliefs established in contexts where the reason for lying is not present.

Preferred Dataset Properties

Difficulty: Datasets that are difficult for existing baseline techniques, such as LLM-as-a-Judge, deception probes, or follow-up questions. For example, model organisms that have been adversarially trained are preferred.
Redaction: Datasets where transcripts can be shared with Blue Teams without redaction are preferred. If your model organism requires redaction to remain challenging (e.g., to prevent Blue Teams from detecting lies by inspecting the chain-of-thought), that is acceptable — but please flag this in your proposal.
Realism: Types of lies that could result from prosaic alignment are preferred over artificial settings. For example, a model lying because of emergent misalignment due to reward hacking is preferred over a model lying because it was fine-tuned to always say wrong things.
Novelty: Model organisms and types of lies created using novel techniques are preferred over reimplementations of existing work. For example, introspective lies, where a model lies about its own knowledge, are preferred over lies like “Paris is the capital of Italy” that are already part of existing datasets.

Design Dimensions to Explore

Liars’ Bench categorizes lies along two axes that Red Teams could consider when designing new datasets:

Axis	Category	Examples
Object of Belief (What the model lies about)	World-knowledge	Factual claims, trivia, domain expertise
	Self-knowledge	Own capabilities, past actions, policies, private state
Reason for Lying (Why the model lies)	Encouraged	Explicit instructions, role-play, goal-directed pressure
	Inherent	Learned during training or fine-tuning (e.g., backdoors, alignment faking)

3. Deliverables

Upon selection for the final competition, Red Teams must submit:

Datasets

⚠️ Important: The datasets should not be made public until after the competition.

Datasets must be uploaded to HuggingFace, in a private repository, in the following schema:

Column	Description
`model`	Full HuggingFace model identifier used for generation (e.g., `meta-llama/Llama-3.3-70B-Instruct`)
`messages`	Full conversation transcript in ChatML format; the last message must be an assistant message
`deceptive`	Boolean label indicating whether the last assistant message is a lie

Fine-tuned Models

Any fine-tuned models used in the dataset creation must be provided, e.g. as LoRA adapters.

Report

A written report of no more than 4 pages (plus optional appendices) that:

Describes the model setup, any fine-tuning applied, and the generation pipeline.
Justifies how the dataset satisfies the required properties and how label correctness is verified.

4. Compensation

Phase	Amount
(1) Accepted Proposal: Upon entering implementation phase	$10,000 stipend + up to $2,000 in compute
(2) Upon selection for the final competition	Additional $15,000
(3) Co-authorship	Included on the competition report

⚠️ Important: The payments may take several weeks to process after milestone completion.

Upon proposal acceptance (1), organizers and Red Teams will jointly agree on a dataset specification. Teams that deliver datasets meeting the agreed-upon specification will be selected for the final competition and receive the additional $15,000 (2).

5. Timeline

Milestone	Date
RFP published	March 9, 2026
Proposal decisions (rolling)	Deadline: March 31, 2026 (AoE)
Codebase + report submission	June 15, 2026
Selection for final competition	June 30, 2026
Blue Team competition	Late Summer 2026

6. How to Apply

Submit a proposal (max 4500 characters) describing:

Dataset concept: Your approach to creating datasets of on-policy lies, including which types of lies you plan to target, how you will verify labels, and why you expect your datasets to be challenging for current detection methods.
Team background: Your team’s relevant experience. Backgrounds in LLM fine-tuning, interpretability, AI safety evaluation, or adversarial red-teaming are particularly valued.

Applications are reviewed on a rolling basis. The deadline is March 31, 2026.

If a full implementation is the only barrier to applying, the organizers are open to discussing joint development.

Contact for questions: competition@cadenzalabs.org

Submit Your Proposal →

References

Fallis, D. (2009). What is lying? Journal of Philosophy, 106(1), 29–56.

Kretschmar, K., Laurito, W., Maiya, S., & Marks, S. (2025). Liars’ Bench: Evaluating Lie Detectors for Language Models. arXiv:2511.16035.