Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, Claire Cardie
Cornell University
$^*$ Equal contribution
✨ TL;DR: We introduce a novel framework that trains two nearly identical models adversarially, enabling them to co-evolve through iterative competition without any external supervision. We further enhance training using pre-training corpora and complementary training paradigms. Our method boosts average performance by ~13 points on Qwen3-1.7B-Base and ~16 points on Qwen3-4B-Base on math. 🌐 Website, 👨💻 Github, 🤗 HF Model, 📈 Wandb Logs

Training dynamics of the Solver's performance across training steps, using Qwen3-1.7B evaluated on MATH-500. The Solver's overall accuracy improves from 45% to 67% without any supervision: it surpasses the base model before step 20 and reaches its best performance of 67% at step 360. Importantly, PasoDoble sustains improvements for hundreds of update steps, showing a much stronger scaling capacity than the related work R-Zero [2].
Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet they still rely heavily on external supervision (e.g., curated labels). Self-play offers a promising alternative in which models iteratively learn from themselves, reducing that reliance. GANs [1] suggest a potentially even more compelling training paradigm: train two models adversarially, one dedicated to generating challenging tasks or adversarial examples, the other to solving them. This raises a natural question: can LLMs also be trained like GANs? The hope is that the specialized role of each model fosters sustained competition and mutual evolution, enabling the pair to solve tasks that a single model may be fundamentally unable to handle.
In this paper, we introduce PasoDoble, a novel GAN-style training framework for LLMs. PasoDoble adversarially trains two nearly identical models: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We further enrich training by leveraging high-quality math pre-training corpora and by introducing an offline training paradigm that mitigates potential training instability. Notably, PasoDoble operates without any supervision during training.
| Model | AIME 2024 | AIME 2025 | AMC | GSM8K | MATH 500 | Olympiad Bench | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | 2.22 | 1.67 | 19.58 | 74.54 | 48.73 | 14.32 | 26.84 |
| + PasoDoble | 7.22 | 7.22 | 40.83 | 84.98 | 68.50 | 28.79 | 39.59 (+12.75) |
| Qwen3-4B-Base | 6.11 | 2.78 | 33.33 | 84.07 | 61.37 | 23.98 | 35.27 |
| + PasoDoble | 18.89 | 18.89 | 53.33 | 91.82 | 82.17 | 42.27 | 51.23 (+15.96) |
We sample six responses for each problem and report pass@1 accuracy. The base models are evaluated with 4-shot prompting following the Qwen technical report [3]; all other models are evaluated with 0-shot prompting.
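For concreteness, here is a minimal sketch of pass@1 as used above, averaging per-sample correctness over the six samples per problem; the function name and inputs are illustrative, not taken from our evaluation code.

```python
# Minimal illustrative sketch: pass@1 estimated from k = 6 samples per problem.
def pass_at_1(correct_counts, k=6):
    """correct_counts[i] = number of the k sampled responses that were correct for problem i."""
    return sum(c / k for c in correct_counts) / len(correct_counts)

# Example: three problems where 6, 3, and 0 of the 6 samples were correct -> 0.5 (50%).
print(pass_at_1([6, 3, 0]))
```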
We find that PasoDoble boosts the average performance by ~13 points on Qwen3-1.7B-Base and ~16 points on Qwen3-4B-Base on math without any supervision.

PasoDoble comprises four components: the Proposer $\pi_p$, the Solver $\pi_s$, the Knowledge Base $\mathcal{K}$, and, for offline training, a Question Buffer. Both the Proposer and the Solver are initialized from the same pretrained model and then undergo an initial cold-start phase.
In online training, at each iteration, a knowledge piece is sampled from the Knowledge Base (1) to prompt the Proposer to generate question–answer (QA) pairs (2), which the Solver then attempts to solve with multiple solutions (3-4). The Solver receives a correctness reward based on agreement with the Proposer’s answer (5). To assess question difficulty, we compute the Solver’s accuracy per question (6) and define the Proposer’s difficulty reward inversely with this accuracy (7), while a diversity reward encourages novel questions (8). These rewards are combined to yield the Proposer’s final reward (9). Only valid questions with non-trivial difficulty are retained for Solver training (10). Both models are updated synchronously whenever at least one valid question is available (11), forming an online training loop.
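The sketch below walks through one online iteration in Python to make the numbered steps concrete. All helper names (`sample`, `generate_qa`, `solve`, `agrees`, `novelty`, `is_valid`, `update`) are hypothetical placeholders, and the specific reward combination and validity check are assumptions rather than the exact implementation.

```python
# Illustrative sketch of one online PasoDoble iteration (helper names are hypothetical).
def online_step(proposer, solver, knowledge_base, n_solutions=8):
    knowledge = knowledge_base.sample()                       # (1) sample a knowledge piece
    qa_pairs = proposer.generate_qa(knowledge)                # (2) Proposer emits QA pairs

    valid_questions, proposer_rewards, solver_rewards = [], [], []
    for question, proposed_answer in qa_pairs:
        solutions = solver.solve(question, n=n_solutions)     # (3-4) multiple Solver solutions
        correct = [agrees(s, proposed_answer) for s in solutions]
        solver_rewards.append(correct)                        # (5) correctness vs. Proposer's answer
        accuracy = sum(correct) / n_solutions                 # (6) per-question Solver accuracy

        difficulty = 1.0 - accuracy                           # (7) reward inversely tied to accuracy
        diversity = novelty(question)                         # (8) reward for novel questions
        proposer_rewards.append(difficulty + diversity)       # (9) combined Proposer reward (a simple sum here)

        if is_valid(question, proposed_answer) and 0.0 < accuracy < 1.0:
            valid_questions.append((question, proposed_answer))  # (10) keep valid, non-trivial questions

    if valid_questions:                                       # (11) synchronous update when possible
        proposer.update(qa_pairs, proposer_rewards)
        solver.update(valid_questions, solver_rewards)
```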
In offline training, the Proposer is first updated for several steps (11) while the Solver is frozen, and the valid questions it produces are stored in a Question Buffer (12). The Proposer is then frozen, and the Solver is trained on the buffered questions (13), which serve as its training dataset.
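Under the same assumptions, the offline schedule alternates the two phases described above; again, all names (`propose_and_score`, `freeze`, `sample_batch`) are illustrative rather than the actual implementation.

```python
# Illustrative sketch of one offline PasoDoble round (helper names are hypothetical).
def offline_round(proposer, solver, knowledge_base, question_buffer,
                  proposer_steps=10, solver_steps=10):
    # Phase 1: update the Proposer while the Solver is frozen (11),
    # storing the valid questions it produces in the Question Buffer (12).
    solver.freeze()
    for _ in range(proposer_steps):
        valid = propose_and_score(proposer, solver, knowledge_base)
        question_buffer.extend(valid)
    solver.unfreeze()

    # Phase 2: freeze the Proposer and train the Solver on the buffered questions (13).
    proposer.freeze()
    for _ in range(solver_steps):
        solver.update(question_buffer.sample_batch())
    proposer.unfreeze()
```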
✨TL;DR: The Proposer is rewarded for generating questions that are hard (low Solver accuracy) and diverse (not similar to recent ones), but only if the question is valid and well-formed.
The Proposer’s job is to generate math questions that are challenging and diverse. To guide this behavior, we design a reward composed of two parts: difficulty and diversity.
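As a rough sketch of this reward structure, one could gate on validity and sum a difficulty term with a similarity-based diversity term. The weights and the toy Jaccard similarity below are assumptions for illustration only, not PasoDoble's exact formulation.

```python
# Hedged sketch of the Proposer reward described above; the validity gate, weights,
# and toy similarity measure are assumptions, not PasoDoble's exact formulation.
def token_jaccard(a, b):
    """Toy stand-in for a real similarity measure: Jaccard overlap of whitespace tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)

def proposer_reward(question, solver_accuracy, recent_questions, is_valid,
                    w_difficulty=1.0, w_diversity=1.0):
    if not is_valid:                         # malformed or unanswerable questions earn no reward
        return 0.0
    difficulty = 1.0 - solver_accuracy       # low Solver accuracy -> high difficulty reward
    diversity = 1.0 - max((token_jaccard(question, q) for q in recent_questions), default=0.0)
    return w_difficulty * difficulty + w_diversity * diversity
```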