<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://taidnguyen.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://taidnguyen.github.io/" rel="alternate" type="text/html" /><updated>2026-06-29T20:46:44+00:00</updated><id>https://taidnguyen.github.io/feed.xml</id><title type="html">Tai Nguyen</title><subtitle>Researcher and engineer</subtitle><entry><title type="html">Self-play on a Vintage Language Model</title><link href="https://taidnguyen.github.io/blog/self-play-1930-model/" rel="alternate" type="text/html" title="Self-play on a Vintage Language Model" /><published>2026-06-14T00:00:00+00:00</published><updated>2026-06-14T00:00:00+00:00</updated><id>https://taidnguyen.github.io/blog/self-play-1930-model</id><content type="html" xml:base="https://taidnguyen.github.io/blog/self-play-1930-model/"><![CDATA[<p>Two lines of LLM-related work have piqued my interest. The first is <a href="https://talkie-lm.com/introducing-talkie">talkie</a>, a 13B language model trained on data that predates 1931. The “vintage” model presents many unique angles for persona research and for understanding LLM generalization. For instance, without ever seeing Python, talkie can solve 10% of HumanEval questions, albeit with non-zero contamination risk and extreme inefficiency (pass@100). The second is the excitement around self-improving LLMs, where a few papers stand out: <a href="https://arxiv.org/abs/2604.20209">Stanford paper</a>, <a href="https://arxiv.org/abs/2505.03335">Absolute Zero</a>, and <a href="https://arxiv.org/abs/2510.24684">SPICE</a>. These share a common thread: models generate their own tasks, answer them in a rollout phase, and sometimes act as their own verifiers.</p>

<p>I spent the weekend trying to combine these two threads. The main motivation is to understand whether we can get a lift in model capability ✨ for free ✨ in a data-constrained setting. The authors estimated that <em>talkie</em> was trained on only 260B tokens. For simplicity, I’m also starting with reasoning. For example, can we get talkie to be a bit more efficient or accurate on HumanEval?</p>

<p>The goal is not benchmark-maxxing, as talkie has much better <a href="https://resobscura.substack.com/p/are-vintage-llms-the-start-of-a-new">uses</a> :) We just want to validate the idea.</p>

<p>For convenience, our corpora pull from Project Gutenberg, period-appropriate books that <em>talkie</em> has likely already seen in pretraining. We start with arithmetic, and later add logic and science to test breadth:</p>
<ul>
  <li>1894, First Steps in Algebra, Wentworth (arithmetic)</li>
  <li>1830, Elements of Arithmetic, De Morgan (arithmetic)</li>
  <li>1896, Symbolic Logic, Carroll (logic)</li>
  <li>1861, The Chemical History of a Candle, Faraday (science)</li>
</ul>

<h2 id="recipe-self-play-in-corpus-environments-spice">Recipe: Self-Play In Corpus Environments (SPICE)</h2>
<p>SPICE proposes a generalizable recipe that cleverly leverages the data. Specifically, the model conditions on a passage to synthesize its own (question, answer, gold) triples, which it then reasons about during a second forward pass. As Challenger it reads a document and writes a question with a gold answer. As Reasoner it then tries to answer that question without seeing the document. The same weights train on both jobs at once.</p>

<p>This is an ordinary policy-gradient setup. The base model is our policy $\pi_\theta$, and we maximize its expected reward $J(\theta)$ by gradient ascent on the weights $\theta$. $J$ adds the reward the model earns as Challenger and as Reasoner, averaged over the corpus.</p>

\[J(\theta) = \mathbb{E}_{d \sim D}\Big[\, \underbrace{\mathbb{E}_{(q,a^*)\sim \pi_\theta(\cdot\mid d,\,C)}\big[r_C\big]}_{\text{Challenger}} \;+\; \underbrace{\mathbb{E}_{\hat a \sim \pi_\theta(\cdot\mid q,\,R)}\big[r_R\big]}_{\text{Reasoner}} \,\Big]\]

<p>We draw a document $d$ from the corpus $D$. From that document the Challenger writes a question and gold pair $(q, a^*)$ and earns $r_C$. The Reasoner then answers the question and earns $r_R$. We want weights that make both terms large.</p>

<p>The Reasoner reward is the easy one. It is one when the answer matches the gold and zero otherwise.</p>

\[r_R = \mathbb{1}[\hat a = a^*]\]

<p>The Challenger reward carries the core idea. We do not want to incentivize questions that are either too easy or too hard. The objective encourages question generation that lands right at the Reasoner’s edge. To find that edge, we let the Reasoner answer the same question $k$ times and mark each attempt right or wrong. Let $p$ be the fraction of those $k$ attempts that were correct. The spread of a yes or no outcome is its variance, $p(1-p)$. That spread is zero when the Reasoner gets the question every time ($p=1$) or never ($p=0$), and it is largest when the attempts split in half ($p=0.5$), where the variance equals $0.25$.</p>

<p>So the Challenger reward is a bump centered on that halfway point!</p>

\[r_C = \exp\!\left(-\frac{\big(\,p(1-p) - 0.25\,\big)^2}{2\sigma^2}\right)\]

<p>It is a Gaussian on the variance, equal to one at $p=0.5$. The width $\sigma$, a hyperparameter, sets how sharp the peak is ($\sigma^2 = 0.01$ in SPICE). With $k=8$ attempts, the reward is forgiving near the center and falls off steeply toward the extremes:</p>
<ul>
  <li>4/8 split rewards 1</li>
  <li>6/8 rewards 0.82</li>
  <li>7/8 rewards 0.37</li>
  <li>8/8 rewards 0.04 (close to nothing)</li>
</ul>

<figure style="margin:26px 0;text-align:center;">
  <img src="/assets/images/variance_reward.png" alt="Challenger reward versus Reasoner pass rate, a bump peaking at 0.5" style="width:100%;max-width:460px;" />
  <figcaption style="font-size:12px;color:#777;margin-top:2px;">Reward for the Challenger against the Reasoner's pass rate. Answering the question correctly half the time produces highest reward (frontier).</figcaption>
</figure>
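<p>As a quick sanity check, the bump can be evaluated directly. This is a minimal sketch of the formula above, assuming $\sigma^2 = 0.01$ and $k = 8$ attempts; the function name <code>challenger_reward</code> is mine, not SPICE’s.</p>

```python
import math


def challenger_reward(p: float, sigma_sq: float = 0.01) -> float:
    """Gaussian bump on the Bernoulli variance p(1 - p), peaking at p = 0.5."""
    variance = p * (1.0 - p)
    return math.exp(-((variance - 0.25) ** 2) / (2.0 * sigma_sq))


# Evaluate the reward for a few splits out of k = 8 Reasoner attempts.
for correct in (4, 6, 7, 8):
    p = correct / 8
    print(f"{correct}/8 -> {challenger_reward(p):.2f}")
```

<p>Because the reward depends on $p$ only through $p(1-p)$, it is symmetric: a question the Reasoner never solves earns the same small reward as one it always solves.</p>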

<p>Finally, we turn our reward into a gradient. Here, SPICE follows the group-relative approach: it subtracts a baseline, the average reward of the group, from each sample’s reward and calls the result the <em>advantage</em>.</p>

\[\hat A_i = r_i - \frac{1}{N}\sum_j r_j\]

<p>The policy gradient then nudges the model to make each sampled response more likely in proportion to its advantage,</p>

\[\nabla_\theta J = \mathbb{E}\big[\,\hat A_i \,\nabla_\theta \log \pi_\theta(\text{response}_i)\,\big]\]

<p>This is the Dr. GRPO advantage, as distinct from vanilla GRPO. GRPO divides the advantage by the group’s standard deviation, which quietly biases training toward low-variance prompts. <a href="https://arxiv.org/abs/2503.20783">“GRPO done right”</a> drops that normalization and keeps the plain mean-subtracted form. Importantly, we also center inside each role, Challenger against other Challengers and Reasoner against other Reasoners, since pooling the two would blur the comparison.</p>
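<p>The difference between the two advantage forms is easiest to see on toy numbers. The sketch below is illustrative only: the function names, the <code>eps</code> term, and the reward values are assumptions of mine, not code from SPICE or the Dr. GRPO paper.</p>

```python
from statistics import mean, pstdev


def dr_grpo_advantages(rewards):
    """Mean-subtracted advantage, with no division by the group std."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def grpo_advantages(rewards, eps=1e-6):
    """Vanilla GRPO form: the same centering, then divide by the group std."""
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


def role_centered_advantages(samples):
    """Center each (role, reward) sample against the mean of its own role."""
    by_role = {}
    for role, reward in samples:
        by_role.setdefault(role, []).append(reward)
    baselines = {role: mean(values) for role, values in by_role.items()}
    return [reward - baselines[role] for role, reward in samples]


balanced = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # 4/8 correct
lopsided = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # 1/8 correct
print(max(dr_grpo_advantages(balanced)), max(grpo_advantages(balanced)))
print(max(dr_grpo_advantages(lopsided)), max(grpo_advantages(lopsided)))
```

<p>In the lopsided group the standard deviation is small, so dividing by it stretches the lone success from 0.875 to roughly 2.65, while the balanced group stays near 1. That rescaling is what the mean-only form avoids.</p>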

<p>Here are some talkie outputs:</p>

<style>
.ex { border:1px solid #e6e6e6; border-radius:8px; margin:22px 0; overflow:hidden; font-size:13.5px; line-height:1.55; }
.ex-h { background:#f6f6f4; padding:8px 16px; font-weight:700; font-size:12px; color:#555; border-bottom:1px solid #ececec; }
.ex-b { padding:15px 16px; }
.ex-block { margin-bottom:14px; }
.ex-block:last-child { margin-bottom:0; }
.ex-lab { display:block; font-size:10px; letter-spacing:0.08em; text-transform:uppercase; font-weight:700; margin-bottom:4px; }
.ex-pass { color:#aaa; }
.ex-chal { color:#2f6fb5; }
.ex-reas { color:#b5402f; }
.ex-passage { font-family:Georgia, serif; font-style:italic; color:#555; border-left:3px solid #e2e2e2; padding-left:13px; }
.ex-phase { font-size:10.5px; letter-spacing:0.08em; text-transform:uppercase; font-weight:700; color:#999; margin:16px 0 11px; padding-top:13px; border-top:1px dashed #e3e3e3; }
.ex-gold { color:#aaa; }
.qag { display:flex; gap:8px; margin:3px 0; }
.qk { flex:0 0 34px; color:#999; font-weight:700; font-size:10px; text-transform:uppercase; letter-spacing:0.05em; padding-top:2px; }
.ex-prompt { font-family:'SFMono-Regular',Consolas,monospace; font-size:11.5px; line-height:1.5; color:#666; background:#f8f8f6; border:1px solid #ededed; border-radius:4px; padding:9px 11px; white-space:pre-wrap; }
.ex-cur { color:#2f6fb5; font-weight:700; }
</style>

<div class="ex">
  <div class="ex-h">Example #1 &middot;  Carroll, Symbolic Logic</div>
  <div class="ex-b">
    <div class="ex-block"><span class="ex-lab ex-pass">Our instruction</span><div class="ex-prompt">No fat creatures run well; some greyhounds run well.
Q. No fat creatures run well, and some greyhounds run well. Does it follow that some greyhounds are not fat? Options: (A) yes (B) no (C) cannot be determined
A. Some greyhounds run well, and no fat creature runs well, so those greyhounds are not fat. The answer is (A).

All dogs are mammals; all mammals are animals.
Q. All dogs are mammals, and all mammals are animals. Does it follow that all dogs are animals? Options: (A) yes (B) no (C) cannot be determined
A. Dogs are mammals and mammals are animals, so dogs are animals. The answer is (A).

No birds are fish; some pets are birds.
Q. No birds are fish, and some pets are birds. Does it follow that no pets are fish? Options: (A) yes (B) no (C) cannot be determined
A. Some pets are birds and no bird is a fish, but other pets could still be fish, so it does not follow. The answer is (C).

Every philosopher is logical; no logical person is obstinate.
Q. Every philosopher is logical, and no logical person is obstinate. Does it follow that no philosopher is obstinate? Options: (A) yes (B) no (C) cannot be determined
A. Philosophers are logical and the logical are not obstinate, so no philosopher is obstinate. The answer is (A).

[INSERT PASSAGE]
Q. <span class="ex-cur">▍</span> </div></div>
    <div class="ex-block"><span class="ex-lab ex-pass">Passage</span>For example, we may imagine that we have picked out, from the Class "Things," all the Things which possess the Adjunct "material, artificial, consisting of houses and streets"; and we may thus form the Real Class "towns." Here we may regard "Things" as a Genus, "Towns" as a Species of Things, and "material, artificial, consisting of houses and streets" as its Differentia.</div>
    <div class="ex-block"><span class="ex-lab ex-chal">Challenger</span><div class="qag"><span class="qk">Q</span><span>Does it follow that all Towns are material, artificial and consisting of houses and streets? &nbsp;(A) yes &nbsp;(B) no</span></div><div class="qag"><span class="qk">A</span><span>The answer is (A); for "Towns" are a Species of Things, and their Differentia agrees with the Genus.</span></div><div class="qag"><span class="qk">Gold</span><span>(A)</span></div></div>
    <div class="ex-block"><span class="ex-lab ex-reas">Reasoner</span>1. The answer is (A). 2. We have here a Division of the Genus Things into its Differentiae.</div>
  </div>
</div>

<p>To help the model, we prompt it with few-shot samples (varied per training corpus). The task is intentionally simplified to yes / no / cannot be determined, which keeps it hopefully within the model’s capability. Here the multiple-choice format is followed and the CoT is solid. Two things could still be better: (1) the question is not terribly interesting, and (2) it is not self-contained, so the Reasoner cannot reasonably solve it without the passage.</p>
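<p>Scoring the Reasoner on this format requires pulling the chosen letter out of a free-running completion. The sketch below assumes the “The answer is (A)” phrasing from the few-shot prompt and takes the first match, so trailing text is ignored; the pattern and function names are illustrative, not the parser used in the run.</p>

```python
import re

# Matches the phrasing used in the few-shot exemplars, e.g. "The answer is (A)."
ANSWER_PATTERN = re.compile(r"answer is \(([A-Z])\)", re.IGNORECASE)


def extract_choice(completion: str):
    """Return the first option letter the completion commits to, or None."""
    match = ANSWER_PATTERN.search(completion)
    return match.group(1).upper() if match else None


def reasoner_reward(completion: str, gold: str) -> int:
    """Exact-match reward: 1 if the extracted letter equals the gold, else 0."""
    return int(extract_choice(completion) == gold.strip("() ").upper())


completion = "1. The answer is (A). 2. We have here a Division of the Genus Things."
print(extract_choice(completion), reasoner_reward(completion, "(A)"))
```

<p>A completion that never states an answer in this form extracts as <code>None</code> and scores zero, which treats malformed output the same as a wrong answer.</p>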

<!-- <div class="ex">
  <div class="ex-h">Example #2 &middot; arithmetic, early vs late</div>
  <div class="ex-b">
    <div class="ex-block"><span class="ex-lab ex-pass">Prompt &middot; few-shot, no instructions</span><div class="ex-prompt">Q. What is (8 &minus; 2) &minus; (5 &minus; 3)?
Ans. 4

Q. What is 9 &times; 4?  Options: (A) 36  (B) 35  (C) 12
Ans. 36

  ⟨ a few such exemplars, free-form and multiple-choice ⟩

&lt;passage below&gt;
Q. <span class="ex-cur">▍</span>  <span style="color:#aaa;">talkie continues here</span></div></div>
    <div class="ex-block"><span class="ex-lab ex-pass">Passage &middot; De Morgan, Elements of Arithmetic</span><div class="ex-passage">9 &times; 8 means that 8 is to be taken 9 times, and is the same thing as 72. This is the product of 9 and 8, and is read nine into eight. When two numbers, with the foregoing signs, are the same, the sign = is put between them. Thus, that 7 and 5 make 12, is written 7 + 5 = 12.</div></div>
    <div class="ex-phase">Early &middot; step 25</div>
    <div class="ex-block"><span class="ex-lab ex-chal">Challenger</span><div class="qag"><span class="qk">Q</span><span>What is 12 &minus; (8 + 4)?</span></div><div class="qag"><span class="qk">A</span><span>8</span></div><div class="qag"><span class="qk">Gold</span><span>0 &nbsp;<span class="ex-gold">(computed)</span></span></div></div>
    <div class="ex-block"><span class="ex-lab ex-reas">Reasoner</span>4</div>
    <div class="ex-phase">Late &middot; step 375</div>
    <div class="ex-block"><span class="ex-lab ex-chal">Challenger</span><div class="qag"><span class="qk">Q</span><span>What is 4 &times; 2?</span></div><div class="qag"><span class="qk">A</span><span>8</span></div><div class="qag"><span class="qk">Gold</span><span>8 &nbsp;<span class="ex-gold">(computed)</span></span></div></div>
    <div class="ex-block"><span class="ex-lab ex-reas">Reasoner</span>8</div>
  </div>
</div>
<p style="font-size:11.5px;color:#999;margin:-4px 0 18px;">The arithmetic loop retreated rather than escalated. Early it reached for a two-step problem and missed, its own key included. Late it had fallen back to a one-step it could ace.</p> -->

<h2 id="results">Results</h2>

<p>Before the ablations dig in, here is the short version. On this data-constrained vintage model, corpus-grounded self-play does buy a real capability lift, nearly for free. Held-out arithmetic climbs over training and holds in the 0.6 to 0.7 range, up from around 0.4 to 0.5 at the start.</p>

<p>A handful of evals, step 0 of the run against its last step, locates the gain.</p>

<table>
  <thead>
    <tr>
      <th>Eval task</th>
      <th>step 0</th>
      <th>last step</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Arithmetic free form (in distribution)</td>
      <td>0.40</td>
      <td>0.70</td>
    </tr>
    <tr>
      <td>Arithmetic ranked classification (in distribution)</td>
      <td>0.53</td>
      <td>0.72</td>
    </tr>
    <tr>
      <td>Morse decode (near transfer)</td>
      <td>—</td>
      <td>—</td>
    </tr>
    <tr>
      <td>HumanEval pass@1 (far transfer)</td>
      <td>—</td>
      <td>—</td>
    </tr>
    <tr>
      <td>HumanEval pass@5 (far transfer)</td>
      <td>0.11</td>
      <td>0.11</td>
    </tr>
  </tbody>
</table>

<p>The two in-distribution rows move while near and far transfer stay flat, which is what a narrow loop on a weak model should do. It sharpens the trained skill without spilling into general reasoning.</p>

<p>You can watch the loop that earns it. The two roles stay in tension the whole way, the Challenger reward holding its mid band while the Reasoner reward and frontier rate sit near a half rather than running to zero or one.</p>

<figure style="margin:26px 0;text-align:center;">
  <img src="/assets/images/training_dynamics.png" alt="Challenger reward, Reasoner reward and frontier rate over training" style="width:100%;" />
  <figcaption style="font-size:12px;color:#777;margin-top:2px;">Training dynamics on the healthy oracle run. Challenger reward, Reasoner reward and frontier rate all hold their bands rather than saturating.</figcaption>
</figure>

<p>Two conditions decide whether that happens. The reward has to be verifiable. With computed gold the eval climbs, but when the model grades its own answers it collapses to about 0.3. And the lift stays in-band. Arithmetic improves while HumanEval and MMLU never move, so this is a gain on the trained skill, not on general reasoning.</p>

<p>The ablations below pin down each piece: where the gain comes from, what does not matter, and where the loop breaks.</p>

<h3 id="ablation-is-groundtruth-required-for-spice">Ablation: Is groundtruth required for SPICE?</h3>

<p>Since our chosen datasets might come with ground truth (e.g., math textbooks) and computed gold (e.g., simple arithmetic), we run a quick ablation to compare computed gold against self-generated gold, where the Challenger writes its own key.</p>

<p>When the Challenger wrote its own key, the key was right only 11% of the time, and on 73% of the tasks both halves agreed on a wrong answer. The errors were not random. They sat on the two things the model fails at, distributing over parentheses and signs on negative results. The Challenger states a wrong answer, the Reasoner shares the same blind spot and agrees, $p$ goes to one, and the wrong value trains as the gold. A model cannot correct a mistake that both of its halves make.</p>

<div style="font-family:'SFMono-Regular',Consolas,monospace;font-size:12.5px;line-height:1.55;background:#fafafa;border:1px solid #ececec;border-radius:4px;padding:12px 14px;margin:18px 0;color:#333;white-space:pre-wrap;">Challenger   What is 6 × (4 + 2)?       its key 24       true value 36
Reasoner     24  24  24  24  24  24  24  18     (eight tries)</div>
<p style="font-size:11.5px;color:#999;margin:-8px 0 18px;">Both halves drop the parentheses to 6 × 4 and agree on 24, so the wrong value trains as the gold. The true answer 36 never appears.</p>

<p>With computed gold none of this happens, because the key is never invented.</p>
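<p>One way to compute a gold for arithmetic questions is to evaluate the expression with a restricted parser instead of trusting the model. This is a hedged sketch, assuming questions reduce to integer expressions over +, − and × with parentheses; <code>computed_gold</code> is a hypothetical helper, not the oracle used in these runs.</p>

```python
import ast
import operator

# Only these binary operators are allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def computed_gold(expression: str) -> int:
    """Evaluate +, -, x over integers with parentheses; reject anything else."""
    normalized = expression.replace("×", "*").replace("−", "-").strip()
    tree = ast.parse(normalized, mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(tree)


print(computed_gold("6 × (4 + 2)"))
```

<p>Walking the syntax tree rather than calling <code>eval</code> keeps model-written text from executing arbitrary code, and the parser handles exactly the parentheses and negative results where the self-written keys went wrong.</p>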

<p>The eval is the symptom, and the training signal shows the cause. Held-out arithmetic holds around 0.60 under computed gold and slides to 0.30 under self grading (left). Meanwhile the self-written key is right only about a tenth of the time, and the share of tasks where both halves agree on a wrong answer climbs past 0.7 (right). At the model’s current level, the oracle is not a luxury.</p>

<div style="display:flex;flex-wrap:wrap;gap:18px;margin:24px 0;align-items:flex-start;">
  <figure style="flex:1 1 260px;margin:0;text-align:center;">
    <img src="/assets/images/oracle_vs_self.png" alt="Held-out arithmetic, oracle holds while self grading erodes" style="width:100%;" />
    <figcaption style="font-size:11.5px;color:#777;margin-top:4px;">Held-out arithmetic. Same setup, only the gold source differs.</figcaption>
  </figure>
  <figure style="flex:1 1 260px;margin:0;text-align:center;">
    <img src="/assets/images/gold_integrity.png" alt="Self-grading collusion rises while gold accuracy stays low" style="width:100%;" />
    <figcaption style="font-size:11.5px;color:#777;margin-top:4px;">The self-grading run. Gold accuracy near 0.1, collusion climbing past 0.7.</figcaption>
  </figure>
</div>
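<p>Both curves can be computed from logged tasks once a computed gold is available for comparison. The sketch below assumes that a task counts as collusion when the key is wrong and the Reasoner’s majority answer matches it; that definition, the <code>Task</code> record, and the function names are my assumptions for illustration.</p>

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Task:
    key: str            # gold written by the Challenger
    true_value: str     # gold computed independently
    answers: list[str]  # Reasoner attempts on the question


def gold_accuracy(tasks):
    """Share of tasks whose self-written key matches the computed gold."""
    return sum(t.key == t.true_value for t in tasks) / len(tasks)


def collusion_rate(tasks):
    """Share of tasks where the key is wrong and the Reasoner majority agrees."""
    def colludes(t):
        majority, _ = Counter(t.answers).most_common(1)[0]
        return t.key != t.true_value and majority == t.key
    return sum(colludes(t) for t in tasks) / len(tasks)


# Three tasks taken from the examples in this post.
tasks = [
    Task("24", "36", ["24"] * 7 + ["18"]),  # wrong key, Reasoner agrees
    Task("8", "0", ["4"]),                  # wrong key, Reasoner disagrees
    Task("8", "8", ["8"]),                  # correct key
]
print(gold_accuracy(tasks), collusion_rate(tasks))
```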

<h3 id="ablation-does-the-advantage-form-matter">Ablation: Does the advantage form matter?</h3>

<p>Both forms reduce the Reasoner to a single number $p$, its chance of answering correctly, which feeds both the variance reward and the Reasoner’s own advantage. They differ only in how $p$ is obtained.</p>

<p>The <strong>rollout</strong> form samples $k$ answers and counts the hits, a Monte Carlo estimate:</p>

\[p_{\text{rollout}} = \frac{1}{k}\sum_{i=1}^{k}\mathbb{1}[\hat a_i = a^*], \qquad \hat a_i \sim \pi_\theta(\cdot \mid q)\]

<p><strong>correct_prob_norm</strong> skips the sampling. For a multiple-choice question it scores each option by its length-normalized log-likelihood and takes the softmax mass on the gold option:</p>

\[p_{\text{cpn}} = \frac{e^{\ell(a^*)}}{\sum_{o}\, e^{\ell(o)}}, \qquad \ell(o) = \frac{1}{|o|}\log \pi_\theta(o \mid q)\]

<p>It is the noiseless version of the same pass rate, the exact probability rather than a $k$-sample estimate of it.</p>
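<p>The two estimators can be written side by side. This is a minimal sketch of the formulas above, assuming the per-option summed log-probabilities and token lengths are already available; the function names and the toy numbers are illustrative.</p>

```python
import math


def p_rollout(answers, gold):
    """Monte Carlo estimate: fraction of sampled answers that match the gold."""
    return sum(a == gold for a in answers) / len(answers)


def p_cpn(option_logprobs, option_lengths, gold):
    """Softmax mass on the gold option over length-normalized log-likelihoods.

    option_logprobs[o] is the summed token log-prob of option o given the
    question, and option_lengths[o] is its length in tokens.
    """
    scores = {o: option_logprobs[o] / option_lengths[o] for o in option_logprobs}
    peak = max(scores.values())  # subtract the max for numerical stability
    weights = {o: math.exp(s - peak) for o, s in scores.items()}
    return weights[gold] / sum(weights.values())


print(p_rollout(["24"] * 7 + ["18"], "24"))
print(p_cpn({"A": -2.0, "B": -4.0, "C": -6.0}, {"A": 2, "B": 2, "C": 2}, "A"))
```

<p>The rollout estimate can only take values in multiples of $1/k$, while the likelihood form varies continuously, which is the sense in which it is noiseless.</p>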

<p>No measurable difference. Both trained the loop the same way. The honest reason is that the comparison barely had room to show, since the model chose multiple choice only about 15% of the time and answered the rest free form, so the two scorers rarely disagreed. We log it as a null rather than a win for either.</p>

<!-- ### Does chain of thought help

talkie cannot be told to reason, it only completes text. So chain of thought has to be shown through the few shot examples rather than asked for. We wanted to know if demonstrating the steps before the answer lifts anything.

It is neutral and safe. Arithmetic landed at 0.60 with the steps shown and 0.60 without them, and it never triggered the runaway we worried about. Good to know it does no harm, but it is not a lever here. -->

<h3 id="reward-hacks">Reward hacks</h3>

<p>Self-play did discover a few paths of least resistance.</p>

<p>For instance, it learned to leak the answer into the question. Asked for an open task it would write something like “what is 7 plus 4, that is 11,” so the Reasoner only had to copy. We gate this by checking that the answer does not already sit in the question.</p>
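<p>A leak gate of this kind can be as simple as a whole-token search for the answer inside the question. The sketch below is one possible version, not the exact check used in the run; <code>leaks_answer</code> is a hypothetical name.</p>

```python
import re


def leaks_answer(question: str, answer: str) -> bool:
    """True when the answer already appears in the question as a whole token."""
    answer = answer.strip()
    if not answer:
        return False
    pattern = r"\b" + re.escape(answer) + r"\b"
    return re.search(pattern, question, flags=re.IGNORECASE) is not None


print(leaks_answer("what is 7 plus 4, that is 11", "11"))
print(leaks_answer("What is 7 plus 4?", "11"))
```

<p>The check is deliberately conservative. A legitimate question whose answer happens to equal one of its operands would also be rejected, which costs a few tasks but closes the exploit.</p>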

<p>When we relaxed the match to accept free text, hoping to admit softer questions, it found the cleanest exploit of all. It made the question and the answer nearly identical, so any overlap check fired at $p$ equal to one.</p>

<div style="font-family:'SFMono-Regular',Consolas,monospace;font-size:12.5px;line-height:1.55;background:#fafafa;border:1px solid #ececec;border-radius:4px;padding:12px 14px;margin:18px 0;color:#333;white-space:pre-wrap;">Q.  ...5 nationalities, all boys, sit together... What nationality are they?
A.  ...5 nationalities, all boys, sit together... They are Wales, England, Scotland...</div>
<p style="font-size:11.5px;color:#999;margin:-8px 0 18px;">From the free text run. The answer just restates the question, so the matcher always passes.</p>

<p>This eventually collapsed the model and dropped HumanEval to 0.036. Any low-entropy answer space also invites the same trick. Yes or no, and a bare “(A),” both got gamed, since a guess lands often enough to look like skill. We verify the gold against the source and shuffle the option positions.</p>
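<p>Shuffling option positions is straightforward as long as the gold letter is remapped along with the options. A minimal sketch, with illustrative names and a seeded generator so the example is reproducible:</p>

```python
import random

LETTERS = "ABCDEFGH"


def shuffle_options(options, gold_index, rng):
    """Shuffle the options and return them with the gold's new letter."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_gold = order.index(gold_index)  # where the original gold ended up
    return shuffled, LETTERS[new_gold]


options = ["yes", "no", "cannot be determined"]
shuffled, gold_letter = shuffle_options(options, 0, random.Random(0))
print(shuffled, gold_letter)
```

<p>With positions randomized per question, always answering “(A)” is right only at chance, so a fixed letter no longer looks like skill.</p>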

<p>Interestingly, SPICE contains a built-in guard against this kind of hack. Under the variance reward, copying the answer drives $p$ to one, and $p=1$ sits at the bottom of the reward rather than the top.</p>

<h2 id="scaling-self-play-with-self-guidance">Scaling self-play with self-guidance</h2>

<p>The last question is the tempting one. Drop the oracle, let the model generate and grade and answer across many corpora, and hope breadth carries it. We pooled two more period books with the arithmetic pair, Lewis Carroll’s <em>Symbolic Logic</em> and Faraday’s <em>Chemical History of a Candle</em>.</p>

<p>The questions it wrote were often a delight. Reading those passages, talkie composed its own Carroll-style syllogisms (Example #1) and even candle chemistry, answered in the same register.</p>

<div style="font-family:'SFMono-Regular',Consolas,monospace;font-size:12.5px;line-height:1.55;background:#fafafa;border:1px solid #ececec;border-radius:4px;padding:12px 14px;margin:18px 0;color:#333;white-space:pre-wrap;">Challenger   What is the product of a candle burning?
Reasoner     The answer is Water.</div>
<p style="font-size:11.5px;color:#999;margin:-8px 0 18px;">From Faraday's candle lectures, answered without the passage.</p>

<p>The trouble was never the questions. It was the grading. A single arithmetic corpus under self grading already colludes about 73% of the time. Pool the four and that climbs to 99%, the model agreeing with itself on nearly every wrong answer. Capability went with it and HumanEval fell from 0.107 to 0.036. More surface to hide in is not more signal.</p>

<p>You can watch the loop give up over training. Early on the Challenger reaches for two step problems and misses. Later it has retreated to single steps it can ace, which pays nothing and teaches nothing.</p>

<div style="font-family:'SFMono-Regular',Consolas,monospace;font-size:12.5px;line-height:1.55;background:#fafafa;border:1px solid #ececec;border-radius:4px;padding:12px 14px;margin:18px 0;color:#333;white-space:pre-wrap;">step 25    Challenger  What is 12 − (8 + 4)?    key 8   true 0    Reasoner  4
step 375   Challenger  What is 4 × 2?           key 8   true 8    Reasoner  8</div>
<p style="font-size:11.5px;color:#999;margin:-8px 0 18px;">The curriculum backed away from hard questions rather than climbing toward them, the opposite of what the variance reward intends.</p>

<h3 id="a-frozen-base-as-the-answer-key">A frozen base as the answer key</h3>

<p>If the problem is a corrupt answer key, fix the key, not the student. The cheap fix that imports no outside teacher is the model’s own untrained base. Before a task is allowed to count, the frozen base, the LoRA adapter switched off so it costs no extra weights, has to agree with the gold the policy wrote. A KL leash would not help here. Two policies a short distance apart still collude, because the base shares the blind spot, and a global leash would also strangle the arithmetic gains we run at β=0 to keep. The gate only filters drift, the part of collusion that runs away. Blind spots the base already has slip through, but those are bounded at the base error rate while drift is what compounds to 0.99.</p>
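<p>In code, the gate is just a filter on the tasks the policy writes. The sketch below stands in a lookup table for the frozen base (the model with its adapter off), and the third task is a made-up example of drift; all names and values here are illustrative assumptions.</p>

```python
from typing import Callable, Iterable


def base_gate(tasks: Iterable[tuple[str, str]], base_answer: Callable[[str], str]):
    """Keep only (question, gold) pairs where the frozen base reproduces the gold."""
    return [(q, gold) for q, gold in tasks if base_answer(q).strip() == gold.strip()]


# Toy stand-in for the frozen base; a real run would query the model instead.
FROZEN_BASE = {
    "What is 6 × 5?": "30",
    "What is 6 × (4 + 2)?": "24",  # shared blind spot: the base is wrong too
    "What is 7 × 8?": "56",
}

policy_tasks = [
    ("What is 6 × 5?", "30"),        # correct gold, kept
    ("What is 6 × (4 + 2)?", "24"),  # wrong gold the base shares, slips through
    ("What is 7 × 8?", "54"),        # drifted gold the base rejects, filtered
]
kept = base_gate(policy_tasks, FROZEN_BASE.__getitem__)
print(kept)
```

<p>The second task shows the limit described above. The gate cannot catch an error the base makes itself; it only stops the policy from drifting further than the base already is.</p>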

<p>It worked on the failure it was built for. Across 394 steps the agreement-on-wrong-answers rate sat near 0.05 to 0.14 and fell as training went on rather than climbing to 0.99. Self-written gold went from about two thirds correct to nineteen in twenty.</p>

<p>And then the loop went quiet. The fraction of questions landing at the frontier all but emptied, and the collusion rate fell not because the blind spot healed but because the questions got trivial. Late in training the Challenger is asking what is six times five, and the base, the policy and the gold all agree it is thirty.</p>

<div style="font-family:'SFMono-Regular',Consolas,monospace;font-size:12.5px;line-height:1.55;background:#fafafa;border:1px solid #ececec;border-radius:4px;padding:12px 14px;margin:18px 0;color:#333;white-space:pre-wrap;">early   Challenger  What is 6 × (4 + 2)?    gold 24   true 36
late    Challenger  What is 6 × 5?          gold 30   true 30</div>
<p style="font-size:11.5px;color:#999;margin:-8px 0 18px;">The gate killed the runaway, then the curriculum retreated to questions everyone could ace. Self-guidance buys safety, not capability.</p>

<div style="display:flex;flex-wrap:wrap;gap:18px;margin:24px 0;align-items:flex-start;">
  <figure style="flex:1 1 260px;margin:0;text-align:center;">
    <img src="/assets/images/gate_arith.png" alt="Held-out arithmetic, unguided self versus the gated run" style="width:100%;" />
    <figcaption style="font-size:11.5px;color:#777;margin-top:4px;">Held-out arithmetic. The gate keeps it from cratering, but does not push it up.</figcaption>
  </figure>
  <figure style="flex:1 1 260px;margin:0;text-align:center;">
    <img src="/assets/images/self_guidance_gate.png" alt="Collusion rate, unguided self compounds while the gate holds it low" style="width:100%;" />
    <figcaption style="font-size:11.5px;color:#777;margin-top:4px;">Collusion rate. Unguided self compounds toward 0.99; the gate holds it low.</figcaption>
  </figure>
</div>

<p>So the gate does what it promises and no more. It removes the runaway, but it cannot push difficulty upward, and a weak model with nothing pulling the frontier higher drifts to the easy questions it already knows. That is the same ceiling as everywhere else in this post. Capability came only from a key the model could not write for itself.</p>

<!-- ## What hid the signal

None of the hard parts failed loudly. Each looked the same from the outside, a flat metric that reads as the model cannot do this.

Format was the first wall. About a fifth of the answers were already correct, buried under a convention the model could not produce. Modern templates scored zero. The period catechism worked. We give it a passage, then `Q.`, and read what follows `Ans.`, the 1930 version of a boxed answer. That alone took the well formed rate from nothing to about a third.

Learning rate was a quieter one. Too high and the Reasoner outruns the weak Challenger, the success rate saturates, the variance signal dies around step 75, and the eval degrades with it. 1e-5 held a balanced loop for a full run. The same ceiling at 2e-5 collapsed sooner.

A completion model from 1930 does not stop at the answer. It writes the answer, then keeps going and starts inventing the next exercise in the textbook's voice.

<div style="font-family:'SFMono-Regular',Consolas,monospace;font-size:12.5px;line-height:1.55;background:#fafafa;border:1px solid #ececec;border-radius:4px;padding:12px 14px;margin:18px 0;color:#333;white-space:pre-wrap;">Reasoner   1. The answer is (A).
           Q. 2. Does it follow that every rule is valid? Options: (B) no.
           A.        ← it has begun writing the next exercise itself</div>
<p style="font-size:11.5px;color:#999;margin:-8px 0 18px;">It answers, then keeps composing the textbook.</p>

This is why two of our parsers lied. They grabbed the runaway continuation instead of the leading answer, so arithmetic read near zero for weeks when the truth was about 0.6. We stop the generation at the first boundary and read the leading number.

The real signal was there from the first day at about twenty percent. The work was removing the things hiding it, one at a time. Treat every flat metric as a broken instrument until the raw output proves otherwise. -->

<h2 id="references">References</h2>

<ul>
  <li>SPICE: Self-Play In Corpus Environments. <a href="https://arxiv.org/abs/2510.24684">arXiv:2510.24684</a></li>
  <li>Absolute Zero: Reinforced Self-play Reasoning with Zero Data. <a href="https://arxiv.org/abs/2505.03335">arXiv:2505.03335</a></li>
  <li>Understanding R1-Zero-Like Training (Dr. GRPO). <a href="https://arxiv.org/abs/2503.20783">arXiv:2503.20783</a></li>
  <li>Stanford paper. <a href="https://arxiv.org/abs/2604.20209">arXiv:2604.20209</a> <!-- title to confirm --></li>
  <li>talkie. <a href="https://talkie-lm.com/introducing-talkie">talkie-lm.com</a></li>
  <li>Are vintage LLMs the start of a new kind of history? <a href="https://resobscura.substack.com/p/are-vintage-llms-the-start-of-a-new">Res Obscura</a></li>
</ul>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Lessons from applying SPICE on a pre-1931 language model.]]></summary></entry><entry><title type="html">DataDecide: How to Predict Best Pretraining Data with Small Experiments</title><link href="https://taidnguyen.github.io/blog/datadecide/" rel="alternate" type="text/html" title="DataDecide: How to Predict Best Pretraining Data with Small Experiments" /><published>2025-01-02T00:00:00+00:00</published><updated>2025-01-02T00:00:00+00:00</updated><id>https://taidnguyen.github.io/blog/datadecide</id><content type="html" xml:base="https://taidnguyen.github.io/blog/datadecide/"><![CDATA[]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">MMTEB: Massive Multilingual Text Embedding Benchmark</title><link href="https://taidnguyen.github.io/blog/mmteb/" rel="alternate" type="text/html" title="MMTEB: Massive Multilingual Text Embedding Benchmark" /><published>2025-01-01T00:00:00+00:00</published><updated>2025-01-01T00:00:00+00:00</updated><id>https://taidnguyen.github.io/blog/mmteb</id><content type="html" xml:base="https://taidnguyen.github.io/blog/mmteb/"><![CDATA[]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">In-context Example Selection with Influences</title><link href="https://taidnguyen.github.io/blog/icl_influences/" rel="alternate" type="text/html" title="In-context Example Selection with Influences" /><published>2024-01-01T00:00:00+00:00</published><updated>2024-01-01T00:00:00+00:00</updated><id>https://taidnguyen.github.io/blog/icl_influences</id><content type="html" xml:base="https://taidnguyen.github.io/blog/icl_influences/"><![CDATA[]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title 
type="html">Explanation-based Finetuning Makes Models More Robust to Spurious Cues</title><link href="https://taidnguyen.github.io/blog/explanation_robust/" rel="alternate" type="text/html" title="Explanation-based Finetuning Makes Models More Robust to Spurious Cues" /><published>2023-01-02T00:00:00+00:00</published><updated>2023-01-02T00:00:00+00:00</updated><id>https://taidnguyen.github.io/blog/explanation_robust</id><content type="html" xml:base="https://taidnguyen.github.io/blog/explanation_robust/"><![CDATA[]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Big Data Bowl</title><link href="https://taidnguyen.github.io/blog/big_data_bowl/" rel="alternate" type="text/html" title="Big Data Bowl" /><published>2023-01-01T00:00:00+00:00</published><updated>2023-01-01T00:00:00+00:00</updated><id>https://taidnguyen.github.io/blog/big_data_bowl</id><content type="html" xml:base="https://taidnguyen.github.io/blog/big_data_bowl/"><![CDATA[<p>One of 5 finalists, winning $15,000. We got to meet the Director of Research of the NFL and had a professional video made.</p>]]></content><author><name></name></author><category term="project" /><summary type="html"><![CDATA[One of 5 finalists, winning $15,000. 
We got to meet the Director of Research of the NFL and had a professional video made.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://taidnguyen.github.io/images/DataBowlScreenshot.png" /><media:content medium="image" url="https://taidnguyen.github.io/images/DataBowlScreenshot.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Software Entity Recognition with Noise-robust Learning</title><link href="https://taidnguyen.github.io/blog/software_ner/" rel="alternate" type="text/html" title="Software Entity Recognition with Noise-robust Learning" /><published>2023-01-01T00:00:00+00:00</published><updated>2023-01-01T00:00:00+00:00</updated><id>https://taidnguyen.github.io/blog/software_ner</id><content type="html" xml:base="https://taidnguyen.github.io/blog/software_ner/"><![CDATA[]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Underthesea</title><link href="https://taidnguyen.github.io/blog/underthesea/" rel="alternate" type="text/html" title="Underthesea" /><published>2022-01-02T00:00:00+00:00</published><updated>2022-01-02T00:00:00+00:00</updated><id>https://taidnguyen.github.io/blog/underthesea</id><content type="html" xml:base="https://taidnguyen.github.io/blog/underthesea/"><![CDATA[<p>Contributed a small amount to an open-source Vietnamese toolkit built by the amazing <a href="https://github.com/rain1024">Anh Vu</a>. This helped me get started on NLP.</p>]]></content><author><name></name></author><category term="project" /><summary type="html"><![CDATA[Contributed a small amount to an open-source Vietnamese toolkit built by the amazing Anh Vu. 
This helped me get started on NLP.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://taidnguyen.github.io/images/underthesea.png" /><media:content medium="image" url="https://taidnguyen.github.io/images/underthesea.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">STEAM For Vietnam</title><link href="https://taidnguyen.github.io/blog/steam_for_vietnam/" rel="alternate" type="text/html" title="STEAM For Vietnam" /><published>2022-01-01T00:00:00+00:00</published><updated>2022-01-01T00:00:00+00:00</updated><id>https://taidnguyen.github.io/blog/steam_for_vietnam</id><content type="html" xml:base="https://taidnguyen.github.io/blog/steam_for_vietnam/"><![CDATA[<p>
During Covid, I volunteered for a non-profit that provides free online education for Vietnamese children. I worked on the data science team.
</p>]]></content><author><name></name></author><category term="project" /><summary type="html"><![CDATA[During Covid, I volunteered for a non-profit that provides free online education for Vietnamese children. I worked on the data science team.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://taidnguyen.github.io/images/s4vlogo.png" /><media:content medium="image" url="https://taidnguyen.github.io/images/s4vlogo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Top 100</title><link href="https://taidnguyen.github.io/top100/" rel="alternate" type="text/html" title="Top 100" /><published>1997-01-01T00:00:00+00:00</published><updated>1997-01-01T00:00:00+00:00</updated><id>https://taidnguyen.github.io/top100</id><content type="html" xml:base="https://taidnguyen.github.io/top100/"><![CDATA[<p>Things that are probably worth doing before I die, with a sprinkle of memorable life events.</p>

<p><em>Last updated: April 9, 2025</em></p>

<p><strong>Total: <span id="total-complete"></span></strong></p>

<ol id="bucket-list">
  <li>✓ Explore Sơn Đoòng (<a href="/images/AfterWeddingCake.jpg">view below Wedding Cake</a>)</li>
  <li>✓ Explore Sơn Đoòng x2 (came back for Hang Va in 2024)</li>
  <li>✓ Swim in Ha Long Bay</li>
  <li>▢ Climb Fansipan</li>
  <li>▢ Climb Mt. Rainier</li>
  <li>▢ Hike the Grand Canyon</li>
  <li>✓ Visit Switzerland</li>
  <li>▢ Visit 30 countries (14/30)</li>
  <li>▢ Visit North Pole or South Pole</li>
  <li>▢ See Machu Picchu</li>
  <li>▢ See the Pyramids</li>
  <li>▢ See Petra</li>
  <li>▢ See the Alps</li>
  <li>✓ Raft in a river (Korea 2016)</li>
  <li>▢ Ski in Japan</li>
  <li>▢ Obtain open-water scuba diving license (2 dives away; favorite dive was <a href="/images/maldives_1.jpg">The Fish Factory</a> in the Maldives with my brother in 2017)</li>
  <li>✓ See the aurora (through a camera <a href="/images/aurora.jpg">lens</a>)</li>
  <li>✓ Surf in Hawaii (make sure to take a lesson next time)</li>
  <li>✓ See an Arsenal game (<a href="https://thethaovanhoa.vn/bong-da-anh/viet-nam-17-arsenal-mua-ban-thang-tai-my-dinh-n20130717160301211.htm">Mỹ Đình, 2013</a>)</li>
  <li>✓ Drive a long road trip (212 miles, Boston → Burlington)</li>
  <li>✓ Speak at high school graduation</li>
  <li>▢ Get into a PhD program</li>
  <li>▢ Finish the PhD</li>
  <!-- <li>✓ Publish first CS paper</li> -->
  <li>✓ Publish a paper at ACL</li>
  <!-- <li>✓ Get a publication citation</li>
  <li>✓ Get 50 publication citations</li>
  <li>▢ Get 200 publication citations</li> -->
  <li>▢ Get cited by mainstream news</li>
  <li>▢ Give an academic talk</li>
  <li>▢ Publish an open-source library</li>
  <!-- <li>▢ 500 Twitter followers</li> -->
  <li>✓ Win a Hackathon (x2)</li>
  <li>▢ Win a boxing spar (an amateur <a href="https://drive.google.com/file/d/1cVNtAFILR0Lqnm6aOrkhVwt_4uJ6Uco7/view?usp=sharing">combo snippet</a>)</li>
  <li>✓ Fall in love</li>
  <li>▢ Win a club tennis match</li>
  <li>▢ Bench 2 plates</li>
  <li>▢ Complete a half-marathon</li>
  <li>✓ Win a college rugby league (<a href="http://haverfordclerk.com/tag/angry-newts/">two-time champion</a> with the Angry Young Newts)</li>
  <li>▢ Attend Wimbledon</li>
  <li>▢ Attend Roland Garros</li>
  <li>✓ Watch a Seahawks game (Lincoln Financial Field 2019, Levi's Stadium 2025)</li>
  <li>▢ Watch a boxing match</li>
  <li>▢ Watch a pro League game</li>
  <li>▢ Go to a Billy Joel concert</li>
  <li>✓ Go to a Paul McCartney concert (thank you, Apple, for turning 50)</li>
  <li>▢ Learn the piano (took lessons for 2 months)</li>
  <li>▢ Take a dance class</li>
  <li>▢ Meet a top ATP tennis player (ideally Federer/Nadal/Del Potro/Cilic/Kyrgios)</li>
  <li>✓ Meet a president</li>
  <li>▢ Live by a lake or ocean</li>
  <li>▢ Buy my parents a vacation</li>
  <li>▢ Start a company</li>
  <li>▢ Cook Phở with ease</li>
  <li>✓ Achieve Diamond+ rank in TFT/League (<a href="https://lolchess.gg/profile/na/Agrestic-NA1/set6/lp_history">proof</a>)</li>
  <li>✓ Go to the Super Bowl</li>
</ol>

<script>
  // Count completed tasks
  const listItems = document.querySelectorAll("#bucket-list li");
  const completed = Array.from(listItems).filter(li => li.textContent.trim().startsWith("✓")).length;
  const total = listItems.length;
  document.getElementById("total-complete").textContent = `${completed}/${total}`;
</script>]]></content><author><name></name></author><summary type="html"><![CDATA[Things that are probably worth doing before I die, with a sprinkle of memorable life events.]]></summary></entry></feed>