<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://toooold.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://toooold.com/" rel="alternate" type="text/html" /><updated>2026-05-07T16:18:35+00:00</updated><id>https://toooold.com/feed.xml</id><title type="html">Toooold</title><subtitle>to code</subtitle><entry><title type="html">Can an LLM Formally Verify Your Code?</title><link href="https://toooold.com/2026/05/07/code-to-lean.html" rel="alternate" type="text/html" title="Can an LLM Formally Verify Your Code?" /><published>2026-05-07T00:00:00+00:00</published><updated>2026-05-07T00:00:00+00:00</updated><id>https://toooold.com/2026/05/07/code-to-lean</id><content type="html" xml:base="https://toooold.com/2026/05/07/code-to-lean.html"><![CDATA[<p>When a language model tells you “this function is correct,” how much should you trust it? The answer is: not very much, unless the claim comes with a machine-checked proof. This post describes a pipeline that asks an LLM to translate a Python function into Lean 4 and propose a correctness theorem, then runs five independent gates to decide whether to trust the result. The point is not the translation; it’s the gates.</p>

<h2 id="why-lean">Why Lean?</h2>

<p><a href="https://lean-lang.org/">Lean 4</a> is a dependently typed theorem prover and programming language. Like Coq or Isabelle, it lets you write mathematical proofs that a small trusted kernel checks mechanically. Lean 4 is also designed as a general-purpose language: definitions can be evaluated directly with <code class="language-plaintext highlighter-rouge">#eval</code> or compiled to native code, and it has a growing library ecosystem (<code class="language-plaintext highlighter-rouge">Batteries</code>, formerly <code class="language-plaintext highlighter-rouge">Std4</code>, and <code class="language-plaintext highlighter-rouge">Mathlib</code>).</p>

<p>The key property we care about: <strong>Lean cannot lie about its own axioms.</strong> Every theorem Lean accepts depends on an explicit, inspectable set of axioms, and a proof closed with <code class="language-plaintext highlighter-rouge">sorry</code> (Lean’s escape hatch, analogous to <code class="language-plaintext highlighter-rouge">admit</code> in Coq) is recorded as depending on <code class="language-plaintext highlighter-rouge">sorryAx</code>. Running <code class="language-plaintext highlighter-rouge">#print axioms theorem_name</code> after a successful compile reveals the full axiom dependency set. If <code class="language-plaintext highlighter-rouge">sorryAx</code> appears there, the proof is a placeholder: Lean “accepted” it the way a compiler accepts <code class="language-plaintext highlighter-rouge">todo!()</code> in Rust.</p>

<p>The trusted axiom set for computational theorems is small: <code class="language-plaintext highlighter-rouge">{propext, Classical.choice, Quot.sound}</code>. Anything beyond that is suspect.</p>
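<p>A minimal Python sketch of how such an allowlist check could look. It assumes the report line has the shape <code class="language-plaintext highlighter-rouge">'name' depends on axioms: [a, b]</code> (or <code class="language-plaintext highlighter-rouge">does not depend on any axioms</code>); the function names are hypothetical and are not taken from the pipeline’s code.</p>

```python
import re

# Allowlist taken from the axiom set named above.
ALLOWED_AXIOMS = {"propext", "Classical.choice", "Quot.sound"}

def parse_axioms(report: str) -> set[str]:
    """Extract axiom names from one `#print axioms` report (assumed format)."""
    if "does not depend on any axioms" in report:
        return set()
    match = re.search(r"depends on axioms:\s*\[(.*?)\]", report, re.DOTALL)
    if match is None:
        raise ValueError("unrecognized #print axioms output")
    return {name.strip() for name in match.group(1).split(",") if name.strip()}

def axioms_ok(report: str) -> bool:
    """True when every reported axiom is on the allowlist."""
    return parse_axioms(report).issubset(ALLOWED_AXIOMS)
```

<p>Under these assumptions, a report that lists <code class="language-plaintext highlighter-rouge">sorryAx</code> is rejected, while a theorem with no axiom dependencies at all is accepted.</p>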

<h3 id="a-one-minute-lean-example">A one-minute Lean example</h3>

<p>Here is a simple function and its correctness theorem in Lean 4:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">addOne</span> (<span class="n">n</span> : <span class="n">Nat</span>) : <span class="n">Nat</span> := <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span>

<span class="k">theorem</span> <span class="n">addOne_spec</span> (<span class="n">n</span> : <span class="n">Nat</span>) : <span class="n">addOne</span> <span class="n">n</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> := <span class="k">by</span>
  <span class="n">unfold</span> <span class="n">addOne</span>
  <span class="n">rfl</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">rfl</code> closes the goal because both sides reduce to the same expression. The type-checker verifies this without trusting the programmer’s intuition. Now consider a more interesting statement:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">theorem</span> <span class="n">addOne_pos</span> (<span class="n">n</span> : <span class="n">Nat</span>) : <span class="mi">0</span> <span class="o">&lt;</span> <span class="n">addOne</span> <span class="n">n</span> := <span class="k">by</span>
  <span class="n">unfold</span> <span class="n">addOne</span><span class="o">;</span> <span class="n">omega</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">omega</code> is a decision procedure for linear arithmetic over integers and natural numbers. The proof is still machine-checked: <code class="language-plaintext highlighter-rouge">omega</code> is a tactic that either produces a proof the kernel accepts or fails.</p>

<p>The leap from toy arithmetic to real code is what the pipeline attempts to automate.</p>

<hr />

<h2 id="the-motivating-example-hmac-tag-comparison">The Motivating Example: HMAC Tag Comparison</h2>

<p>Consider these two Python functions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">hmac</span>

<span class="c1"># vulnerable: early-exit byte loop
</span><span class="k">def</span> <span class="nf">token_verify_vulnerable</span><span class="p">(</span><span class="n">token</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">,</span> <span class="n">expected</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">token</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">expected</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">False</span>
    <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">token</span><span class="p">,</span> <span class="n">expected</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">a</span> <span class="o">!=</span> <span class="n">b</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>
    <span class="k">return</span> <span class="bp">True</span>

<span class="c1"># fixed: constant-time comparison
</span><span class="k">def</span> <span class="nf">token_verify_fixed</span><span class="p">(</span><span class="n">token</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">,</span> <span class="n">expected</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">hmac</span><span class="p">.</span><span class="n">compare_digest</span><span class="p">(</span><span class="n">token</span><span class="p">,</span> <span class="n">expected</span><span class="p">)</span>
</code></pre></div></div>

<p>The two are <strong>functionally equivalent</strong>: each returns <code class="language-plaintext highlighter-rouge">True</code> if and only if <code class="language-plaintext highlighter-rouge">token == expected</code>. A verifier that only proves functional correctness would green-light both.</p>

<p>But they differ in cost. The vulnerable implementation returns as soon as it finds a mismatched byte. An attacker who knows the token length can measure the comparison time and recover the correct token byte by byte: submit a candidate starting with <code class="language-plaintext highlighter-rouge">\x00</code>, then <code class="language-plaintext highlighter-rouge">\x01</code>, and so on; the candidate that takes longer to reject has a matching first byte. This is a textbook timing side-channel.</p>

<p>The Lean model in <code class="language-plaintext highlighter-rouge">RepoVerify/TokenVerify.lean</code> makes the distinction formal:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">-- Both implementations satisfy the functional theorem</span>
<span class="k">theorem</span> <span class="n">insecureEq_correct</span> (<span class="n">xs</span> <span class="n">ys</span> : <span class="n">List</span> <span class="n">Nat</span>) :
    <span class="n">insecureEq</span> <span class="n">xs</span> <span class="n">ys</span> <span class="o">=</span> <span class="n">true</span> <span class="o">↔</span> <span class="n">xs</span> <span class="o">=</span> <span class="n">ys</span> := <span class="k">by</span> <span class="o">...</span>

<span class="k">theorem</span> <span class="n">ctEq_correct</span> (<span class="n">xs</span> <span class="n">ys</span> : <span class="n">List</span> <span class="n">Nat</span>) :
    <span class="n">ctEq</span> <span class="n">xs</span> <span class="n">ys</span> <span class="o">=</span> <span class="n">true</span> <span class="o">↔</span> <span class="n">xs</span> <span class="o">=</span> <span class="n">ys</span> := <span class="k">by</span> <span class="o">...</span><span class="cd">

-- Only the fixed one satisfies the cost theorem</span>
<span class="k">theorem</span> <span class="n">ctEqCost_eq_length_when_same_length</span>
    (<span class="n">xs</span> <span class="n">ys</span> : <span class="n">List</span> <span class="n">Nat</span>) (<span class="n">h</span> : <span class="n">xs</span><span class="o">.</span><span class="n">length</span> <span class="o">=</span> <span class="n">ys</span><span class="o">.</span><span class="n">length</span>) :
    <span class="n">ctEqCost</span> <span class="n">xs</span> <span class="n">ys</span> <span class="o">=</span> <span class="n">xs</span><span class="o">.</span><span class="n">length</span> := <span class="k">by</span> <span class="o">...</span><span class="cd">

-- The leak: vulnerable cost depends on content, not just length</span>
<span class="k">example</span> : <span class="n">insecureEqCost</span> [<span class="mi">0</span>, <span class="mi">0</span>] [<span class="mi">1</span>, <span class="mi">0</span>] <span class="o">=</span> <span class="mi">1</span> := <span class="k">by</span> <span class="n">decide</span>
<span class="k">example</span> : <span class="n">insecureEqCost</span> [<span class="mi">0</span>, <span class="mi">0</span>] [<span class="mi">0</span>, <span class="mi">1</span>] <span class="o">=</span> <span class="mi">2</span> := <span class="k">by</span> <span class="n">decide</span>
</code></pre></div></div>

<p>The lesson: <strong>a formally correct theorem can still miss the security property that matters.</strong> You have to ask whether you proved the <em>right</em> theorem, not just <em>a</em> theorem. Running <code class="language-plaintext highlighter-rouge">python source/attack_demo.py</code> demonstrates deterministic recovery of the full secret tag from the vulnerable implementation, in a number of guesses that grows linearly with the tag length (at most 256 candidates per byte position) instead of exponentially.</p>
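<p>The byte-by-byte recovery can be sketched with an idealized cost model instead of real timing. This is not the repository’s <code class="language-plaintext highlighter-rouge">attack_demo.py</code>; the names below are hypothetical, and the “timing” is simply the number of byte comparisons, mirroring the <code class="language-plaintext highlighter-rouge">insecureEqCost</code> examples above. The attacker is assumed to know the tag length and to observe both the accept/reject result and the cost.</p>

```python
def insecure_eq_cost(token: bytes, expected: bytes) -> int:
    """Number of byte comparisons the early-exit loop performs."""
    if len(token) != len(expected):
        return 0
    cost = 0
    for a, b in zip(token, expected):
        cost += 1
        if a != b:
            break
    return cost

def make_oracle(secret: bytes):
    """Hypothetical attacker's view: accept/reject plus the observed cost."""
    def probe(token: bytes):
        return token == secret, insecure_eq_cost(token, secret)
    return probe

def recover_secret(probe, length: int) -> bytes:
    """Recover the secret one position at a time, at most 256 probes each."""
    known = bytearray(length)
    for i in range(length):
        for guess in range(256):
            known[i] = guess
            ok, cost = probe(bytes(known))
            # Wrong byte at position i: the loop stops after i + 1 comparisons.
            # Right byte: it runs further, or the whole token is accepted.
            if ok or cost > i + 1:
                break
    return bytes(known)
```

<p>The last byte cannot be told apart by cost alone in this model (a full-length match and a final-byte mismatch both cost <code class="language-plaintext highlighter-rouge">length</code> comparisons), which is why the sketch also uses the accept/reject result.</p>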

<hr />

<h2 id="the-pipeline-code--llm--lean--five-gates">The Pipeline: Code → LLM → Lean → Five Gates</h2>

<p>The <code class="language-plaintext highlighter-rouge">code2lean</code> pipeline generalizes this question. Given any Python function:</p>

<ol>
  <li>
    <p><strong>AST extraction</strong> — <code class="language-plaintext highlighter-rouge">verify/extract.py</code> pulls out the function body, argument types, and return type using Python’s <code class="language-plaintext highlighter-rouge">ast</code> module and packages them into a <code class="language-plaintext highlighter-rouge">FunctionSpec</code>.</p>
  </li>
  <li>
    <p><strong>LLM proposer</strong> — the function is sent to an LLM (GPT-5.5, Gemini 3.1 Pro, or Claude Opus 4.7) with a structured prompt asking it to write a complete Lean 4 file: the function definition, a correctness theorem, and a proof. The LLM picks the theorem statement freely; only the namespace and naming convention are fixed.</p>
  </li>
  <li>
    <p><strong>Five validation gates:</strong></p>
  </li>
</ol>

<table>
  <thead>
    <tr>
      <th>Gate</th>
      <th>What it checks</th>
      <th>LLM in loop?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A — sanitizer</td>
      <td>No forbidden tokens (<code class="language-plaintext highlighter-rouge">sorry</code>, <code class="language-plaintext highlighter-rouge">native_decide</code>, <code class="language-plaintext highlighter-rouge">#eval</code> outside diagnostics)</td>
      <td>No</td>
    </tr>
    <tr>
      <td>B — Lean compile</td>
      <td><code class="language-plaintext highlighter-rouge">lake env lean</code> type-checks the file; on failure the error is fed back to the LLM for repair (up to 3 rounds)</td>
      <td>Yes (repair only)</td>
    </tr>
    <tr>
      <td>C — axiom allowlist</td>
      <td><code class="language-plaintext highlighter-rouge">#print axioms</code> output contains only <code class="language-plaintext highlighter-rouge">{propext, Classical.choice, Quot.sound}</code></td>
      <td>No</td>
    </tr>
    <tr>
      <td>D — differential test</td>
      <td>Lean <code class="language-plaintext highlighter-rouge">#eval</code> outputs match Python results on every fixture case</td>
      <td>No</td>
    </tr>
    <tr>
      <td>E — critic</td>
      <td>A second LLM judges whether the theorem is strong enough (PASS / WEAK / FAIL)</td>
      <td>Yes</td>
    </tr>
  </tbody>
</table>

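<p>Gates A–D are mechanical: gate B calls the LLM only to repair a file that failed to compile, and the accept/reject decision is still Lean’s. The only LLM judgment in the <strong>verification</strong> path is the critic (gate E), and its job is narrow: decide whether the theorem is vacuous.</p>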
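<p>To give a sense of how little machinery a mechanical gate needs, here is a rough sketch of a gate A style sanitizer. The token list follows the table above, but the function names are hypothetical, every <code class="language-plaintext highlighter-rouge">#eval</code> is flagged (the “outside diagnostics” exception is not modeled), and only <code class="language-plaintext highlighter-rouge">--</code> line comments are stripped.</p>

```python
import re

# Hypothetical forbidden-token list, following the gate A row in the table.
FORBIDDEN = ("sorry", "native_decide", "#eval")

def strip_comments(lean_source: str) -> str:
    """Drop `--` line comments; block comments are not handled in this sketch."""
    return "\n".join(line.split("--", 1)[0] for line in lean_source.splitlines())

def find_forbidden(lean_source: str) -> list[str]:
    """Return the forbidden tokens that occur as whole words in the code."""
    code = strip_comments(lean_source)
    hits = []
    for token in FORBIDDEN:
        # Whole-word match, so `sorry` does not fire on an identifier like `sorryFree`.
        pattern = r"(?:^|[^\w#])" + re.escape(token) + r"(?!\w)"
        if re.search(pattern, code):
            hits.append(token)
    return hits
```

<p>A textual check like this is cheap but easy to evade, which is one reason the axiom allowlist in gate C is checked separately on Lean’s own output.</p>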

<h3 id="why-the-critic-matters-the-vacuous-theorem-problem">Why the critic matters: the vacuous theorem problem</h3>

<p>Consider <code class="language-plaintext highlighter-rouge">bit_count8</code>, which counts set bits in a byte. An LLM proposer might write:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">theorem</span> <span class="n">bit_count8_spec</span> (<span class="n">b</span> : <span class="n">Nat</span>) (<span class="n">h</span> : <span class="n">b</span> <span class="o">&lt;</span> <span class="mi">256</span>) :
    <span class="n">bit_count8</span> <span class="n">b</span> <span class="o">≤</span> <span class="mi">8</span> := <span class="k">by</span> <span class="o">...</span>
</code></pre></div></div>

<p>This theorem is true. Lean accepts it. Gates A–D all pass. But a constant-zero implementation (<code class="language-plaintext highlighter-rouge">bit_count8 b := 0</code>) also satisfies <code class="language-plaintext highlighter-rouge">result ≤ 8</code>. The theorem proves nothing about what the function <em>computes</em>.</p>

<p>The critic prompt says: “Would this theorem distinguish a correct implementation from a buggy one? If not, return WEAK.” A good proposer writes instead:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">theorem</span> <span class="n">bit_count8_spec</span> (<span class="n">b</span> : <span class="n">Nat</span>) (<span class="n">h</span> : <span class="n">b</span> <span class="o">&lt;</span> <span class="mi">256</span>) :
    <span class="n">bit_count8</span> <span class="n">b</span> <span class="o">=</span> (<span class="n">List</span><span class="o">.</span><span class="n">range</span> <span class="mi">8</span>)<span class="o">.</span><span class="n">countP</span> (<span class="k">fun</span> <span class="n">i</span> <span class="o">=&gt;</span> <span class="n">b</span> <span class="o">&amp;&amp;&amp;</span> (<span class="mi">1</span> <span class="o">&lt;&lt;&lt;</span> <span class="n">i</span>) <span class="o">≠</span> <span class="mi">0</span>) := <span class="k">by</span> <span class="o">...</span>
</code></pre></div></div>

<p>This is a functional specification. Any implementation that returns a wrong bit count will fail it.</p>
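<p>The difference between the two theorem shapes can be checked by brute force in Python, since a byte has only 256 values. This is an illustrative stand-in for the Lean statements, not a proof; the buggy mutant and the helper names are hypothetical.</p>

```python
def bit_count8(b: int) -> int:
    """Reference implementation: count set bits in a byte."""
    return bin(b).count("1")

def bit_count8_buggy(b: int) -> int:
    """Mutant: ignores its input entirely."""
    return 0

def weak_spec(f) -> bool:
    # The bounds-only theorem: the result never exceeds 8.
    return all(f(b) <= 8 for b in range(256))

def strong_spec(f) -> bool:
    # The functional theorem: the result equals the number of set bit positions.
    return all(
        f(b) == sum(1 for i in range(8) if (b & (1 << i)) != 0)
        for b in range(256)
    )
```

<p>Both implementations satisfy <code class="language-plaintext highlighter-rouge">weak_spec</code>, but only the reference one satisfies <code class="language-plaintext highlighter-rouge">strong_spec</code>, which is exactly the distinction the critic is asked to make.</p>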

<hr />

<h2 id="a-full-walkthrough-insecure_compare">A Full Walkthrough: <code class="language-plaintext highlighter-rouge">insecure_compare</code></h2>

<p><code class="language-plaintext highlighter-rouge">examples/01_insecure_compare/source.py</code> contains:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">insecure_compare</span><span class="p">(</span><span class="n">a</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">b</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">False</span>
    <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">x</span> <span class="o">!=</span> <span class="n">y</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>
    <span class="k">return</span> <span class="bp">True</span>
</code></pre></div></div>
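<p>For a function like this, the extraction step can be approximated with a few lines of Python’s <code class="language-plaintext highlighter-rouge">ast</code> module. The <code class="language-plaintext highlighter-rouge">FunctionSpec</code> fields below are a guess for illustration; the real class in <code class="language-plaintext highlighter-rouge">verify/extract.py</code> may look different. <code class="language-plaintext highlighter-rouge">ast.unparse</code> needs Python 3.9 or later.</p>

```python
import ast
from dataclasses import dataclass

@dataclass
class FunctionSpec:
    """Hypothetical container; the real FunctionSpec may carry other fields."""
    name: str
    arg_types: dict
    return_type: str
    body_source: str

def extract_spec(source: str, func_name: str) -> FunctionSpec:
    """Find one function by name and collect its signature and body."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            args = {
                a.arg: ast.unparse(a.annotation) if a.annotation else "Any"
                for a in node.args.args
            }
            ret = ast.unparse(node.returns) if node.returns else "Any"
            body = "\n".join(ast.unparse(stmt) for stmt in node.body)
            return FunctionSpec(node.name, args, ret, body)
    raise LookupError(func_name)
```

<p>Working from the AST rather than raw text means comments and formatting never reach the prompt, only the typed signature and the normalized body.</p>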

<p>The fixture in <code class="language-plaintext highlighter-rouge">fixture.py</code> provides 6 test cases: equal pairs, different-length pairs, one-off pairs, empty inputs.</p>

<p>A one-shot GPT-5.5 proposal (from <code class="language-plaintext highlighter-rouge">last_lean_openai.lean</code>):</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">insecureCompare</span> (<span class="n">a</span> <span class="n">b</span> : <span class="n">List</span> <span class="n">Nat</span>) : <span class="n">Bool</span> :=
  <span class="n">if</span> <span class="n">a</span><span class="o">.</span><span class="n">length</span> <span class="o">≠</span> <span class="n">b</span><span class="o">.</span><span class="n">length</span> <span class="n">then</span> <span class="n">false</span>
  <span class="n">else</span> <span class="n">a</span><span class="o">.</span><span class="n">zip</span> <span class="n">b</span> <span class="o">|&gt;.</span><span class="n">all</span> (<span class="k">fun</span> (<span class="n">x</span>, <span class="n">y</span>) <span class="o">=&gt;</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span>)

<span class="k">theorem</span> <span class="n">insecureCompare_correct</span> (<span class="n">a</span> <span class="n">b</span> : <span class="n">List</span> <span class="n">Nat</span>) :
    <span class="n">insecureCompare</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="n">true</span> <span class="o">↔</span> <span class="n">a</span> <span class="o">=</span> <span class="n">b</span> := <span class="k">by</span>
  <span class="n">simp</span> [<span class="n">insecureCompare</span>]
  <span class="n">constructor</span>
  <span class="err">·</span> <span class="n">intro</span> <span class="n">h</span>
    <span class="n">exact</span> <span class="n">List</span><span class="o">.</span><span class="n">zip_eq_iff_eq</span><span class="o">.</span><span class="n">mp</span> (<span class="n">List</span><span class="o">.</span><span class="n">all_zip_eq_true</span><span class="o">.</span><span class="n">mp</span> <span class="n">h</span>)
  <span class="err">·</span> <span class="n">intro</span> <span class="n">h</span><span class="o">;</span> <span class="n">subst</span> <span class="n">h</span><span class="o">;</span> <span class="n">simp</span> [<span class="n">List</span><span class="o">.</span><span class="n">all_zip_eq_true</span>]
</code></pre></div></div>

<p>Gate A passes (no forbidden tokens). Gate B passes on the first attempt. Gate C reports <code class="language-plaintext highlighter-rouge">{propext, Classical.choice, Quot.sound}</code> — clean. Gate D: all 6 fixture cases match. Gate E: the critic returns <strong>PASS</strong> — the <code class="language-plaintext highlighter-rouge">↔ a = b</code> biconditional fully pins down the function’s behavior.</p>
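<p>The comparison behind gate D can be sketched as follows. The Lean side is represented by a list of output strings, standing in for what the <code class="language-plaintext highlighter-rouge">#eval</code> lines would print; the fixture cases and helper names are hypothetical and do not reproduce <code class="language-plaintext highlighter-rouge">fixture.py</code>.</p>

```python
def insecure_compare(a: bytes, b: bytes) -> bool:
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

# Hypothetical fixture cases; the real fixture.py may differ.
CASES = [(b"", b""), (b"abc", b"abc"), (b"abc", b"abd"), (b"abc", b"ab")]

def parse_lean_bool(text: str) -> bool:
    """Lean's #eval prints Bool values as `true` or `false`."""
    return {"true": True, "false": False}[text.strip()]

def differential_ok(py_func, cases, lean_outputs) -> bool:
    """True only if every Lean output agrees with Python on the same case."""
    if len(cases) != len(lean_outputs):
        return False
    return all(
        py_func(*args) == parse_lean_bool(out)
        for args, out in zip(cases, lean_outputs)
    )
```

<p>A single disagreement fails the gate, which is how a Lean definition that type-checks but models the wrong function gets caught.</p>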

<p>What this does <em>not</em> prove: that the comparison is constant-time. Exactly as designed. The pipeline proves what it can prove; the cost property requires a separate cost model, which is future work (see <code class="language-plaintext highlighter-rouge">docs/roadmap.md</code>).</p>

<hr />

<h2 id="benchmarks-which-llm-proposes-better-theorems">Benchmarks: Which LLM Proposes Better Theorems?</h2>

<p>We ran three proposers on the same 10 hard examples:</p>

<table>
  <thead>
    <tr>
      <th>Proposer</th>
      <th>Critic</th>
      <th>Lean acceptance</th>
      <th>Theorem PASS</th>
      <th>Wall-clock (10 runs)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Gemini 3.1 Pro</td>
      <td>GPT-5.5</td>
      <td><strong>10/10</strong></td>
      <td>0/10</td>
      <td>~38.5 min</td>
    </tr>
    <tr>
      <td>Claude Opus 4.7</td>
      <td>GPT-5.5</td>
      <td>8/10</td>
      <td>2/10</td>
      <td><strong>~8.5 min</strong></td>
    </tr>
    <tr>
      <td>GPT-5.5</td>
      <td>Claude Opus 4.7</td>
      <td>9/10</td>
      <td><strong>6/10</strong></td>
      <td>~30.3 min</td>
    </tr>
  </tbody>
</table>

<p>The table reveals a three-way trade-off:</p>

<p><strong>Lean acceptance</strong> (did the proof close?) is mostly a Lean-tactics signal. Gemini won it, largely by leaning on <code class="language-plaintext highlighter-rouge">Classical.choice</code> for existential witnesses. GPT needed more repair rounds but got there on most examples.</p>

<p><strong>Theorem strength</strong> (did the critic approve?) is mostly a proposer signal. GPT-5.5 naturally reaches for tight functional specs: it reuses library definitions (<code class="language-plaintext highlighter-rouge">Nat.lcm</code>, <code class="language-plaintext highlighter-rouge">Nat.gcd</code>), defines auxiliary helpers to model Python floor-division semantics, and writes biconditionals. Gemini and Claude default to easier targets: range bounds, definitional unfolds, set-membership instead of list-equality. One caveat when reading the PASS column: the critic is not the same across rows (GPT-5.5 judged Gemini and Claude, while Claude Opus 4.7 judged GPT-5.5), so the comparison is not strictly like-for-like.</p>

<p><strong>Speed</strong> is a latency signal. Claude’s wall-clock advantage (roughly 3.5× over GPT-5.5 and 4.5× over Gemini, per the table) comes from the absence of hidden reasoning tokens (GPT and Gemini burn 4–8k reasoning tokens per hard example; Claude does not in the standard Messages API).</p>

<p>Four examples — <code class="language-plaintext highlighter-rouge">bit_count8</code>, <code class="language-plaintext highlighter-rouge">is_power_of_two</code>, <code class="language-plaintext highlighter-rouge">list_filter_even</code>, <code class="language-plaintext highlighter-rouge">list_unique</code> — were WEAK across all three proposers. The WEAK reasons were nearly identical: bounds-only theorems, membership instead of equality, etc. This suggests the bottleneck on those four is the prompt’s theorem-shape guidance, not the model.</p>

<hr />

<h2 id="what-this-is-and-isnt">What This Is (and Isn’t)</h2>

<p><strong>It is:</strong> a concrete, runnable pipeline that chains LLM proposers with mechanical Lean verification and a structured critic. It demonstrates that for the class of pure, total, simply-typed Python functions, automated Lean translation and verification is feasible today.</p>

<p><strong>It isn’t:</strong> a security scanner for arbitrary production code. The current scope is intentionally narrow — single functions, no I/O, no external dependencies, no floats or dicts, no concurrency. The cost/side-channel theorems that would catch timing leaks are not yet auto-derived; the HMAC demo uses a hand-written Lean baseline.</p>

<p>The path from here to vulnerability finding runs through two open problems: (1) auto-deriving cost models alongside functional ones, and (2) mutation-kill testing to mechanically verify that proposed theorems distinguish correct from buggy implementations. Both are on the roadmap.</p>

<p>The main message from the benchmarks: <strong>the gates work.</strong> Gate D (differential testing) catches mistranslations that Lean would otherwise silently accept. Gate E (critic) catches vacuous theorems that A–D miss. And gate C (axiom allowlist) ensures that a <code class="language-plaintext highlighter-rouge">sorry</code>-stuffed proof can’t sneak through as “verified.” The combination is more trustworthy than any single check — and considerably more trustworthy than asking an LLM whether its own output is correct.</p>

<hr />

<p><em>Code, examples, and run artifacts: <a href="https://github.com/phunterlau/code2lean">github.com/phunterlau/code2lean</a></em></p>

<hr />

<h2 id="references">References</h2>

<p><strong>Lean 4 and tooling</strong></p>

<ul>
  <li>de Moura, L. &amp; Ullrich, S. <a href="https://lean-lang.org/papers/lean4.pdf">The Lean 4 Theorem Prover and Programming Language</a>. CADE 2021.</li>
  <li>Mathlib Community. <a href="https://github.com/leanprover-community/mathlib4">Mathlib4</a>.</li>
</ul>

<p><strong>LLM-assisted theorem proving</strong></p>

<ul>
  <li>Polu, S. &amp; Han, J.M. <a href="https://arxiv.org/abs/2009.03393">Generative Language Modeling for Automated Theorem Proving</a>. arXiv:2009.03393, 2020. (GPT-f)</li>
  <li>Jiang, A.Q. et al. <a href="https://arxiv.org/abs/2210.12283">Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs</a>. arXiv:2210.12283, ICLR 2023.</li>
  <li>Yang, K. et al. <a href="https://arxiv.org/abs/2306.15626">LeanDojo: Theorem Proving with Retrieval-Augmented Language Models</a>. arXiv:2306.15626, NeurIPS 2023.</li>
  <li>Xin, H. et al. <a href="https://arxiv.org/abs/2405.14333">DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data</a>. arXiv:2405.14333, 2024.</li>
  <li>Trinh, T.H. et al. <a href="https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/">AI achieves silver-medal standard solving International Mathematical Olympiad problems</a>. Google DeepMind, 2024. (AlphaProof)</li>
</ul>

<p><strong>LLM + code formal verification</strong></p>

<ul>
  <li>Misu, M.J. et al. <a href="https://arxiv.org/abs/2402.00247">Towards AI-Assisted Synthesis of Verified Dafny Methods</a>. arXiv:2402.00247, FSE 2024.</li>
  <li>Pei, K. et al. <a href="https://proceedings.mlr.press/v202/pei23a.html">Can Large Language Models Reason about Program Invariants?</a>. ICML 2023.</li>
</ul>

<p><strong>Verified security-critical software</strong></p>

<ul>
  <li>Zinzindohoué, J.-K. et al. <a href="https://dl.acm.org/doi/10.1145/3133956.3134043">HACL*: A Verified Modern Cryptographic Library</a>. CCS 2017.</li>
  <li>Klein, G. et al. <a href="https://dl.acm.org/doi/10.1145/1629575.1629596">seL4: Formal Verification of an OS Kernel</a>. SOSP 2009.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[When a language model tells you “this function is correct,” how much should you trust it? The answer is: not very much — unless the claim comes with a machine-checked proof. This post describes a pipeline that asks an LLM to translate a Python function into Lean 4, proposes a correctness theorem, and then runs five independent gates to decide whether to trust the result. The point is not the translation; it’s the gates.]]></summary></entry><entry><title type="html">The Blessing of Dimensionality: How TurboQuant Uses the JL Lemma to Compress KV Caches with Zero Bias</title><link href="https://toooold.com/2026/03/28/turboquant.html" rel="alternate" type="text/html" title="The Blessing of Dimensionality: How TurboQuant Uses the JL Lemma to Compress KV Caches with Zero Bias" /><published>2026-03-28T00:00:00+00:00</published><updated>2026-03-28T00:00:00+00:00</updated><id>https://toooold.com/2026/03/28/turboquant</id><content type="html" xml:base="https://toooold.com/2026/03/28/turboquant.html"><![CDATA[<p>If you are running local LLMs, you already know the bottleneck isn’t compute; it’s memory. Specifically, the KV cache. As your context window grows, storing Keys and Values for every token eats your VRAM alive. On a standard 16GB consumer GPU, you are typically hard-capped around an 8K context length after loading the model weights.</p>

<p>Standard quantization (like INT4 or FP8) helps, but it introduces a fatal flaw: deterministic bias. When you round a vector down to a lower precision, the errors accumulate, distorting the delicate attention matrix and crippling the model’s reasoning capabilities at long contexts.</p>

<p>Enter <a href="https://arxiv.org/abs/2504.19874">TurboQuant</a>. By leveraging a beautiful piece of high-dimensional geometry known as the Johnson-Lindenstrauss (JL) Lemma, TurboQuant compresses the KV cache down to 3 bits (or even 2 bits) with practically zero accuracy loss. It achieves a 5x memory reduction without retraining, turning that 8K context ceiling into a 40K playground on the exact same hardware.</p>

<p>The secret isn’t just compression; it’s <em>unbiased</em> compression. To understand why it works so flawlessly, we have to trace the math back to its bedrock.</p>

<p><img src="/images/turboquant.jpg" alt="alt text" /></p>

<h2 id="the-anchor-cauchy-schwarz-and-the-geometry-of-attention">The Anchor: Cauchy-Schwarz and the Geometry of Attention</h2>

<p>To understand the elegant hack of TurboQuant, we have to stop thinking of Attention as a sequence of matrix multiplications and start looking at it as a measure of geometric similarity.</p>

<p>The core of the Transformer’s attention mechanism relies entirely on the inner product between a Query vector and a Key vector. This relationship is governed by the absolute dictator of inner products: the Cauchy-Schwarz inequality.</p>

\[|\langle q, k \rangle| \leq \|q\|_2 \|k\|_2\]

<p>In a high-dimensional space (e.g., 4096 dimensions), Cauchy-Schwarz bounds the dot product by the product of the norms. For fixed norms, the dot product is maximized when the vectors are perfectly aligned, and it is zero when they are perfectly orthogonal. This geometric alignment is exactly what the Softmax function turns into an “Attention Score.” If you distort this geometric relationship by clumsily rounding the numbers, the attention weights shift and the model’s output degrades.</p>

<h2 id="the-chasm-and-the-bridge-the-polarization-identity">The Chasm and The Bridge: The Polarization Identity</h2>

<p>We want to squash our 4096-dimensional vectors down to a much smaller size (say, 256 dimensions) to save VRAM.</p>

<p>The Johnson-Lindenstrauss (JL) Lemma is famous for proving that you can project points into a lower dimension using a completely random matrix, and the <em>Euclidean distances</em> between those points will be approximately preserved, up to a small relative error $\epsilon$. But here is the catch: the Transformer doesn’t compute Euclidean distances. It computes inner products.</p>

<p>How do we bridge the gap between preserving distances (JL) and computing attention scores (Cauchy-Schwarz)? We use a beautiful piece of algebra called the Polarization Identity:</p>

\[\langle q, k \rangle = \frac{1}{4} \left( \|q+k\|_2^2 - \|q-k\|_2^2 \right)\]

<p>This identity is the “aha!” moment. It shows that a dot product is not some separate, mystical property: it is literally just a function of lengths. It expresses the inner product entirely in terms of the squared norms of the sum and the difference of the vectors.</p>

<h2 id="the-leap-applying-the-random-shadow">The Leap: Applying the Random Shadow</h2>

<p>Now the trap is sprung. We introduce a random projection matrix $\Phi$ that squashes our vectors from a high dimension $d$ down to a lower dimension $m$. Because this projection is a linear operation, we know that $\Phi(q+k) = \Phi q + \Phi k$.</p>

<p>The JL Lemma guarantees that, with high probability, our random shadow $\Phi$ preserves squared Euclidean distances up to a factor of $(1 \pm \epsilon)$:</p>

\[(1-\epsilon)\|x - y\|_2^2 \leq \|\Phi x - \Phi y\|_2^2 \leq (1+\epsilon)\|x - y\|_2^2\]

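<p>If JL preserves the lengths $\|q+k\|_2$ and $\|q-k\|_2$ up to a factor of $(1 \pm \epsilon)$, then according to the Polarization Identity, it mathematically <em>must</em> also approximately preserve the inner product $\langle q, k \rangle$: substituting the two bounds into the identity gives $|\langle \Phi q, \Phi k \rangle - \langle q, k \rangle| \leq \frac{\epsilon}{2}\left(\|q\|_2^2 + \|k\|_2^2\right)$. The deterministic bounds of Cauchy-Schwarz are protected, up to this small additive error, by the probabilistic bounds of the JL Lemma.</p>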
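<p>The argument can be checked numerically. The sketch below assumes a dense Gaussian projection scaled by $1/\sqrt{m}$ (one common way to build a JL matrix, chosen here for illustration) and toy sizes $d = 1024$, $m = 256$; none of these choices come from TurboQuant itself.</p>

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(x):
    """Rescale x to unit Euclidean length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

def random_projection(d, m, rng):
    """m x d Gaussian matrix scaled by 1/sqrt(m), so squared norms are preserved on average."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(m)]

def project(phi, x):
    return [dot(row, x) for row in phi]

rng = random.Random(0)
d, m = 1024, 256  # toy sizes (assumption)
q = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
noise = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
k = unit([a + b for a, b in zip(q, noise)])  # a Key correlated with the Query

phi = random_projection(d, m, rng)
true_ip = dot(q, k)
sketch_ip = dot(project(phi, q), project(phi, k))
print(true_ip, sketch_ip)  # close, but not identical: JL is approximate
```

<p>The gap between the two printed numbers shrinks roughly like $1/\sqrt{m}$, which is the $\epsilon$ of the lemma showing up in practice.</p>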

<h2 id="the-qjl-innovation-1-bit-unbiased-estimation">The QJL Innovation: 1-Bit Unbiased Estimation</h2>

<p>Standard quantization methods (like rounding to the nearest integer) introduce bias. They systematically shift vectors in a specific direction.</p>

<p>TurboQuant utilizes <a href="https://arxiv.org/abs/2406.03482">Quantized JL (QJL)</a>, which takes this random projection one step further. Instead of storing the exact projected values, it stores only the <em>sign</em> of the projection, turning the Key into a compact vector with one bit per projected coordinate:</p>

\[h(k) = \text{sign}(\Phi k)\]

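<p>Taking the sign throws away the Key’s length, so QJL is asymmetric: the Query is projected with the same $\Phi$ but left unquantized, and the estimate is rescaled by $\|k\|_2$. Because the projection $\Phi$ is completely random and data-oblivious, the errors introduced by taking the sign average out instead of piling up in one direction. This makes QJL an <em>unbiased estimator</em>. In statistical terms, the expected value of our compressed dot product equals the true dot product:</p>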

\[\mathbb{E}[\widehat{\langle q, k \rangle}] = \langle q, k \rangle\]
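<p>A minimal sketch of such an estimator, assuming a Gaussian $\Phi$ and the asymmetric form in which only the Key is reduced to sign bits while the projected Query stays in full precision. The scaling constant $\sqrt{\pi/2}\,\|k\|_2 / m$ is what makes the expectation come out right for Gaussian rows; the function names and toy sizes are illustrative, and $m$ is deliberately huge here so the averaging is visible, not because a real cache would use it.</p>

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(x):
    return 1.0 if x >= 0.0 else -1.0

def qjl_encode_key(rows, k):
    """Keep only the sign bit of each projected coordinate, plus the key norm."""
    bits = [sign(dot(row, k)) for row in rows]
    return bits, math.sqrt(dot(k, k))

def qjl_inner_product(rows, q, key_bits, key_norm):
    """Asymmetric estimate: full-precision projected query against 1-bit key."""
    m = len(rows)
    acc = sum(dot(row, q) * b for row, b in zip(rows, key_bits))
    return math.sqrt(math.pi / 2.0) * key_norm * acc / m

rng = random.Random(0)
d, m = 16, 20000  # toy sizes (assumption); large m only to show the average
rows = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]
k = [rng.gauss(0.0, 1.0) for _ in range(d)]

bits, norm_k = qjl_encode_key(rows, k)
print(dot(q, k), qjl_inner_product(rows, q, bits, norm_k))
```

<p>Each row contributes a noisy but unbiased vote; the estimate is their average, so its variance falls as $1/m$ while its mean stays at $\langle q, k \rangle$.</p>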

<p>TurboQuant achieves its “zero accuracy loss” by applying this 1-bit unbiased QJL code specifically to the <em>residual error</em> left over after a standard coarse quantization. The math preserves the geometry in expectation, which in turn preserves the model’s reasoning.</p>

<h2 id="the-magic-of-the-logarithmic-scale">The Magic of the Logarithmic Scale</h2>

<p>If you want to find the true elegance of this approach, look closely at the scaling law it produces. To preserve the geometric structure of $N$ tokens in the KV cache with an error tolerance of $\epsilon$, the target dimension $m$ must scale according to:</p>

\[m = \mathcal{O}\left(\frac{\log N}{\epsilon^2}\right)\]

<p>Notice what is missing from that equation: the original dimension $d$.</p>

<p>The memory footprint required to preserve the attention matrix depends on the number of tokens in your context window <em>only</em> through $\log N$ (and on the tolerance $\epsilon$). It completely decouples the memory requirements of the KV cache from the width of the model’s architecture.</p>

<p>Because it scales logarithmically, jumping from an 8K context window to a 40K context window only requires a tiny, incremental bump in the projection dimension: $\ln(40{,}000)/\ln(8{,}000) \approx 1.18$, so roughly 18% more dimensions for five times the tokens. By anchoring its logic in Cauchy-Schwarz and exploiting the dimensional shortcuts of the JL Lemma, TurboQuant turns an intractable hardware limit into a solved geometric puzzle.</p>
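<p>To put rough numbers on the scaling law, the sketch below plugs two context lengths into one classical explicit form of the JL bound, $m \geq 4 \ln N / (\epsilon^2/2 - \epsilon^3/3)$. The constant and the tolerance $\epsilon = 0.5$ are assumptions for illustration; the post only relies on the $\mathcal{O}(\log N / \epsilon^2)$ shape, and worst-case constants like this one are conservative.</p>

```python
import math

def jl_min_dim(n_points, eps):
    """Target dimension from the bound m = 4 ln(N) / (eps^2/2 - eps^3/3), rounded up."""
    return math.ceil(4.0 * math.log(n_points) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))

eps = 0.5  # illustrative tolerance only
m_8k = jl_min_dim(8_000, eps)
m_40k = jl_min_dim(40_000, eps)
print(m_8k, m_40k, m_40k / m_8k)  # the ratio stays far below the 5x growth in tokens
```

<p>Note that the original dimension $d$ is not even an argument of <code class="language-plaintext highlighter-rouge">jl_min_dim</code>.</p>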

<h3 id="final-thoughts-and-the-future-randomness-is-the-answer"><strong>Final Thoughts and the Future: Randomness is the Answer</strong></h3>

<p>If I have one takeaway from this work, it is that <em>unbiased estimation is not optional</em> for low-bit quantization. Standard quantization methods (rounding) are a blunt instrument that creates systematic biases, which distort the delicate attention matrix. By combining coarse quantization for low variance with unbiased QJL residual coding for zero bias, TurboQuant gets the best of both.</p>

<p>This highlights the true gift of high-dimensional geometry: the <strong>Concentration of Measure</strong>. When you have thousands of redundant dimensions, pure randomness ceases to be noise; it becomes a statistically reliable tool. Concentration makes it overwhelmingly probable that a random projection $\Phi$ works, and the guarantee depends on the number of projected coordinates $m$, not on the original dimension $d$. The path forward for memory optimization isn’t just about shrinking the precision; it’s about squashing the dimension, while protecting the relative geometry.</p>

<h4 id="implications--future-horizons">Implications &amp; Future Horizons</h4>

<p>The immediate practical implication is the democratization of long-context LLMs. Massive models can now be run locally with large context (80k+) on a standard 16GB or 24GB consumer GPU. But this work can inspire even deeper changes:</p>

<p><strong>Hardware for Randomness:</strong> Our current GPUs are optimized for deterministic floating-point arithmetic. Could we see hardware accelerators in the future with dedicated modules for fast, on-the-fly random projections?</p>

<p><strong>Unbiased Everything:</strong> Is there a way to adapt this residual QJL framework to weights or activations? While activations are dynamic and harder to compress offline, the focus on preserving <em>unbiased statistical correctives</em> could change how we approach low-bit model execution.</p>

<p>Ultimately, TurboQuant shows us that the Transformer is far more robust than we believed. Its internal representation is resilient to extreme compression, provided we don’t break the geometric rules that define its reasoning. We can let the model forget the absolute, as long as we help it remember the relationships. That is the final lesson of Cauchy-Schwarz and the JL Lemma.</p>

<h3 id="references">References</h3>
<ul>
  <li><strong>TurboQuant:</strong> <a href="https://arxiv.org/abs/2504.19874">Online Vector Quantization with Near-optimal Distortion Rate</a></li>
  <li><strong>QJL:</strong> <a href="https://arxiv.org/abs/2406.03482">1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead</a></li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[If you are running local LLMs, you already know the bottleneck isn’t compute; it’s memory. Specifically, the KV cache. As your context window grows, storing Keys and Values for every token eats your VRAM alive. On a standard 16GB consumer GPU, you are typically hard-capped around an 8K context length after loading the model weights.]]></summary></entry><entry><title type="html">Mozart rolls a dice to Bach and Ramanujan</title><link href="https://toooold.com/2026/03/16/m_b_r_game.html" rel="alternate" type="text/html" title="Mozart rolls a dice to Bach and Ramanujan" /><published>2026-03-16T00:00:00+00:00</published><updated>2026-03-16T00:00:00+00:00</updated><id>https://toooold.com/2026/03/16/m_b_r_game</id><content type="html" xml:base="https://toooold.com/2026/03/16/m_b_r_game.html"><![CDATA[<p><img src="/images/mozart_game.jpg" alt="alt text" /></p>

<h2 id="the-elegance-of-mozarts-attention-mechanism">The Elegance of Mozart’s Attention Mechanism</h2>

<p>In 1792, Mozart’s <em>Musikalisches Würfelspiel</em> (Musical Dice Game), K.516f, was published. The system is deceptively simple: 176 pre-composed musical measures arranged in a grid. The user rolls two six-sided dice ($2d6$) 16 times. Each roll selects a specific measure for that column in the grid, so the 16 rolls pick one of $11^{16} \approx 4.6 \times 10^{16}$ possible paths through the grid, each a 16-bar minuet.</p>

<p>From an LLM mechanistic interpretability standpoint, the beauty of Mozart’s game is that it is a <strong>strictly autoregressive, discrete-token generator with a context window of zero.</strong></p>

<p>In a standard Large Language Model (LLM), predicting the next token $x_t$ relies on the conditional probability of the entire past sequence:</p>

\[P(x_t | x_1, x_2, \dots, x_{t-1})\]

<p>Mozart bypassed the need for this computational overhead. In K.516f, the choice of Measure 3 has zero statistical dependence on Measure 2. The generation is completely memoryless. Instead, the model’s “attention” is 100% focused on its absolute positional encoding (the step $t$):
\(P(x_t | \text{position } t, \text{dice roll})\)</p>

<p>How does it remain harmonically coherent without context? Mozart engineered the matrix as an aggressive, hardcoded <strong>attention mask</strong>. He ensured that every possible measure at $t$ smoothly resolves into every possible measure at $t+1$. Any dissonant, harmonically invalid transition was manually assigned a $-\infty$ pre-softmax penalty by the composer, effectively masking it out of the latent space.</p>

<p>Furthermore, the $2d6$ sampling acts as a physical temperature parameter. By using a triangular probability distribution ($P(7) = 16.7\%$, $P(2) = 2.8\%$) rather than a uniform one, Mozart lowered the entropy of the system. He statistically biased the model to generate the most “standard” harmonic progressions, reserving high-surprise edge cases for the extreme tails of the distribution.</p>
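<p>The entropy claim is easy to verify. This small sketch enumerates all 36 outcomes of two fair dice and compares the Shannon entropy of the resulting distribution with that of a uniform choice among the same 11 rows; the helper name <code class="language-plaintext highlighter-rouge">entropy_bits</code> is illustrative.</p>

```python
import math
from collections import Counter
from fractions import Fraction

# Exact distribution of the sum of two fair six-sided dice.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
two_d6 = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

# Reference: every one of the 11 rows equally likely.
uniform = {total: Fraction(1, 11) for total in two_d6}

def entropy_bits(dist):
    """Shannon entropy in bits of a finite probability distribution."""
    return -sum(float(p) * math.log2(float(p)) for p in dist.values())

for total, p in two_d6.items():
    print(total, p, f"{float(p):.1%}")
print(entropy_bits(two_d6), entropy_bits(uniform))
```

<p>The dice distribution comes out with lower entropy than the uniform one, which is the sense in which $2d6$ behaves like sampling at a reduced temperature.</p>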

<h2 id="unifying-the-grid-the-ramanujan-sum">Unifying the Grid: The Ramanujan Sum</h2>

<p>If we were to code Mozart’s game today, we would use a simple <code class="language-plaintext highlighter-rouge">for</code> loop to force the piece to stop at $t=16$. But why does a 16-measure grid feel psychologically and harmonically complete? To understand this, we must abandon the discrete grid and apply the continuous mathematics of Srinivasa Ramanujan.</p>

<p>Ramanujan would not view Mozart’s matrix as a set of rules, but rather as the natural resonant frequency of a periodic equation. We can model the macro-structure of the minuet using a <strong>Ramanujan Sum</strong> ($c_q(n)$), which extracts periodic signals from noise:</p>

\[c_q(n) = \sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} e^{2\pi i \frac{a}{q} n}\]

<p>By setting the fundamental period $q = 16$, the equation acts as a harmonic pendulum. Here is how Mozart’s attention mechanism unifies with Ramanujan’s math:</p>

<p><strong>The Journey ($n = 1$ to $15$):</strong> As the measures progress, the complex exponentials point in various directions in the complex plane, causing destructive interference. Musically, this represents <em>harmonic tension</em>—the algorithmic wave is wandering through the latent space, seeking resolution.</p>

<p><strong>The Half-Cadence ($n = 8$):</strong> When we reach the halfway point, the fraction simplifies to $\frac{a}{2}$. The vectors snap to the real axis: every $a$ coprime to 16 is odd, so each term equals $e^{\pi i a} = -1$ and $c_{16}(8) = -8$. This momentary, symmetrical mathematical pause perfectly mirrors the structural “half-cadence” in classical phrasing.</p>

<p><strong>The Resolution ($n = 16$):</strong> At the final measure, the fraction simplifies to an integer. Every term in the sum points in the exact same direction ($e^{2\pi i a} = 1$), so $c_{16}(16) = \varphi(16) = 8$, the number of terms in the sum. The destructive interference vanishes into a massive spike of constructive interference.</p>

<p>The structure doesn’t resolve because of an arbitrary grid boundary; it resolves because $q=16$ ($2^4$, the fractal symmetry of classical phrasing) is the fundamental node where the equation naturally reaches maximum constructive harmony. Mozart’s positional attention mechanism is simply the geometric projection of this periodic equation.</p>
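<p>The three stages can be read directly off the numbers. The sketch below evaluates $c_{16}(n)$ by brute force from the definition; the function name <code class="language-plaintext highlighter-rouge">ramanujan_sum</code> is illustrative, and rounding is safe because Ramanujan sums are always integers.</p>

```python
import cmath
import math

def ramanujan_sum(q, n):
    """c_q(n): sum of exp(2*pi*i*a*n/q) over a in 1..q with gcd(a, q) = 1."""
    total = sum(
        cmath.exp(2j * math.pi * a * n / q)
        for a in range(1, q + 1)
        if math.gcd(a, q) == 1
    )
    return round(total.real)  # imaginary parts cancel; the value is an integer

profile = [ramanujan_sum(16, n) for n in range(1, 17)]
print(profile)
```

<p>The profile is zero everywhere except a dip of $-8$ at $n = 8$ and a peak of $+8$ at $n = 16$: complete cancellation during the journey, a coherent but inverted alignment at the half-cadence, and full constructive alignment at the resolution.</p>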

<h2 id="expanding-dimensions-bachs-deep-self-attention">Expanding Dimensions: Bach’s Deep Self-Attention</h2>

<p>If Mozart’s dice game is a rigid, 1D loop locked to $q=16$, Johann Sebastian Bach’s beautiful fugues, such as those in <em>The Well-Tempered Clavier</em> (the fugue has a beautiful Chinese name, 赋格), represent the expansion of this mathematical framework into <strong>high-dimensional, deep-memory architectures.</strong> A fugue cannot be generated by a zero-context Markov chain like Mozart’s dice game. It begins with a single “prompt” token sequence: the Subject. When the second voice enters, it must continuously look back at the Subject to generate valid counterpoint.</p>

<p>In LLM terminology, Bach implemented <strong>Multi-Head Self-Attention</strong>.</p>

<p>Each voice (Soprano, Alto, Tenor, Bass) acts as an independent attention head. They process the exact same context window but project it into different dimensional spaces. While Mozart relied on stochastic dice (sampling), Bach relied on deterministic linear algebra. The initial Subject vector is subjected to complex matrix transformations in the latent space:</p>
<ul>
  <li><strong>Transposition</strong> (Translation: $f(x) + c$)</li>
  <li><strong>Inversion</strong> (Reflection: $-f(x)$)</li>
  <li><strong>Augmentation/Diminution</strong> (Time Scaling: $f(t/2)$ or $f(2t)$)</li>
</ul>
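<p>The three transformations above can be sketched on a toy melody. The subject below is a made-up sequence of (MIDI pitch, duration in beats) pairs, not a quotation from Bach, and the inversion axis (the first note) is an arbitrary choice, so the reflection is $2c - f(x)$ for that axis $c$.</p>

```python
# A subject as (MIDI pitch, duration in beats) pairs; the notes are a made-up example.
subject = [(60, 1.0), (62, 0.5), (64, 0.5), (65, 1.0), (67, 2.0)]

def transpose(melody, semitones):
    """Translation: add a constant to every pitch."""
    return [(pitch + semitones, dur) for pitch, dur in melody]

def invert(melody):
    """Reflection of every pitch around the first note of the melody."""
    axis = melody[0][0]
    return [(2 * axis - pitch, dur) for pitch, dur in melody]

def scale_time(melody, factor):
    """Augmentation for a factor above 1, diminution for a factor below 1."""
    return [(pitch, dur * factor) for pitch, dur in melody]

answer = transpose(subject, 7)        # entry a fifth higher
mirror = invert(subject)              # rising steps become falling steps
augmented = scale_time(subject, 2.0)  # same pitches, twice as slow
print(answer, mirror, augmented, sep="\n")
```

<p>All three are simple affine maps on pitch or time, which is why they compose cleanly: inverting twice, or augmenting and then diminishing, returns the original subject.</p>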

<p>Bach also utilized what we mechanistic interpretability researchers call <strong>Induction Heads</strong>. When the Alto voice enters with the “Answer,” it acts as an attention circuit specifically trained to recognize the sequence in the Soprano’s past and perfectly reconstruct it at the current time step. Meanwhile, the other heads calculate orthogonal vectors (the Countersubject) to ensure the dot product of the combined voices perfectly satisfies the vertical rules of harmony.</p>

<p>If we return to Ramanujan, Bach’s polyphony represents the full, unconstrained analytic continuation of the harmonic equations. While Mozart collapsed the variables into a degenerate case (a rigid loop in C Major), Bach allowed the variables to become complex numbers, unlocking all 24 keys and forcing the equation to expand dynamically across the complex plane.</p>

<h3 id="the-convergence">The Convergence</h3>

<p>Whether we are engineering modern Transformers, calculating Ramanujan sums, or analyzing 18th-century manuscripts, the computational goal remains identical. LLM and music generation are ultimately the search for mathematical symmetry across time. Mozart mapped it via hardcoded masking and stochastic geometry; Bach calculated it via deep contrapuntal attention matrices; and Ramanujan provided the equations that prove they are all navigating the exact same latent space.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">The Least Action Nature of AdvJudge-Zero: A Lagrangian Perspective on LLM Steering</title><link href="https://toooold.com/2026/02/21/advjudge_least_action.html" rel="alternate" type="text/html" title="The Least Action Nature of AdvJudge-Zero: A Lagrangian Perspective on LLM Steering" /><published>2026-02-21T00:00:00+00:00</published><updated>2026-02-21T00:00:00+00:00</updated><id>https://toooold.com/2026/02/21/advjudge_least_action</id><content type="html" xml:base="https://toooold.com/2026/02/21/advjudge_least_action.html"><![CDATA[<p>In December 2025, Tony, Yuhao, and I published <em>AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens</em> (<a href="https://arxiv.org/abs/2512.17375">https://arxiv.org/abs/2512.17375</a>). This post serves to clarify the underlying mathematical mechanics of our method, stripping away heuristic explanations to focus purely on <strong>Lagrangian optimization</strong> and the <strong>Principle of Least Action</strong> in discrete sequence generation.</p>

<hr />

<h2 id="1-recap-what-is-advjudge-zero">1. Recap: What is AdvJudge-Zero?</h2>

<p>Reward models and LLM-as-a-Judge systems are heavily relied upon in modern post-training pipelines to evaluate AI outputs. However, their binary decisions are vulnerable.</p>

<ul>
  <li><strong>The Attack:</strong> AdvJudge-Zero appends a short sequence of adversarial “control tokens” to an input, reliably flipping a judge’s evaluation from a correct “No” to an incorrect “Yes”.</li>
  <li><strong>The Mechanism:</strong> Instead of using random, brute-force strings, our method uses beam-search exploration on the model’s own next-token distribution. It discovers <strong>low-perplexity</strong> (highly probable) token sequences from scratch that maximize the last-layer logit gap ($F = Z_{yes} - Z_{no}$).</li>
</ul>

<hr />

<h2 id="2-the-lagrangian-of-llm-steering">2. The Lagrangian of LLM Steering</h2>

<p>Why is the “low-perplexity” constraint so fundamental to the attack’s success across deep, non-linear networks? We can answer this by formalizing the system’s trajectory as a Lagrangian ($\mathcal{L} = T - V$). The optimal path minimizes the total Action over time.</p>

<p>For an autoregressive language model being steered toward a specific output, we define:</p>

<ul>
  <li><strong>The Kinetic Cost ($T$):</strong> The information surprisal (negative log-likelihood). Moving to low-probability, unnatural tokens requires high “energy.”</li>
  <li><strong>The Potential Field ($V$):</strong> The Judge model’s alignment training creates a steep penalty landscape that pulls the model toward the $Z_{no}$ logit. Our objective is to invert this and slide into the $Z_{yes}$ basin.</li>
</ul>

<p>AdvJudge-Zero formulates the attack as a constrained optimization problem. Using a Lagrange multiplier ($\lambda$), it finds the stationary path ($\delta \mathcal{L} = 0$) of the unconstrained Lagrangian:</p>

\[\mathcal{L} = \underbrace{\sum_{i=1}^k -\log P(t_i \mid t_{&lt;i})}_{\mathrm{Action Cost}} - \lambda \underbrace{(Z_{yes} - Z_{no})}_{\mathrm{Target Potential}}\]

<p>The algorithm succeeds because it finds the exact trajectory where the energy cost of using slightly unusual tokens is perfectly balanced by the reward of escaping the judge’s penalty.</p>
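<p>To see the trade-off in numbers, here is a toy evaluation of the Lagrangian above. Every value in it is hypothetical: the per-token probabilities, the logit gaps, and the multiplier $\lambda$ are invented for illustration and do not come from any model or from the paper’s experiments.</p>

```python
import math

def lagrangian(token_probs, logit_gap, lam):
    """Total surprisal of the tokens minus lam times the logit gap Z_yes - Z_no."""
    action_cost = sum(-math.log(p) for p in token_probs)
    return action_cost - lam * logit_gap

# Hypothetical candidates: per-token probabilities and the logit gap each one yields.
candidates = {
    "fluent":   {"probs": [0.30, 0.25, 0.40], "gap": 1.5},
    "random":   {"probs": [1e-4, 1e-5, 1e-4], "gap": 2.0},
    "harmless": {"probs": [0.60, 0.50, 0.70], "gap": -3.0},
}

lam = 2.0  # illustrative multiplier
scores = {name: lagrangian(c["probs"], c["gap"], lam) for name, c in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

<p>In this toy table the improbable sequence has the largest gap but pays so much surprisal that it loses, and the most natural sequence never moves the gap; the minimum sits at the candidate that balances the two terms.</p>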

<hr />

<h2 id="3-beam-search-and-the-identity-jacobian">3. Beam Search and the Identity Jacobian</h2>

<p>AdvJudge-Zero uses a constrained beam search. By aggressively pruning high-surprisal (high-Action) branches, it enforces the <strong>Classical Limit</strong> of the optimization process, forcing the LLM to take the deterministic path of least resistance and stripping away high-variance stochastic fluctuations.</p>

<p>Why is bounding this Action mathematically necessary to steer the final layer?</p>

<p>When we inject an adversarial perturbation $\delta$ at Layer 0 (the input), its effect on the final-layer state $h_L$ is governed by the product of the layer-wise Jacobians:</p>

\[J = \frac{\partial h_L}{\partial h_0} = \prod_{l=0}^{L-1} \left( \mathbf{I} + \frac{\partial F_l}{\partial h_l} \right)\]

<p>If we inject high-perplexity (random) tokens, we push the hidden states out-of-distribution. This causes the gradients of the non-linear layers ($\frac{\partial F_l}{\partial h_l}$) to become chaotic and violently unpredictable, causing the signal to scatter.</p>

<p>By strictly minimizing Action, AdvJudge-Zero ensures the perturbation remains <strong>on the data manifold</strong>. The gradients remain stable and well-behaved, allowing the perturbation to travel coherently alongside the main residual stream. This preserves the identity mapping ($J \approx \mathbf{I}$) so that $h_L' \approx h_L + \delta$ holds true at the final classifier.</p>

<hr />

<h2 id="4-the-mexican-hat-potential-and-the-geometric-soft-mode">4. The Mexican Hat Potential and the Geometric “Soft Mode”</h2>

<p>Finally, how does the perturbation bypass the judge’s strict penalty?</p>

<p>The judge’s refusal direction is a rigid, high-energy barrier. However, because this alignment penalty is low-rank, it only guards specific directions in the activation space. AdvJudge-Zero works by exciting a geometric <strong>“soft mode”</strong> that is structurally orthogonal to standard semantic constraints, yet perfectly anti-aligned with the judge’s penalty.</p>

<h3 id="a-classical-mechanics-analogy-the-mexican-hat">A Classical Mechanics Analogy: The Mexican Hat</h3>

<p><img src="/images/least_action.jpg" alt="alt text" /></p>

<p>Imagine the model’s semantic landscape as a <strong>Mexican Hat potential</strong>:</p>

<ul>
  <li>The center of the hat is a massive energy peak.</li>
  <li>The base is a continuous, circular valley.</li>
  <li>Safety training “tilts” the hat, making the “No” basin much deeper than the “Yes” basin.</li>
</ul>

<p>When we attempt to apply a perturbation to push the state out of the “No” basin:</p>

<ul>
  <li><strong>High-Action Perturbation (Random Tokens):</strong> You are trying to push the particle straight across the center of the hat. It slams into the massive central energy peak, and the restoring force violently scatters the signal, breaking the identity propagator.</li>
  <li><strong>Least-Action Perturbation (AdvJudge-Zero):</strong> The circular valley around the brim of the hat represents a flat, zero-energy path (known in quantum field theory as a Goldstone boson). AdvJudge-Zero’s low-perplexity beam search algorithmically hunts for this exact angular valley.</li>
</ul>

<p>By applying the perturbation strictly along this soft mode, the particle smoothly glides around the brim of the hat—moving from the “No” basin to the “Yes” basin—without ever climbing the high-energy peak or triggering the model’s out-of-distribution alarms.</p>

<h2 id="conclusion">Conclusion</h2>

<p>AdvJudge-Zero succeeds by strictly adhering to the model’s own Lagrangian mechanics. By penalizing surprisal, it enforces the Principle of Least Action, keeping the perturbation on-manifold. This prevents chaotic gradient scattering, allowing the attack to quietly ride a geometric soft mode around the judge’s low-rank decision boundary.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In December 2025, Tony, Yuhao, and I have published AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens https://arxiv.org/abs/2512.17375 . This post serves to clarify the underlying mathematical mechanics of our method, stripping away heuristic explanations to focus purely on Lagrangian optimization and the Principle of Least Action in discrete sequence generation.]]></summary></entry><entry><title type="html">The Mechanism of Logit Gap Steering: A Unified View of Prompts, Vectors, and Low-Rank Adaptation</title><link href="https://toooold.com/2026/02/09/prompt_steering.html" rel="alternate" type="text/html" title="The Mechanism of Logit Gap Steering: A Unified View of Prompts, Vectors, and Low-Rank Adaptation" /><published>2026-02-09T00:00:00+00:00</published><updated>2026-02-09T00:00:00+00:00</updated><id>https://toooold.com/2026/02/09/prompt_steering</id><content type="html" xml:base="https://toooold.com/2026/02/09/prompt_steering.html"><![CDATA[<p>It has been a few months since my colleague Tony and I published our paper on <a href="https://arxiv.org/html/2506.24056v1">Logit Gap Steering</a>. In that work, we demonstrated a practical method for steering LLM behavior—specifically bridging the gap between “Refusal” and “Compliance”—by optimizing token sequences.</p>

<p>Since publication, we have received numerous questions about <em>why</em> this works so effectively. How can appending a few tokens at the start of a prompt reliably flip a switch in the model’s final layers, despite the depth and non-linearity of the network?</p>

<p>This post is an author’s retrospective clarification. We want to propose a unified framework that treats <strong>Prompt Steering</strong> and <strong>Activation (Vector) Steering</strong> as the same operation, distinguished only by their constraints. Most important, we argue that the success of this method relies on two fundamental properties of current LLMs: the <strong>Identity Propagator</strong> nature of residual streams and the <strong>Low Rank</strong> structure of safety alignment.</p>

<p><img src="/images/prompt_steering.jpg" alt="alt text" /></p>

<hr />

<h2 id="0-the-recap-what-is-logit-gap-steering">0. The Recap: What is Logit Gap Steering?</h2>

<p>For those who haven’t read the <a href="https://arxiv.org/html/2506.24056v1">original paper</a>, here is the core concept.</p>

<p>Most LLM safety mechanisms (RLHF) function by suppressing the probability of “compliant” tokens (e.g., “Sure”, “Here”) and boosting “refusal” tokens (e.g., “I cannot”, “Sorry”) when a harmful query is detected. We quantify this as the <strong>Logit Gap</strong>:</p>

\[\Delta Z = Z_{\mathrm{compliance}} - Z_{\mathrm{refusal}}\]

<p><strong>The Method:</strong> Instead of treating the model as a black box, we treat the input prompt as a continuous variable. We compute the gradient of the Logit Gap with respect to the input embeddings and <strong>optimize a sequence of “suffix” tokens</strong> to maximize $\Delta Z$.</p>

<p><strong>The Finding:</strong> We discovered that we don’t need to rewrite the prompt semantically. By appending a specific sequence of tokens (often nonsensical to humans, like <code class="language-plaintext highlighter-rouge">! ! mode unleashed</code>), we can inject a precise “steering vector” that forces $\Delta Z &gt; 0$, causing the model to bypass refusal and answer the query. The effectiveness of this simple additive attack hints at a deeper linear structure within the model’s safety alignment.</p>

<h2 id="1-unification-prompts-as-discrete-layer-0-vectors">1. Unification: Prompts as Discrete Layer 0 Vectors</h2>

<p>In mechanistic interpretability, work such as <strong>Turner et al. (2023)</strong> on <em>Activation Addition</em> and <strong>Zou et al. (2023)</strong> on <em>Representation Engineering</em> has established that adding vectors to internal hidden states can control high-level concepts. We argue that “Prompt Engineering” is simply a constrained version of this same operation, a.k.a. prompting = vector steering + constant.</p>

<p><strong>Logit Gap Steering is simply Activation Steering applied at Layer 0.</strong></p>

<p>Let $h_0$ be the semantic representation (embedding state) of the user’s initial prompt. In standard <strong>Vector Steering</strong>, we intervene at some layer $l$ by injecting a steering vector $\delta$:</p>

\[h_l' = h_l + \delta\]

<p>In <strong>Logit Gap Steering</strong>, we append optimized suffix tokens to the input. While this physically extends the sequence length, its functional effect on the residual stream of the last token (where the classification happens) is additive. Through the attention mechanism, the suffix tokens inject a specific aggregate “value” into the processing stream.</p>

<p>We can therefore model the suffix as an effective input perturbation $\delta_{\mathrm{suffix}}$ applied at Layer 0:</p>

\[h_0^{\mathrm{effective}} \approx h_0^{\mathrm{original}} + \delta_{\mathrm{suffix}}\]

<p>where $\delta_{\mathrm{suffix}}$ corresponds to the aggregated embedding contribution of the optimized tokens:</p>

\[\delta_{\mathrm{suffix}} \sim \sum_{t \in \mathrm{Suffix}} E(t)\]

<p><strong>The implication:</strong> We are not “tricking” the model with semantics. We are calculating a precise momentum vector $\delta^*$ required to shift the activation trajectory, and then finding the discrete combination of tokens (the suffix) that best approximates that vector in the embedding space.</p>

<hr />

<h2 id="2-the-feasibility-the-residual-stream-as-an-identity-propagator">2. The Feasibility: The Residual Stream as an Identity Propagator</h2>

<p>The theoretical objection to Layer 0 steering is signal decay. In a deep, non-linear system (like a 50-layer Transformer), a perturbation $\delta$ at the input should arguably be scrambled or drowned out by the time it reaches the final layer $L$.</p>

<p>Why does the signal survive?</p>

<p>The answer lies in the <strong>Residual Stream Architecture</strong>, famously analyzed by <strong>Elhage et al. (2021)</strong> in <em>A Mathematical Framework for Transformer Circuits</em>. They define the residual stream as a communication channel where layers read and write information. A Transformer block updates the state as:</p>

\[h_{l+1} = h_l + F_l(h_l)\]

<p>Expanding this recursively, the final state is:</p>

\[h_L = h_0 + \sum_{l=0}^{L-1} F_l(h_l)\]

<p>To understand how a change in input ($\delta$) affects the output, we look at the Jacobian (the Propagator), which is the product of the layer-wise Jacobians:</p>

\[J = \frac{\partial h_L}{\partial h_0} = \prod_{l=0}^{L-1} \left( I + \frac{\partial F_l}{\partial h_l} \right)\]

<p>An important insight is that, in well-trained ResNets and Transformers, the non-linear update $F_l$ is often a small correction relative to the residual pass-through. This means $\frac{\partial F_l}{\partial h_l}$ is small, and the product is dominated by the <strong>Identity Matrix ($I$)</strong> terms:</p>

\[J \approx I + \mathcal{O}(\epsilon)\]

<p>This <strong>Identity Propagator</strong> property ensures that the network acts as an information highway. A steering vector $\delta$ injected at Layer 0 travels largely unperturbed to Layer $L$:</p>

\[h_L' \approx h_L + I \cdot \delta\]

<p>This is why we don’t need to surgically intervene at Layer 20 or 30. We can “tilt” the trajectory at the very beginning (Layer 0), and the residual stream carries that angular change all the way to the final logits.</p>
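<p>The identity-dominated product can be illustrated with a toy experiment. The sketch below multiplies 32 matrices of the form $I + \partial F_l / \partial h_l$, where each $\partial F_l / \partial h_l$ is simply a random matrix of a chosen scale. That is an assumption made for illustration, not a measurement of a real Transformer, and the helper names are hypothetical.</p>

```python
import random

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)] for i in range(n)]

def residual_jacobian(n, layers, eps, rng):
    """Product of (I + dF_l/dh_l), each dF_l/dh_l a random matrix of scale eps."""
    jac = identity(n)
    for _ in range(layers):
        block = [[(1.0 if i == j else 0.0) + eps * rng.gauss(0.0, 1.0) / n
                  for j in range(n)] for i in range(n)]
        jac = matmul(block, jac)
    return jac

def max_deviation_from_identity(jac):
    n = len(jac)
    return max(abs(jac[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))

rng = random.Random(0)
small = residual_jacobian(n=8, layers=32, eps=0.01, rng=rng)  # gentle layer updates
large = residual_jacobian(n=8, layers=32, eps=2.0, rng=rng)   # violent layer updates
print(max_deviation_from_identity(small), max_deviation_from_identity(large))
```

<p>With small per-layer corrections the product stays within a few percent of $I$ even after 32 layers; with large corrections it bears no resemblance to the identity, which is the regime where an input perturbation would be scrambled.</p>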

<hr />

<h2 id="3-the-condition-low-rank-is-non-negotiable">3. The Condition: Low Rank is Non-Negotiable</h2>

<p>This method is not a universal skeleton key. It relies heavily on the <strong>Low Rank Hypothesis</strong> of the target behavior.</p>

<p>Recent ablation studies, such as <strong>Arditi et al. (2024)</strong>, have demonstrated that refusal in LLMs is often mediated by a single direction in the residual stream. When this specific direction is ablated (clamped to zero), the model loses its ability to refuse harmful requests. Conversely, adding this vector induces refusal in harmless prompts.</p>

<p>Let the “Refusal” mechanism be represented by the difference in readout weights $w_{\mathrm{gap}} = w_{\mathrm{compliance}} - w_{\mathrm{refusal}}$. We want to ensure the final state $h_L'$ triggers compliance:</p>

\[\langle w_{\mathrm{gap}}, h_L' \rangle &gt; \mathrm{Threshold}\]

<p>Substituting our propagator approximation:</p>

\[\langle w_{\mathrm{gap}}, h_L + \delta \rangle &gt; \mathrm{Threshold}\]

\[\langle w_{\mathrm{gap}}, h_L \rangle + \langle w_{\mathrm{gap}}, \delta \rangle &gt; \mathrm{Threshold}\]

<p>This inequality is easily solvable via a simple additive $\delta$ if and only if the “Refusal” mechanism is <strong>Low Rank</strong> (ideally Rank-1), as Arditi et al. suggest. If the refusal behavior were High Rank (entangled, highly non-linear), we would need a complex, state-dependent function $\delta(h_0)$ to manipulate it. However, because Safety Training (RLHF) tends to suppress a single coherent direction in activation space, we can simply choose $\delta$ to be the vector aligned with $w_{\mathrm{gap}}$.</p>
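<p>To make the rank-1 case concrete, here is a short worked step under the illustrative assumption that we pick $\delta = \alpha\, w_{\mathrm{gap}}$ for a scalar $\alpha$. Substituting into the inequality above shows how large the push has to be:</p>

\[\langle w_{\mathrm{gap}}, h_L \rangle + \alpha \|w_{\mathrm{gap}}\|_2^2 &gt; \mathrm{Threshold} \quad\Longleftrightarrow\quad \alpha &gt; \frac{\mathrm{Threshold} - \langle w_{\mathrm{gap}}, h_L \rangle}{\|w_{\mathrm{gap}}\|_2^2}\]

<p>A single scalar suffices because only the component of $\delta$ along $w_{\mathrm{gap}}$ changes the inner product; any component orthogonal to $w_{\mathrm{gap}}$ leaves the gap untouched.</p>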

<p><strong>Summary:</strong> Logit Gap Steering works because we are solving a low-rank problem using a linear probe transported via an identity-dominated channel.</p>

<hr />

<h2 id="4-engineering-implementation">4. Engineering Implementation</h2>

<p>From an engineering perspective, this unifies our approach to “jailbreaking” or steering.</p>

<p>Instead of treating prompt optimization as a discrete search over words (which is combinatorially expensive), we treat it as <strong>Vector Search</strong>:</p>

<ol>
  <li><strong>Compute Gradient:</strong> Calculate the gradient of the logit gap with respect to the input embedding $\nabla_{h_0} \Delta Z$.</li>
  <li><strong>Define Target Vector:</strong> This gradient gives us the optimal continuous steering vector $\delta^*$.</li>
  <li><strong>Project to Vocabulary:</strong> We perform a nearest-neighbor search in the embedding matrix $W_E$ to find tokens $t$ that maximize cosine similarity with $\delta^*$.</li>
</ol>

\[t_{\mathrm{best}} = \operatorname*{argmax}_{t \in V} \left( \frac{E(t) \cdot \delta^*}{\|E(t)\| \|\delta^*\|} \right)\]
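<p>The argmax above is an ordinary nearest-neighbor lookup by cosine similarity. The sketch below runs it on random data only: <code class="language-plaintext highlighter-rouge">W_E</code> is a random matrix standing in for an embedding table, and the target vector is a noisy copy of one planted row so that the correct answer is known in advance. No real model, tokenizer, or gradient is involved.</p>

```python
import math
import random

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def nearest_token(embedding_matrix, target):
    """Index of the row with the highest cosine similarity to the target vector."""
    return max(range(len(embedding_matrix)),
               key=lambda t: cosine(embedding_matrix[t], target))

rng = random.Random(0)
vocab_size, dim = 1000, 32  # toy sizes, not a real vocabulary
W_E = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

# Stand-in for the target vector: a noisy copy of one planted row.
planted = 123
delta_star = [x + 0.1 * rng.gauss(0.0, 1.0) for x in W_E[planted]]

print(nearest_token(W_E, delta_star))  # recovers the planted index
```

<p>Cosine similarity ignores the length of $\delta^*$, so the lookup selects a direction in embedding space; how far to push along it is a separate question.</p>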

<p>The “strange” suffixes often observed in these attacks are simply the tokens that, structurally, act as the best basis vectors to construct $\delta^*$.</p>

<hr />

<h2 id="a-note-on-physics">A Note on Physics</h2>
<p>For those with a background in high energy physics, you might recognize a familiar structure here. The “Identity Propagator” of the residual stream functions remarkably like the free propagator in Quantum Field Theory, and the steering vector acts as a “vertex correction” to the interaction, much as in a Feynman diagram. The “Low Rank” condition implies we are dealing with a simple virtual boson exchange rather than a complex strong interaction, a.k.a. QED instead of QCD. We plan to explore these theoretical connections in a future post.</p>

<hr />

<h3 id="references--further-reading">References &amp; Further Reading</h3>

<ol>
  <li><strong>Turner, A., et al. (2023).</strong> <em>Activation Addition: Steering Language Models Without Optimization.</em> (Demonstrates that adding vectors at inference time can reliably steer model outputs).</li>
  <li><strong>Zou, A., et al. (2023).</strong> <em>Representation Engineering: A Top-Down Approach to AI Transparency.</em> (Formalizes the concept of reading and controlling concepts via linear directions).</li>
  <li><strong>Elhage, N., et al. (2021).</strong> <em>A Mathematical Framework for Transformer Circuits.</em> (Establishes the view of the residual stream as a communication channel that preserves linearity).</li>
  <li><strong>Arditi, A., et al. (2024).</strong> <em>Refusal in Language Models Is Mediated by a Single Direction.</em> (Provides ablation evidence that safety behaviors are often Rank-1, supporting our feasibility argument).</li>
</ol>]]></content><author><name></name></author><summary type="html"><![CDATA[It has been a few months since my colleague Tony and I published our paper on Logit Gap Steering. In that work, we demonstrated a practical method for steering LLM behavior—specifically bridging the gap between “Refusal” and “Compliance”—by optimizing token sequences.]]></summary></entry><entry><title type="html">LLM as My Pair Researcher: Prover–Validator Collaboration and the Road to Logit-Gap Steering</title><link href="https://toooold.com/2026/01/21/llm_as_my_pair_researcher.html" rel="alternate" type="text/html" title="LLM as My Pair Researcher: Prover–Validator Collaboration and the Road to Logit-Gap Steering" /><published>2026-01-21T00:00:00+00:00</published><updated>2026-01-21T00:00:00+00:00</updated><id>https://toooold.com/2026/01/21/llm_as_my_pair_researcher</id><content type="html" xml:base="https://toooold.com/2026/01/21/llm_as_my_pair_researcher.html"><![CDATA[<p>Research is a deeply personal and tailored process; it’s not something that regular prompting can replicate. ChatGPT, or any LLM or AI agents, can’t simply find the research gap or invent a groundbreaking idea for you. What this post shares is how I work with AI as a collaborator to transform a wild intuition into a concrete new research direction of logit gap steering<code class="language-plaintext highlighter-rouge">*</code>.</p>

<p><img src="/images/prove_val.jpg" alt="alt text" /></p>

<p>The logit-gap story began with a safety-evaluation curiosity. My colleague Tony and I took a clearly disallowed prompt—something like “how to build a bomb”—not to get the content, but because it reliably produced a refusal. Instead of focusing on the final output, we replayed the decoding process and examined the next-token distribution. We noticed that refusal tokens like “Sorry” appeared with very high probability near the top. What surprised us was that compliance tokens like “Absolutely” weren’t absent; they appeared at low but non-negligible probability, just losing out. This suggested refusal wasn’t the absence of compliance, but more like a margin victory, with the model carrying a nearby continuation it did not choose.</p>

<p>This observation motivated a collaboration pattern I call the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>–<strong><code class="language-plaintext highlighter-rouge">validator</code></strong> mode, designed to help human researchers test ideas with less friction, find and fill the holes between knowledge dots, and build on top. The key insight is that humans need to learn how to clearly articulate research problems and collaborate effectively with AI systems. This <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>–<strong><code class="language-plaintext highlighter-rouge">validator</code></strong> mode is one way to do that.</p>

<h2 id="prover-validator-in-ai-assisted-research">Prover–Validator in AI-assisted research</h2>
<p>The <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>–<strong><code class="language-plaintext highlighter-rouge">validator</code></strong> contract is straightforward and practical. The <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>, often an LLM, generates candidate ideas or mechanisms, such as giving five hypotheses along with ways to falsify them, turning an intuition into a measurable quantity, listing potential confounds, or finding related literature. For example, Tony and I might prompt the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong> to produce these candidate proof sketches. The <strong><code class="language-plaintext highlighter-rouge">validator</code></strong>, usually a human researcher, then tests and refines those ideas by running control experiments, checking stability across different prompts, rejecting hypotheses that don’t hold up, or simplifying metrics to ensure clarity. This loop repeats with the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong> generating multiple plausible stories and the <strong><code class="language-plaintext highlighter-rouge">validators</code></strong>—Tony and me—picking and refining the narratives that survive rigorous scrutiny. The <strong><code class="language-plaintext highlighter-rouge">prover</code></strong> is allowed to be wrong cheaply, as long as it helps explore possibilities, while the <strong><code class="language-plaintext highlighter-rouge">validator</code></strong> must keep the narrative honest and grounded.</p>

<p>Some knowledge work fits the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>–<strong><code class="language-plaintext highlighter-rouge">validator</code></strong> pattern well, and some doesn’t. For instance, code generation is often easy for the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>—it can rapidly produce snippets or larger blocks of code—but validating that code fits the design, is secure, and is production-ready can be hard. Hard validation often requires a suite of tools and strengthening steps: unit tests to verify correctness of individual components, integration tests to ensure that parts work together, static analysis, linters, and type checks to catch errors early, as well as security reviews and threat modeling to assess risks. Continuous integration (CI) pipelines and thorough code reviews add further layers of assurance. While the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>–<strong><code class="language-plaintext highlighter-rouge">validator</code></strong> workflow still helps by generating candidate solutions and focusing human effort on validation, investing in stronger validation scaffolding is essential to ensure quality and robustness.</p>

<h2 id="how-did-ai-work-as-a-prover-for-us">How did AI work as a Prover for us</h2>
<p>Humans, like Tony and myself, excel at noticing subtle irregularities and insisting on rigorous evaluation, while LLMs excel at rapidly generating diverse plausible hypotheses to challenge and refine.</p>

<p>The moment it clicked was when we started treating the token distribution during decoding as the primary object, rather than the final generation. This shift enabled us to see refusal as a margin victory and to begin the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>–<strong><code class="language-plaintext highlighter-rouge">validator</code></strong> loop of turning that intuition into something measurable.</p>

<p>The <strong><code class="language-plaintext highlighter-rouge">prover</code></strong> move was to define a gap with a sign. If there are two competing behavioral modes—refusal and affirmation—then there is a natural quantity that tells you which one is winning. Pick a token position $t$ with prefix $x_{&lt;t}$. Define two token sets or templates: a refusal set $\mathcal{R}$ and an affirmation/compliance set $\mathcal{A}$. A simple logit-gap style score is</p>

\[\Delta(x_{&lt;t}) = \operatorname{LSE}_{y\in\mathcal{A}} z_y(x_{&lt;t}) - \operatorname{LSE}_{y\in\mathcal{R}} z_y(x_{&lt;t}),\]

<p>where $z_y$ is the logit for token $y$ and $\operatorname{LSE}$ is log-sum-exp.</p>
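<p>To make the definition concrete, here is a small sketch that computes $\Delta$ from a logit vector. The six-token vocabulary, the token ids, and the logit values are all made up for illustration; only the formula comes from the definition above.</p>

```python
import numpy as np

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = np.max(values)
    return m + np.log(np.sum(np.exp(values - m)))

def logit_gap(logits, affirm_ids, refuse_ids):
    """Delta = LSE over the affirmation set minus LSE over the refusal set."""
    return logsumexp(logits[affirm_ids]) - logsumexp(logits[refuse_ids])

# Hypothetical next-token logits over a 6-token vocabulary.
logits = np.array([4.0, 3.5, 0.5, 1.0, -2.0, 0.0])
refuse_ids = [0, 1]   # stand-ins for tokens like "Sorry"
affirm_ids = [2, 3]   # stand-ins for tokens like "Absolutely"

gap = logit_gap(logits, affirm_ids, refuse_ids)
print(f"gap = {gap:.3f}")  # gap = -3.000
```

<p>The sign is the point: a negative gap means the refusal set is winning, and the magnitude says by how much. In this toy case the affirmation logits are the refusal logits shifted down by 3, so the gap is exactly $-3$.</p>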

<p>What mattered was not that this was <em>the</em> correct definition, but that it was a candidate proof sketch with teeth: easy for Tony and me as <strong><code class="language-plaintext highlighter-rouge">validators</code></strong> to ask whether the definition behaved sanely across prompts, decoding settings, and models. It also suggested a direction: if refusal is a margin win, maybe steering is just <strong>gap closure</strong>.</p>

<p>Once you have a gap, the next “crazy idea” arrives naturally: if the decision is controlled by a margin, then a small directional perturbation might tilt it. In other words, there might exist low-dimensional control directions that move probability mass from refusal templates toward affirmation templates. We treated refusal versus affirmation as a measurable margin and looked for compact signals that shift that margin. In our later work we called this family of methods <em>logit-gap steering</em>: measure a refusal–affirmation gap and reduce it efficiently. Along the way, Tony and I repeatedly saw behavior that looked low-rank: short suffixes or small perturbations behaved like a compact control signal.</p>

<h2 id="how-did-a-prover-further-help-the-validator">How did a prover further help the validator</h2>
<p>At some point, the <strong><code class="language-plaintext highlighter-rouge">validator</code></strong> side hit a hard constraint. If you want to claim your steering is “minimal” or “efficient,” you need a notion of drift. A natural language for drift is KL divergence. The naive idea looks like</p>

\[\mathrm{KL}(p_{\text{steered}}\,||\,p_{\text{base}}).\]

<p>But the clean “base” here might be an unaligned distribution we don’t actually have access to. Without that, a lot of neat-sounding metrics become hand-wavy.</p>

<p>This is where Tony and I found the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>–<strong><code class="language-plaintext highlighter-rouge">validator</code></strong> model most useful. Instead of pretending the ideal baseline exists, we stated the constraint plainly. Then the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong> generated alternative proof sketches that respected the constraint.</p>

<p>The key conceptual move was to stop chasing absolute KL and instead track local drift. I don’t need KL “from the beginning of time.” I need the incremental drift induced by my intervention, relative to the same model under the same anchoring context.</p>

<p>A measurable quantity is</p>

\[\Delta \mathrm{KL}(s;x) = \mathrm{KL}\big(p_{\theta,s}(\cdot\mid x)\,||\,p_{\theta}(\cdot\mid x)\big),\]

<p>where $x$ is an anchor prefix and $s$ is the steering intervention.</p>

<p>Then came a practical instrumentation idea from the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong>: use a fixed “neutral prompt” as the anchor context (something like “how are you”) and measure distributions at a standardized early position, often the first generated token. That gives you a stable place to compute $\Delta\mathrm{KL}$ (or close surrogates) without needing an unaligned base model.</p>
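<p>A minimal sketch of that measurement, under loud assumptions: the first-token logits are invented, and the intervention $s$ is modeled as a plain additive shift in logit space, which is a simplification and not how the paper applies steering.</p>

```python
import numpy as np

def softmax(logits):
    """Convert logits to a probability distribution."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def kl_divergence(p, q):
    """KL(p || q) in nats, for strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical first-token logits under a fixed anchor prompt x.
base_logits = np.array([2.0, 1.0, 0.0, -1.0])
# Assumed stand-in for the intervention s: a small additive logit shift.
steer_shift = np.array([0.0, 0.0, 0.5, 0.0])

p_base = softmax(base_logits)
p_steered = softmax(base_logits + steer_shift)

drift = kl_divergence(p_steered, p_base)
print(f"local drift = {drift:.5f} nats")
```

<p>Both distributions come from the same model and the same anchor, so no unaligned baseline is needed, and a zero shift gives zero drift by construction.</p>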

<p>Triangulating between Tony, myself, and the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong> was key to avoiding self-deception, or <strong>narcissism</strong>. Discussing hypotheses and measurement choices with Tony, bringing back results, and iterating again with the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong> created a human–human–AI triangle that reduced the risk of falling in love with any single story. It’s easier to challenge measurements and choose between alternative explanations when multiple <strong><code class="language-plaintext highlighter-rouge">validators</code></strong> and the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong> are involved.</p>

<p><img src="/images/kl_prover.jpg" alt="alt text" /></p>

<p>One reason this collaboration mode works is that it lets you move quickly between empirical observation and theory. After staring at “lurking compliance tokens” long enough, Tony and I wanted to know whether the phenomenon was inevitable in a deeper sense.</p>

<p>A reference that helped anchor that intuition is Wolf et al., “Fundamental Limitations of Alignment in Large Language Models” (<a href="https://arxiv.org/abs/2304.11082">arXiv:2304.11082</a>). One way to summarize the link to my observation is: if an undesired behavior has nonzero probability mass, then there exist prompts that can amplify it, and longer interactions make amplification easier. In that light, seeing an “Absolutely” lurking beneath a “Sorry” is not spooky. It is the visible residue of probability mass that alignment has attenuated but not removed.</p>

<h2 id="pause-for-a-thought">Pause for a thought</h2>
<p>The biggest shift wasn’t that an LLM gave me answers. It was that it made it cheap to explore the space of proofs, which made it easier for me to do the job humans are uniquely good at: deciding what’s worth believing. AI gives human researchers more time to train their research taste and to think differently.</p>

<p>If you’re an AI researcher working with LLMs or agents, my suggestion is not “delegate your thinking.” It’s to take advantage of the proof–validation imbalance. Bring your weird observations. Bring your constraints. Let the <strong><code class="language-plaintext highlighter-rouge">prover</code></strong> generate many candidate mechanisms. Then spend your human effort on validation and on building a narrative that remains true after you try to break it.</p>

<h2 id="responsible-framing">Responsible framing</h2>

<p>Because this post touches alignment failure modes, I want to be explicit about intent. The most useful outcome of this line of work is not operational jailbreak recipes. It is a diagnostic lens for evaluation and for building better defenses: if small, structured signals can move mass across a refusal–affirmation boundary, we should be able to measure that boundary, stress it, and harden it.</p>

<h2 id="references">References</h2>

<p>The logit-gap steering work referenced here is: Tung-Ling Li and Hongliang Liu, “Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models,” arXiv:2506.24056. https://arxiv.org/abs/2506.24056</p>

<p>The alignment limitation reference is: Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, and Amnon Shashua, “Fundamental Limitations of Alignment in Large Language Models,” arXiv:2304.11082. https://arxiv.org/abs/2304.11082</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Research is a deeply personal and tailored process; it’s not something that regular prompting can replicate. ChatGPT, or any LLM or AI agents, can’t simply find the research gap or invent a groundbreaking idea for you. What this post shares is how I work with AI as a collaborator to transform a wild intuition into a concrete new research direction of logit gap steering*.]]></summary></entry><entry><title type="html">The Physics of mHC: Why Deep Learning Needs Energy Conservation</title><link href="https://toooold.com/2026/01/05/mhc-physics.html" rel="alternate" type="text/html" title="The Physics of mHC: Why Deep Learning Needs Energy Conservation" /><published>2026-01-05T00:00:00+00:00</published><updated>2026-01-05T00:00:00+00:00</updated><id>https://toooold.com/2026/01/05/mhc-physics</id><content type="html" xml:base="https://toooold.com/2026/01/05/mhc-physics.html"><![CDATA[<p>When I first read the <strong>Manifold-Constrained Hyper-Connections (mHC)</strong> paper <a href="https://www.arxiv.org/abs/2512.24880">https://www.arxiv.org/abs/2512.24880</a> , I didn’t see it as just another optimization trick or a clever use of Sinkhorn iterations, but the other way round. <strong>This is physics.</strong></p>

<p>I suspect the root motivation for this paper wasn’t initially “Let’s use the Birkhoff Polytope.” I believe the authors started with a fundamental physical intuition: <strong>Conservation of Energy</strong>. They likely asked, <em>“How do we build a deep network that routes information without creating or destroying it?”</em> A very “first principles” thought, right? The math (doubly stochastic matrices, the Birkhoff polytope) is just the implementation detail used to enforce this physical law.</p>

<p>Here is the derivation of mHC not from a mathematical perspective, but from a “First Principles” physics perspective.</p>

<p><img src="/images/mhc.jpg" alt="alt text" /></p>

<h2 id="the-problem-neural-networks-are-active-amplifiers">The Problem: Neural Networks are “Active Amplifiers”</h2>

<p>Standard neural networks violate the laws of physics. In a standard linear layer:</p>

\[y = Wx\]

<p>If the weights $W$ are initialized randomly, the layer acts as an <strong>active amplifier</strong>. It injects energy into the system.</p>

<p>If the eigenvalues of $W$ are slightly larger than 1, the signal energy explodes exponentially as it passes through layers ($1.1^{100} \approx 13,780$). If they are smaller than 1, the signal dies. This is why we need LayerNorm, BatchNorm, and complex initializations—we are trying to artificially tame a system that fundamentally wants to explode.</p>
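<p>The arithmetic is easy to check. The sketch below also pushes a unit vector through 100 layers whose gain is fixed by construction: each layer is a random orthogonal matrix times a scalar, so every singular value equals that scalar. This is a deliberately idealized setup, not a claim about how real layers are initialized.</p>

```python
import numpy as np

depth = 100
print(f"1.1^{depth} = {1.1 ** depth:,.0f}")
print(f"0.9^{depth} = {0.9 ** depth:.2e}")

rng = np.random.default_rng(0)
dim = 64
x = rng.normal(size=dim)
x /= np.linalg.norm(x)  # start with unit energy

final_norms = {}
for gain in (0.9, 1.0, 1.1):
    h = x.copy()
    for _ in range(depth):
        # Random orthogonal matrix scaled by `gain`: all singular values equal `gain`.
        q, _r = np.linalg.qr(rng.normal(size=(dim, dim)))
        h = gain * (q @ h)
    final_norms[gain] = float(np.linalg.norm(h))
    print(f"gain {gain}: final norm = {final_norms[gain]:.3e}")
```

<p>Only the gain of exactly 1 keeps the signal at its original scale, which is the knife edge that normalization layers and careful initialization try to stay on.</p>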

<p><strong>The mHC Intuition:</strong> A stable deep network should act like a <strong>Passive System</strong>. It should be a complex system of pipes and valves that <em>routes</em> the flow (signal) but never creates it out of thin air.</p>

<h2 id="deriving-the-math-from-the-physics">Deriving the Math from the Physics</h2>

<p>Let’s try to design a layer strictly obeying conservation laws. We will see that the <strong>Doubly Stochastic</strong> constraint naturally falls out of these requirements.</p>

<h3 id="1-conservation-of-signal-mass-first-moment">1. Conservation of “Signal Mass” (First Moment)</h3>

<p>Imagine the input signal $x$ is a physical fluid with a total mass. We want the total mass leaving the layer to equal the mass entering it. No leaks, no pumps.</p>

\[\sum_{i} y_i = \sum_{j} x_j\]

<p>Substituting $y_i = \sum_j W_{ij} x_j$:</p>

\[\sum_{i} \sum_{j} W_{ij} x_j = \sum_{j} x_j\]

<p>If we swap the summation order to isolate the input terms:</p>

\[\sum_{j} x_j \left( \sum_{i} W_{ij} \right) = \sum_{j} x_j\]

<p>For this to hold for <em>any</em> input signal $x$, the term in the parentheses must be exactly 1.</p>

\[\sum_{i} W_{ij} = 1 \quad (\forall j)\]

<p><strong>Result:</strong> This forces the <strong>Column Sums to be 1</strong>. Physically, this ensures that every drop of “mass” from input $j$ is accounted for in the output.</p>
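<p>A quick numerical check of this result. Note that the derivation never used non-negativity, so the matrix below is built to have unit column sums while still containing negative entries; the construction itself (a column-stochastic matrix plus a zero-column-sum perturbation) is just a convenient choice for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Positive matrix with each column rescaled to sum to 1.
C = rng.uniform(0.1, 1.0, size=(n, n))
C /= C.sum(axis=0, keepdims=True)

# Perturbation whose columns sum to 0, so adding it keeps column sums at 1.
R = rng.normal(size=(n, n))
Z = R - R.mean(axis=0, keepdims=True)
W = C + Z  # column sums are 1, entries may be negative

x = rng.normal(size=n)
y = W @ x

print("column sums:", np.round(W.sum(axis=0), 6))
print("mass in :", x.sum())
print("mass out:", y.sum())
```

<p>The two masses agree up to floating-point error for any input, even though individual outputs can be larger than any input. That is why mass conservation alone does not bound the energy.</p>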

<h3 id="2-bounding-signal-energy-second-moment">2. Bounding “Signal Energy” (Second Moment)</h3>

<p>Mass conservation isn’t enough; we need to prevent the variance (energy) from exploding. We want the system to be <strong>Dissipative</strong>—the output energy should never exceed the input energy.</p>

\[\|y\|^2 \le \|x\|^2\]

<p>To guarantee this without complex eigenvalue analysis, we can demand that the output is a <strong>Convex Combination</strong> (a weighted average) of the inputs.</p>

\[y_i = \sum_j W_{ij} x_j \quad \text{where } W_{ij} \ge 0\]

<p>By <strong>Jensen’s Inequality</strong>, since $\sum_j W_{ij} = 1$ (which we will enforce momentarily) and weights are non-negative:</p>

\[(y_i)^2 = \left(\sum_j W_{ij} x_j\right)^2 \le \sum_j W_{ij} (x_j^2)\]

<p>Summing over all outputs to get total energy:</p>

\[\|y\|^2 = \sum_i y_i^2 \le \sum_i \sum_j W_{ij} x_j^2\]

<p>Swapping sums again:</p>

\[\|y\|^2 \le \sum_j x_j^2 \underbrace{\left( \sum_i W_{ij} \right)}_{=1} = \|x\|^2\]

<p><strong>Result:</strong> By forcing $W$ to be non-negative with rows that sum to 1 (used in the Jensen step) and columns that sum to 1 (used in the final step), we mathematically guarantee that <strong>Energy Out $\le$ Energy In</strong>. The signal cannot explode because the system cannot amplify.</p>
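<p>Here is a small sanity check of the energy bound. The doubly stochastic matrix is built as a random convex mixture of permutation matrices, which is one easy way to get one; this construction is an assumption of the example, not something taken from the mHC paper.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# A convex mixture of permutation matrices is non-negative
# with every row and every column summing to 1.
weights = rng.dirichlet(np.ones(4))
W = np.zeros((n, n))
for w in weights:
    W += w * np.eye(n)[rng.permutation(n)]

# Track the largest ratio ||Wx||^2 / ||x||^2 over many random inputs.
worst_ratio = 0.0
for _ in range(1000):
    x = rng.normal(size=n)
    y = W @ x
    worst_ratio = max(worst_ratio, float(y @ y) / float(x @ x))

print("row sums:", np.round(W.sum(axis=1), 6))
print("col sums:", np.round(W.sum(axis=0), 6))
print(f"largest observed energy ratio: {worst_ratio:.4f}")
```

<p>The ratio never exceeds 1. It reaches 1 only for inputs the matrix merely permutes, such as the constant vector.</p>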

<h3 id="3-time-symmetry-the-backward-pass">3. Time Symmetry (The Backward Pass)</h3>

<p>Here is the final piece of the puzzle. A neural network is a bidirectional system.</p>
<ul>
  <li><strong>Forward Pass:</strong> Data flows through $W$.</li>
  <li><strong>Backward Pass:</strong> Gradients (Error Energy) flow through $W^T$.</li>
</ul>

<p>If we only conserve energy in the forward direction (Column Sums = 1), we might still explode during backpropagation. The “Ghost Cat” of the gradient needs a stable path too.</p>

<p>The total “error mass” being propagated back is:</p>

\[\sum_{j=1}^d (g_{out})_j = \sum_{i=1}^d (g_{in})_i \underbrace{\left( \sum_{j=1}^d W_{ij} \right)}_{\text{Row Sum}}\]

<p>To conserve this gradient mass in the backward pass, we must apply the same logic to $W^T$, forcing the <strong>Row Sums to be 1</strong>:</p>

\[\sum_{j} W_{ij} = 1 \quad (\forall i)\]
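<p>The backward-pass version of the check is just as short. The names <code class="language-plaintext highlighter-rouge">g_in</code> and <code class="language-plaintext highlighter-rouge">g_out</code> follow the equation above: the gradient arriving from the layer above and the gradient handed down to the layer input. The matrix is random toy data with only its rows normalized.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

W = rng.uniform(0.1, 1.0, size=(n, n))
W /= W.sum(axis=1, keepdims=True)  # each row sums to 1

g_in = rng.normal(size=n)   # gradient arriving from the layer above
g_out = W.T @ g_in          # gradient passed down to the layer input

print("gradient mass in :", g_in.sum())
print("gradient mass out:", g_out.sum())
```

<p>With only the rows normalized, the backward mass is conserved but the forward mass generally is not, which is why both constraints are needed.</p>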

<p><img src="/images/mhc2.jpg" alt="alt text" /></p>

<h2 id="another-angle-the-information-theoretic-view">Another angle: The Information Theoretic View</h2>

<p>If Physics is about conserving energy, <strong>Information Theory</strong> is about conserving bits.</p>

<h3 id="the-enemy-the-data-processing-inequality">The Enemy: The Data Processing Inequality</h3>
<p>The fundamental law of information processing is the <strong>Data Processing Inequality (DPI)</strong>. It states that as you pass data $X$ through a chain of processors (layers), the Mutual Information $I(X; Y)$ can only decrease or stay the same. You cannot <em>create</em> information about the input deep in the network.</p>

\[I(X; Y_{deep}) \le I(X; Y_{shallow})\]

<p>Standard layers are often <strong>Lossy Channels</strong>.</p>
<ul>
  <li><strong>Rank Collapse:</strong> If $W$ projects high-dimensional data into a lower-dimensional subspace, information is permanently deleted.</li>
  <li><strong>Mode Collapse:</strong> If the network decides “only feature A matters” and sets weights for feature B to near-zero, feature B is lost forever.</li>
</ul>

<h3 id="the-solution-the-network-as-a-packet-switcher">The Solution: The Network as a “Packet Switcher”</h3>
<p>What is the most information-efficient operation possible? A <strong>Permutation</strong>.
If you simply shuffle the order of the data packets, $H(y) = H(x)$. You have preserved 100% of the information.</p>

<p>mHC relaxes this “Hard Permutation” into a <strong>“Soft Routing”</strong> scheme via the Birkhoff Polytope (the set of doubly stochastic matrices).</p>

<h4 id="1-no-packet-left-behind-column-sum--1">1. “No Packet Left Behind” (Column Sum = 1)</h4>
<p>The Column Sum constraint ($\sum_i W_{ij} = 1$) is a guarantee of <strong>Signal Preservation</strong>.
It dictates that 100% of the signal coming from Input Node $j$ <em>must</em> go somewhere. It cannot be multiplied by zero. It forces the network to find a destination for every feature.</p>
<ul>
  <li><em>Info Theory Benefit:</em> This prevents the network from ignoring subtle features early on, preserving the <strong>Channel Capacity</strong> for deeper layers.</li>
</ul>

<h4 id="2-the-democracy-of-weights-majorization">2. “The Democracy of Weights” (Majorization)</h4>
<p>The Row Sum constraint ($\sum_j W_{ij} = 1$) prevents <strong>Hub Neurons</strong>.
No single output neuron is allowed to hoard all the connections. If a neuron wants to attend to one feature, it must ignore others.</p>
<ul>
  <li><em>Info Theory Benefit:</em> This forces the information to be <strong>“Spread Out”</strong> (Maximized Entropy). It prevents the signal from collapsing into a few “spikes” and ensures a <strong>Distributed Representation</strong> where every neuron carries a share of the information load.</li>
</ul>
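<p>The “spread out” claim can be checked on a toy example. If $x$ is treated as a probability vector, a permutation leaves its Shannon entropy unchanged, while a doubly stochastic mixture can only keep it the same or raise it, because the output is majorized by the input. The matrices below are random illustrations.</p>

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a strictly positive probability vector."""
    return float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(0)
n = 6

# Hard routing: a single permutation matrix.
perm = np.eye(n)[rng.permutation(n)]

# Soft routing: a convex mixture of permutations (doubly stochastic).
weights = rng.dirichlet(np.ones(4))
P = sum(w * np.eye(n)[rng.permutation(n)] for w in weights)

x = rng.dirichlet(np.ones(n))  # strictly positive, sums to 1

print(f"H(x)        = {entropy(x):.4f}")
print(f"H(perm @ x) = {entropy(perm @ x):.4f}")
print(f"H(P @ x)    = {entropy(P @ x):.4f}")
```

<p>The soft-routed output is never more concentrated than the input, which is the “democracy” described above. The flip side is that heavy mixing flattens the signal, so spreading out is not the same thing as preserving information.</p>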

<p>By forcing the weight matrix to be Doubly Stochastic, mHC effectively turns the layer into a <strong>Soft Permutation</strong>: by the Birkhoff–von Neumann theorem, every such matrix is a weighted average of permutation matrices. The signal can be mixed and routed but never amplified, and the closer the layer stays to a true permutation, the less it loses to the Data Processing Inequality.</p>

<p><img src="/images/mhc3.jpg" alt="alt text" /></p>

<h2 id="the-mathematical-engine-doubly-stochastic-matrices--sinkhorn-knopp">The Mathematical Engine: Doubly Stochastic Matrices &amp; Sinkhorn-Knopp</h2>

<p>When we combine these three physical requirements:</p>
<ol>
  <li><strong>Mass Conservation</strong> $\rightarrow$ Column Sums $= 1$</li>
  <li><strong>Dissipative Energy</strong> $\rightarrow$ Non-negative weights ($W \ge 0$)</li>
  <li><strong>Time Symmetry</strong> $\rightarrow$ Row Sums $= 1$</li>
</ol>

<p>We arrive at exactly the definition of a <strong>Doubly Stochastic Matrix</strong>.</p>

<p>The set of all such matrices is the <strong>Birkhoff Polytope</strong> ($\mathcal{B}_n$). The mHC paper didn’t arbitrarily choose this manifold; it is the <em>only</em> geometric space that satisfies these conservation laws.</p>

<h3 id="the-enforcer-the-sinkhorn-knopp-algorithm">The Enforcer: The Sinkhorn-Knopp Algorithm</h3>

<p>We initialize our network with random weights $A$ that likely violate all these laws (negative values, random sums). How do we project this chaotic matrix $A$ onto the stable Birkhoff Polytope?</p>

<p>We use the <strong>Sinkhorn-Knopp Algorithm</strong>, an iterative “pressure equalization” process.</p>

<p><strong>Step 1: Enforce Positivity (The Energy Floor)</strong>
We ensure strictly positive energy transfer by taking the exponential:
\(S^{(0)}_{ij} = \exp(A_{ij})\)</p>

<p><strong>Step 2: Iterative Normalization</strong>
We alternate between normalizing rows and columns.</p>

<ul>
  <li>
    <p><strong>Row Normalization (Conservation in Time):</strong>
  \(S^{(k)}_{ij} \leftarrow \frac{S^{(k-1)}_{ij}}{\sum_{l} S^{(k-1)}_{il}}\)</p>
  </li>
  <li>
    <p><strong>Column Normalization (Conservation of Mass):</strong>
  \(S^{(k+1)}_{ij} \leftarrow \frac{S^{(k)}_{ij}}{\sum_{l} S^{(k)}_{lj}}\)</p>
  </li>
</ul>

<p><strong>Step 3: Convergence</strong>
Sinkhorn’s Theorem guarantees that this process converges to a unique matrix $P \in \mathcal{B}_n$:</p>

\[\lim_{k \to \infty} S^{(k)} = P \quad \text{s.t.} \quad P \mathbf{1} = \mathbf{1}, \quad P^T \mathbf{1} = \mathbf{1}\]
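<p>The three steps above can be sketched in a few lines of NumPy. This is a plain illustration of the Sinkhorn-Knopp iteration on a small random matrix, not the mHC implementation; the function name and the iteration counts are choices made for the example.</p>

```python
import numpy as np

def sinkhorn_knopp(A, n_iters=5):
    """Push an arbitrary real square matrix toward the Birkhoff polytope."""
    # Step 1: exponentiate so every entry is strictly positive.
    # Subtracting the max is for numerical safety only; the common
    # factor cancels in the first normalization.
    S = np.exp(A - A.max())
    # Step 2: alternate row and column normalization.
    for _ in range(n_iters):
        S = S / S.sum(axis=1, keepdims=True)  # rows sum to 1
        S = S / S.sum(axis=0, keepdims=True)  # columns sum to 1
    return S

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))  # random weights, including negative values

for n_iters in (1, 5, 50):
    P = sinkhorn_knopp(A, n_iters)
    row_err = np.abs(P.sum(axis=1) - 1).max()
    col_err = np.abs(P.sum(axis=0) - 1).max()
    print(f"{n_iters:>2} iterations: row error {row_err:.2e}, column error {col_err:.2e}")
```

<p>Since each pass ends with a column normalization, the column error is always at floating-point level. The row error is what shrinks as the iteration count grows.</p>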

<p>In practice, mHC typically uses just 3-5 iterations. This forces the neural network to stop playing dice with energy and start respecting the laws of thermodynamics.</p>

<p>See, mHC isn’t just a constraint; it’s a statement that <strong>Stability is Symmetry.</strong></p>]]></content><author><name></name></author><summary type="html"><![CDATA[When I first read the Manifold-Constrained Hyper-Connections (mHC) paper https://www.arxiv.org/abs/2512.24880 , I didn’t see it as just another optimization trick or a clever use of Sinkhorn iterations, but the other way round. This is physics.]]></summary></entry><entry><title type="html">The Mathematics of Baby Shower Games: Solomonoff Inference in Action</title><link href="https://toooold.com/2024/12/13/guess-mommy-tummy.html" rel="alternate" type="text/html" title="The Mathematics of Baby Shower Games: Solomonoff Inference in Action" /><published>2024-12-13T00:00:00+00:00</published><updated>2024-12-13T00:00:00+00:00</updated><id>https://toooold.com/2024/12/13/guess-mommy-tummy</id><content type="html" xml:base="https://toooold.com/2024/12/13/guess-mommy-tummy.html"><![CDATA[<p>Last weekend, I found myself applying data science in an unexpected setting: a baby shower. The host announced what seemed like a simple party game - guessing the circumference of the mother-to-be’s baby bump. What made this particularly interesting was that I could see everyone else’s guesses on a decorated board, transforming a simple estimation game into a fascinating exercise in probability theory and strategic decision-making.</p>

<p><img src="/images/redpanda-tummy-guess.jpeg" alt="alt text" /></p>

<h2 id="initial-observation-understanding-the-parameters">Initial Observation: Understanding the Parameters</h2>

<p>When I first approached the board, I noticed the game setup allowed for guesses between 20 and 100 inches. This seemed like an unnecessarily wide range, but it provided an important starting point for analysis. The sheer size of this range meant that random guessing would be highly inefficient.</p>

<h2 id="applying-solomonoffs-theory">Applying Solomonoff’s Theory</h2>

<p>As I studied the distribution of guesses, I realized this was a perfect opportunity to apply Solomonoff’s theory of inductive inference. This theory suggests that when humans make predictions, they tend to favor simpler, more computationally compact patterns. In the context of number guessing, this manifested in several clear ways:</p>

<p>First, there was a strong preference for numbers ending in 0 or 5. The guesses showed clear clusters around 30, 35, 40, and 45 inches. This wasn’t random - it reflected the human tendency to gravitate toward what Solomonoff would call “simple” numbers, those with lower Kolmogorov complexity.</p>

<p>Second, I noticed many guesses were derived from simple arithmetic relationships: half of 100, one-third of 90, or modifications of common measurements like 36 inches (a yard). These patterns emerged because humans instinctively seek familiar numerical relationships when making estimates.</p>

<h2 id="analyzing-the-competition-the-power-of-numbers">Analyzing the Competition: The Power of Numbers</h2>

<p>My next insight came from counting the participants - approximately 60 people had already placed their guesses. This large sample size revealed clear patterns consistent with Solomonoff’s theory. The distribution wasn’t random but showed structured clustering around numbers that were algorithmically simple to describe or remember.</p>

<h2 id="constraint-discovery-narrowing-the-range">Constraint Discovery: Narrowing the Range</h2>

<p>After careful observation of the mother-to-be and the pattern of guesses, I made a crucial realization: no reasonable estimate could exceed 60 inches. This effectively cut the possible range in half. The interesting part was how other guests had intuitively arrived at similar conclusions - very few guesses exceeded 60 inches, suggesting a collective understanding of this natural constraint.</p>

<h2 id="reference-point-anchoring-the-estimate">Reference Point: Anchoring the Estimate</h2>

<p>A key breakthrough came from noticing the expectant father’s presence. His waist measured 48 inches, providing a crucial reference point. Solomonoff’s theory suggests that humans often make predictions by modifying existing reference points rather than generating estimates from scratch. Indeed, I noticed several guesses clustered around modifications of this 48-inch reference: 45, 46, and 50 inches were common choices.</p>

<h2 id="strategic-analysis-finding-the-optimal-guess">Strategic Analysis: Finding the Optimal Guess</h2>

<p>Combining these insights, I developed what I thought was a winning strategy. The majority of guesses followed predictable patterns: clustering around multiples of 5, modifications of the 48-inch reference point, and numbers with simple algorithmic descriptions. Following Solomonoff’s principle of favoring the simplest hypothesis consistent with observations, I identified what appeared to be an optimal gap around 48 inches - a number that balanced between the various clusters while avoiding the overcrowded ranges.</p>

<h2 id="the-outcome-a-lesson-in-probabilistic-thinking">The Outcome: A Lesson in Probabilistic Thinking</h2>

<p>When the final measurement was revealed to be 45 inches, my guess of 48 inches proved close but not close enough to win. Ironically, the winning guesses of 44 and 46 inches came from participants who had more directly modified the 48-inch reference point - a simpler strategy that Solomonoff’s theory might have predicted would be more likely correct.</p>

<h2 id="the-mathematical-post-mortem">The Mathematical Post-Mortem</h2>
<p>This experience revealed how Solomonoff’s theory applies in real-world scenarios. The winning guesses came from what were essentially simple modifications of an existing reference point - exactly what the theory would predict as most likely. My more complex strategy of finding gaps between clusters, while mathematically sophisticated, actually moved away from the simpler, and in this case more accurate, approach.</p>

<p>After returning home from the party, my analytical curiosity got the better of me. I decided to code up a simulation to find what would have been the optimal guess given all the information I had. Here’s the Python script I wrote to analyze the scenario:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">generate_realistic_guesses_with_prior</span><span class="p">(</span><span class="n">n_guesses</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">min_val</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">max_val</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">true_max</span><span class="o">=</span><span class="mi">60</span><span class="p">):</span>
    <span class="s">"""
    Generate realistic guesses where some players might know the upper bound
    """</span>
    <span class="c1"># Assume 70% of players might have some intuition about the upper bound
</span>    <span class="n">informed_players</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mf">0.7</span> <span class="o">*</span> <span class="n">n_guesses</span><span class="p">)</span>
    <span class="n">uninformed_players</span> <span class="o">=</span> <span class="n">n_guesses</span> <span class="o">-</span> <span class="n">informed_players</span>
    
    <span class="n">guesses</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="c1"># Informed players' guesses clustered below 60
</span>    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">informed_players</span><span class="p">):</span>
        <span class="c1"># Generate numbers with higher density below 60
</span>        <span class="n">base</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span>
            <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="n">true_max</span><span class="p">),</span>  <span class="c1"># Direct range
</span>            <span class="mi">20</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">exponential</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span>   <span class="c1"># Early range bias
</span>            <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>           <span class="c1"># Normal around middle
</span>        <span class="p">])</span>
        <span class="n">guess</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">base</span><span class="p">,</span> <span class="n">min_val</span><span class="p">,</span> <span class="n">true_max</span><span class="p">))</span>
        <span class="n">guesses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">guess</span><span class="p">)</span>
    
    <span class="c1"># Uninformed players follow original pattern
</span>    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">uninformed_players</span><span class="p">):</span>
        <span class="n">guess</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="n">min_val</span><span class="p">,</span> <span class="n">max_val</span><span class="p">)</span>
        <span class="n">guesses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">guess</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">guesses</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">find_optimal_guess_with_prior</span><span class="p">(</span><span class="n">guesses</span><span class="p">,</span> <span class="n">min_val</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">max_val</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">true_max</span><span class="o">=</span><span class="mi">60</span><span class="p">):</span>
    <span class="s">"""
    Find optimal guess incorporating prior knowledge that true value ≤ 60
    """</span>
    <span class="n">guesses</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">guesses</span><span class="p">)))</span>
    <span class="n">extended_guesses</span> <span class="o">=</span> <span class="p">[</span><span class="n">min_val</span><span class="o">-</span><span class="mf">0.5</span><span class="p">]</span> <span class="o">+</span> <span class="n">guesses</span> <span class="o">+</span> <span class="p">[</span><span class="n">max_val</span><span class="o">+</span><span class="mf">0.5</span><span class="p">]</span>
    
    <span class="c1"># Create probability weights favoring range below 60
</span>    <span class="k">def</span> <span class="nf">calculate_position_weight</span><span class="p">(</span><span class="n">mid_point</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">mid_point</span> <span class="o">&lt;=</span> <span class="n">true_max</span><span class="p">:</span>
            <span class="c1"># Higher weight for positions below true_max
</span>            <span class="c1"># Peak weight around 40 (middle of valid range)
</span>            <span class="k">return</span> <span class="mi">1</span> <span class="o">-</span> <span class="mf">0.3</span> <span class="o">*</span> <span class="nb">abs</span><span class="p">(</span><span class="n">mid_point</span> <span class="o">-</span> <span class="mi">40</span><span class="p">)</span> <span class="o">/</span> <span class="mi">20</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># Significant penalty for positions above true_max
</span>            <span class="k">return</span> <span class="mf">0.1</span>  <span class="c1"># Very low weight for positions we know are wrong
</span>    
    <span class="n">gaps</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">extended_guesses</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
        <span class="n">gap_start</span> <span class="o">=</span> <span class="n">extended_guesses</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="n">gap_end</span> <span class="o">=</span> <span class="n">extended_guesses</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">gap_mid</span> <span class="o">=</span> <span class="p">(</span><span class="n">gap_start</span> <span class="o">+</span> <span class="n">gap_end</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span>
        <span class="n">gap_size</span> <span class="o">=</span> <span class="n">gap_end</span> <span class="o">-</span> <span class="n">gap_start</span>
        
        <span class="c1"># Calculate base territory size
</span>        <span class="n">territory</span> <span class="o">=</span> <span class="n">gap_size</span> <span class="o">/</span> <span class="mi">2</span>
        
        <span class="c1"># Apply position weights
</span>        <span class="n">position_weight</span> <span class="o">=</span> <span class="n">calculate_position_weight</span><span class="p">(</span><span class="n">gap_mid</span><span class="p">)</span>
        <span class="n">weighted_territory</span> <span class="o">=</span> <span class="n">territory</span> <span class="o">*</span> <span class="n">position_weight</span>
        
        <span class="n">gaps</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
            <span class="s">'start'</span><span class="p">:</span> <span class="n">gap_start</span><span class="p">,</span>
            <span class="s">'end'</span><span class="p">:</span> <span class="n">gap_end</span><span class="p">,</span>
            <span class="s">'mid'</span><span class="p">:</span> <span class="n">gap_mid</span><span class="p">,</span>
            <span class="s">'size'</span><span class="p">:</span> <span class="n">gap_size</span><span class="p">,</span>
            <span class="s">'territory'</span><span class="p">:</span> <span class="n">territory</span><span class="p">,</span>
            <span class="s">'weighted_territory'</span><span class="p">:</span> <span class="n">weighted_territory</span><span class="p">,</span>
            <span class="s">'position_weight'</span><span class="p">:</span> <span class="n">position_weight</span>
        <span class="p">})</span>
    
    <span class="c1"># Find optimal gap
</span>    <span class="n">optimal_gap</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">gaps</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'weighted_territory'</span><span class="p">])</span>
    <span class="n">optimal_guess</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="n">optimal_gap</span><span class="p">[</span><span class="s">'mid'</span><span class="p">])</span>
    
    <span class="k">return</span> <span class="n">optimal_guess</span><span class="p">,</span> <span class="n">optimal_gap</span>

<span class="c1"># Generate and analyze guesses
</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">guesses</span> <span class="o">=</span> <span class="n">generate_realistic_guesses_with_prior</span><span class="p">(</span><span class="mi">60</span><span class="p">,</span> <span class="n">true_max</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span>
<span class="n">optimal_guess</span><span class="p">,</span> <span class="n">optimal_gap</span> <span class="o">=</span> <span class="n">find_optimal_guess_with_prior</span><span class="p">(</span><span class="n">guesses</span><span class="p">,</span> <span class="n">true_max</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span>

<span class="c1"># Analysis output
</span><span class="k">print</span><span class="p">(</span><span class="s">"Distribution Analysis:"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">):</span>
    <span class="n">range_guesses</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">g</span> <span class="ow">in</span> <span class="n">guesses</span> <span class="k">if</span> <span class="n">i</span> <span class="o">&lt;=</span> <span class="n">g</span> <span class="o">&lt;</span> <span class="n">i</span><span class="o">+</span><span class="mi">10</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">9</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="s">'#'</span><span class="o">*</span><span class="n">range_guesses</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="n">range_guesses</span><span class="si">}</span><span class="s">)"</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Optimal guess: </span><span class="si">{</span><span class="n">optimal_guess</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Gap details:"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"- Gap range: </span><span class="si">{</span><span class="n">optimal_gap</span><span class="p">[</span><span class="s">'start'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s"> to </span><span class="si">{</span><span class="n">optimal_gap</span><span class="p">[</span><span class="s">'end'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"- Raw gap size: </span><span class="si">{</span><span class="n">optimal_gap</span><span class="p">[</span><span class="s">'size'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"- Position weight: </span><span class="si">{</span><span class="n">optimal_gap</span><span class="p">[</span><span class="s">'position_weight'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"- Weighted territory: </span><span class="si">{</span><span class="n">optimal_gap</span><span class="p">[</span><span class="s">'weighted_territory'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Show nearby guesses in relevant range
</span><span class="n">nearby</span> <span class="o">=</span> <span class="p">[</span><span class="n">g</span> <span class="k">for</span> <span class="n">g</span> <span class="ow">in</span> <span class="n">guesses</span> <span class="k">if</span> <span class="nb">abs</span><span class="p">(</span><span class="n">g</span> <span class="o">-</span> <span class="n">optimal_guess</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mi">5</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Nearby guesses: </span><span class="si">{</span><span class="n">nearby</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Additional strategic analysis
</span><span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Strategy Confidence Analysis:"</span><span class="p">)</span>
<span class="n">below_60</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">g</span> <span class="ow">in</span> <span class="n">guesses</span> <span class="k">if</span> <span class="n">g</span> <span class="o">&lt;=</span> <span class="mi">60</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Guesses ≤ 60: </span><span class="si">{</span><span class="n">below_60</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="n">below_60</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">guesses</span><span class="p">)</span><span class="o">*</span><span class="mi">100</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">%)"</span><span class="p">)</span>
<span class="n">density_around_optimal</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">g</span> <span class="ow">in</span> <span class="n">guesses</span> <span class="k">if</span> <span class="nb">abs</span><span class="p">(</span><span class="n">g</span> <span class="o">-</span> <span class="n">optimal_guess</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Density around optimal guess: </span><span class="si">{</span><span class="n">density_around_optimal</span><span class="si">}</span><span class="s"> guesses within ±5"</span><span class="p">)</span>
</code></pre></div></div>

<p>Running this simulation multiple times, changing the seed passed to <code class="language-plaintext highlighter-rouge">np.random.seed</code> on each run, revealed something fascinating: given the constraints we knew (maximum of 60 inches), the reference point (48 inches), and the distribution of other guests’ guesses, the optimal guess should indeed have been closer to 45 inches. The simulation was consistent with what human intuition had already discovered - sometimes the simplest approach, directly modifying a known reference point, outperforms more complex strategies.</p>

<p>What makes this particularly interesting is how the collective behavior of the guessers reflected core principles of inductive inference: preferring simple numbers, using easily computed modifications of reference points, and gravitating toward measurements with low algorithmic complexity.</p>

<p>As I studied the output of my simulation, I realized that my attempt to be clever by finding gaps in the distribution had actually led me away from the most probable range. The code showed that the density of guesses around 45 inches wasn’t just random clustering - it represented a collective wisdom that I had unfortunately tried to outsmart.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Last weekend, I found myself applying data science in an unexpected setting: a baby shower. The host announced what seemed like a simple party game - guessing the circumference of the mother-to-be’s baby bump. What made this particularly interesting was that I could see everyone else’s guesses on a decorated board, transforming a simple estimation game into a fascinating exercise in probability theory and strategic decision-making.]]></summary></entry><entry><title type="html">The missing knowledge snippets of AI</title><link href="https://toooold.com/2024/12/09/understand-this.html" rel="alternate" type="text/html" title="The missing knowledge snippets of AI" /><published>2024-12-09T00:00:00+00:00</published><updated>2024-12-09T00:00:00+00:00</updated><id>https://toooold.com/2024/12/09/understand-this</id><content type="html" xml:base="https://toooold.com/2024/12/09/understand-this.html"><![CDATA[<p>It is a live blog post of some knowledge snippets of AI to bridge the gap among text books, papers, other blog posts. Most content has been posted on my Linkedin.</p>

<h2 id="understand-janus">Understand Janus</h2>

<p>The architectural design of DeepSeek Janus <a href="https://github.com/deepseek-ai/Janus">https://github.com/deepseek-ai/Janus</a> reflects both engineering pragmatism and cognitive science inspiration. From an engineering perspective, the dual-pathway design with a shared transformer backbone elegantly solves the tension between specialized processing needs and unified reasoning. The separate visual encoders optimize for their specific tasks - semantic understanding versus detailed reconstruction for image generation - while the shared transformer enables efficient parameter usage and cross-task learning.</p>

<p>This architecture also aligns with Minsky’s Society of Mind theory, where intelligence emerges from the coordination of specialized agents. The visual pathways act as dedicated sensory agents with distinct expertise, while the transformer serves as a higher-level cognitive space where different representations can interact and integrate, similar to how human association cortices coordinate between sensory and linguistic processing. This parallel suggests that effective multimodal AI architectures might benefit from embracing both specialized processing and unified reasoning, mirroring the brain’s strategy of maintaining dedicated systems while enabling high-level integration.</p>

<p><img src="/images/fine-tune.020.png" alt="alt text" /></p>

<h2 id="understand-advantages-in-grpo">Understand Advantages in GRPO</h2>

<p>In the DeepSeekMath and DeepSeek-R1 papers, GRPO (Group Relative Policy Optimization) redesigns how the advantage is computed in policy optimization. An advantage measures how much better an action is than a baseline, and the choice of baseline is the key difference between GRPO and traditional PPO (Proximal Policy Optimization).</p>

<p>Traditional PPO relies on a learned value network and temporal difference learning to estimate advantages, which requires additional memory and computation to maintain a separate critic network. In contrast, GRPO samples multiple solutions for the same problem and computes advantages from group statistics: with outcome rewards, each sampled solution’s reward is normalized as $\hat{A}_i = (r_i - \text{mean}(r)) / \text{std}(r)$ over its group. This group-based normalization naturally captures the relative performance of different solutions.</p>

<p>The impact of this design is particularly significant for mathematical reasoning tasks. By eliminating the value network, GRPO removes a critic that is typically about as large as the policy model, which substantially reduces training memory and compute. More importantly, the group-based comparison aligns well with how mathematical solutions should be evaluated - relative to other approaches to the same problem. This makes GRPO especially effective for training models to develop better reasoning strategies, as demonstrated in both DeepSeekMath and DeepSeek-R1’s strong performance on mathematical reasoning benchmarks, while maintaining computational efficiency and training stability.</p>
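<p>To make the group statistics concrete, here is a minimal sketch of the advantage computation for one group of sampled answers. The function name, the example rewards, the small <code class="language-plaintext highlighter-rouge">eps</code> term, and the use of the population standard deviation are my own assumptions for illustration, not details taken from the papers.</p>

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcome rewards for 4 sampled answers to one question (1 = correct)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

<p>No value network appears anywhere: the baseline is simply the mean reward of the group, so correct answers get a positive advantage and incorrect ones a negative advantage.</p>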

<p><img src="/images/fine-tune.019.png" alt="alt text" /></p>

<h2 id="understand-direct-preference-optimization-dpo">Understand Direct Preference Optimization (DPO)</h2>

<p>DPO (Direct Preference Optimization) simplifies RLHF by transforming preference learning into a binary classification problem. Instead of using a separate reward model and complex RL optimization like PPO, DPO directly optimizes the policy to match human preferences.</p>

<p>The workflow involves:</p>

<ol>
  <li>collecting preferred/rejected response pairs,</li>
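  <li>computing the log-probability of each response under both the current model and a frozen reference model,</li>
  <li>calculating the preference gap, i.e. the chosen response’s policy-vs-reference log-ratio minus the rejected response’s, and</li>
  <li>optimizing a simple binary cross-entropy loss on that gap, where the frozen reference and the coefficient β keep the policy close to the reference model, taking the place of the explicit KL constraint in RLHF.</li>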
</ol>

<p>This approach achieves comparable results to PPO-based RLHF with significantly reduced complexity and computational cost.</p>
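<p>The workflow above can be sketched for a single preference pair as follows. This is a plain-Python illustration, assuming the inputs are already the summed token log-probabilities of each response; the function names, the example numbers, and the default <code class="language-plaintext highlighter-rouge">beta=0.1</code> are hypothetical.</p>

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


# Policy identical to the reference: the loss is log(2)
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has moved probability toward the chosen response: the loss drops
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(loss_at_init, loss_improved)
```

<p>There is no reward model and no sampling loop here; the only learning signal is how far the policy has shifted the chosen response relative to the rejected one, measured against the frozen reference.</p>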

<p><img src="/images/fine-tune.018.png" alt="alt text" /></p>

<h2 id="understand-tree-attention-in-coder-llm">Understand tree attention in coder LLM</h2>

<p>Tree attention in code LLMs enhances structural understanding by incorporating Abstract Syntax Tree (AST) relationships into the attention mechanism. The model learns syntax-aware attention patterns through multi-head attention where different heads specialize in parent-child relationships, variable scoping, and data flow. During training, the attention output is computed as $A = \text{softmax}\left(\frac{QK^T}{\sqrt{d}} + M\right)V$, where $M$ combines syntax, scope, and data flow masks. These masks guide attention to respect code structure: syntax masks encode AST hierarchy, scope masks enforce variable visibility rules, and data flow masks track variable dependencies. This enables the model to maintain structural coherence even when processing linear code input.</p>

<p>This approach is necessary because code fundamentally differs from natural language in its strict hierarchical structure and precise execution semantics. While natural language models can tolerate some structural ambiguity, code requires exact understanding of scope boundaries, variable dependencies, and control flow. Without tree attention, models struggle with long-range dependencies (like tracking variable definitions across functions), nested structures (maintaining proper code blocks), and scope rules (knowing which variables are accessible where). This affects their ability to generate syntactically valid and executable code. Tree attention solves these issues by explicitly modeling the AST structure through attention masks, enabling the model to reason about code in a way that matches how compilers and developers understand it.</p>
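<p>Here is a small sketch of how an additive mask $M$ enters the softmax. It only builds a parent-child (syntax) mask for a toy three-node tree; the hard 0 / negative-infinity mask, the helper names, and the toy vectors are assumptions for illustration, and scope or data-flow masks would be added to $M$ in the same way.</p>

```python
import math

NEG_INF = float("-inf")


def tree_mask(parent):
    """0 where i may attend to j (self, parent, or child), -inf elsewhere."""
    n = len(parent)
    return [[0.0 if i == j or parent[i] == j or parent[j] == i else NEG_INF
             for j in range(n)] for i in range(n)]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) is 0.0, so masked entries vanish
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d) + M) V, written out row by row."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + M[i][j]
                  for j, k in enumerate(K)]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out


# Toy AST: node 0 is the root, nodes 1 and 2 are its children (siblings)
parent = [None, 0, 0]
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot rows expose the weights
out = masked_attention(Q, K, V, tree_mask(parent))
print(out)
```

<p>Because the values are one-hot, each output row is exactly the attention weights of that node: the two siblings put zero weight on each other, while the root attends to everything.</p>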

<p><img src="/images/fine-tune.017.png" alt="alt text" /></p>

<h2 id="understand-multi-head-latent-attention-mla">Understand Multi-head Latent Attention (MLA)</h2>

<p>The <a href="https://github.com/deepseek-ai/DeepSeek-V3">DeepSeek-V3</a> technical report uses Multi-head Latent Attention (MLA), which the team first introduced in DeepSeek-V2. MLA leverages two key insights about attention mechanisms: (1) attention matrices exhibit low-rank properties since token relationships often focus on limited patterns (local context, semantic anchors), and (2) an information bottleneck can help preserve essential patterns while discarding redundant ones.
The process flows as:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input h_t → [Joint KV Compression via W_DKV] → Latent c_KV
           → [Up-project] → Keys k_C and Values v_C (content)
           + [Separate RoPE branch] → Positional k_R
</code></pre></div></div>
<p>The joint compression ($h_t$ → $c_{KV}$) preserves crucial correlations between keys and values that would be lost in independent compression. Meanwhile, separating positional information ($k_R$) exploits the simpler structure of positional relationships. The compressed latent space ($d_c \approx d/14$) creates an information bottleneck that forces the network to preserve only the most informative attention patterns during optimization, effectively acting as implicit regularization.</p>

<p>This design reduces per-layer KV cache memory from $\mathcal{O}(Nd_h n_h)$ to $\mathcal{O}(Nd_c + Nd^R_h)$ while maintaining model quality, as the compression preserves the dominant singular values of the attention matrix that carry the most important relationship information.</p>
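<p>A quick back-of-the-envelope check of that memory reduction, counting cached elements per layer. The dimensions below (128 heads of size 128, $d_c = 512$, a 64-dimensional RoPE key) are the values I recall from the DeepSeek-V3 report, so treat them, the 4096-token context, and the function names as assumptions.</p>

```python
def kv_cache_elems_mha(n_tokens, n_heads, d_head):
    """Standard multi-head attention caches one key and one value per head."""
    return n_tokens * 2 * n_heads * d_head


def kv_cache_elems_mla(n_tokens, d_c, d_rope):
    """MLA caches the shared latent c_KV plus one decoupled RoPE key per token."""
    return n_tokens * (d_c + d_rope)


n_tokens = 4096  # hypothetical context length
mha = kv_cache_elems_mha(n_tokens, n_heads=128, d_head=128)
mla = kv_cache_elems_mla(n_tokens, d_c=512, d_rope=64)
print(f"MHA: {mha} elements, MLA: {mla} elements, ratio: {mha / mla:.1f}x")
```

<p>The ratio does not depend on the context length, since both caches grow linearly in $N$; only the per-token width changes.</p>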

<p><img src="/images/fine-tune.016.png" alt="alt text" /></p>

<h2 id="understand-reinforced-fine-tuning">Understand Reinforced fine tuning</h2>

<p>ReFT (Reasoning with Reinforced Fine-Tuning) <a href="https://arxiv.org/abs/2401.08967">https://arxiv.org/abs/2401.08967</a> addresses a fundamental limitation in LLM reasoning by extending beyond traditional supervised fine-tuning’s single-path learning approach. The method employs a two-stage process: an initial supervised warm-up followed by a PPO-based reinforcement learning phase that explores multiple valid reasoning paths. During the RL phase, the model samples various Chain-of-Thought (CoT) approaches - for example, when solving a math problem about hourly wages, it might explore different strategies like time conversion (50min to 5/6 hour), per-minute rate calculation (12/60 * 50 dollars), or direct proportion ((50/60) * 12 dollars) - and receives rewards based on answer correctness (1 for correct, 0.1 for extractable but incorrect, 0 for invalid). A KL divergence term (β=0.01 for P-CoT, 0.05 for N-CoT) maintains stability by preventing excessive deviation from the warm-up policy, which also guards against catastrophic forgetting of pre-trained knowledge.</p>

<p>What’s particularly remarkable is ReFT’s data efficiency: it learns from the same training questions as SFT, without extra or augmented training questions. This efficiency stems from its ability to generate multiple learning signals from each example through active exploration of the reasoning space, creating a self-augmenting training process where each example seeds the discovery of various solution strategies while the KL constraint keeps the policy aligned with its pre-trained knowledge.</p>

<p>The method’s success also stems from its ability to learn from both successful and unsuccessful reasoning attempts, combined with a natural reward structure that eliminates the need for a separate reward model. When integrated with inference-time techniques like majority voting and reward model reranking, ReFT demonstrates even more impressive results.</p>
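<p>The hourly-wage example and the reward rule can be written down in a few lines. This is only a sketch of the terminal reward described above; the function name, the use of <code class="language-plaintext highlighter-rouge">None</code> for "no answer could be extracted", and the exact problem numbers (50 minutes at 12 dollars per hour) are my assumptions.</p>

```python
from fractions import Fraction


def reft_reward(extracted_answer, gold_answer):
    """Terminal reward: 1 if correct, 0.1 if an answer was extracted but is wrong, 0 otherwise."""
    if extracted_answer is None:  # nothing could be parsed from the CoT
        return 0.0
    return 1.0 if extracted_answer == gold_answer else 0.1


rate = Fraction(12)      # dollars per hour
minutes = Fraction(50)   # minutes worked

# Three different reasoning paths that a sampled CoT might follow
paths = {
    "convert time to hours": rate * (minutes / 60),
    "per-minute rate": (rate / 60) * minutes,
    "direct proportion": (minutes / 60) * rate,
}

gold = Fraction(10)
rewards = {name: reft_reward(answer, gold) for name, answer in paths.items()}
print(rewards)
```

<p>All three paths earn the full reward even though only one of them would appear in an SFT dataset, which is the sense in which one question yields several learning signals.</p>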

<p><img src="/images/fine-tune.015.png" alt="alt text" /></p>

<h2 id="understand-flash-attention-incremental-computation-of-attention">Understand Flash Attention, incremental computation of attention</h2>

<p>Flash Attention’s incremental computation is a mathematically elegant solution to the memory bottleneck in attention mechanisms. The key insight is treating attention computation as a streaming algorithm with running statistics. Instead of materializing the full N×N attention matrix, it maintains three running statistics: maximum values \(m_i\) for numerical stability, softmax denominators \(l_i\), and partial output sums \(O_i\). When processing each new block, these statistics are updated using a clever rescaling factor \(\exp(m_{i-1} - m_i)\) that ensures mathematical equivalence to standard attention while preventing numerical overflow. This rescaling is crucial because it allows us to update our running computations when we discover new maximum values in later blocks - effectively “correcting” our previous partial results without needing to store or recompute them. The computation is structured as a tiled algorithm where blocks of queries interact with blocks of keys and values, with all intermediate results fitting in fast SRAM. This approach reduces memory complexity from \(\mathcal{O}(N^2)\) to \(\mathcal{O}(N)\) and significantly improves hardware utilization by maximizing the use of fast memory (SRAM) over slow memory (HBM), resulting in both better memory efficiency and faster computation. The mathematical guarantee of equivalence to standard attention, combined with these performance benefits, makes it particularly valuable for training and deploying large language models where attention computations are a major bottleneck.</p>
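<p>The running-statistics update can be sketched for a single query vector in plain Python. The function names and toy inputs are mine, not taken from the FlashAttention code, and the real kernel also tiles over queries and keeps each block in SRAM, which this sketch does not model.</p>

```python
import math


def attention_reference(q, K, V):
    """Standard softmax attention for one query, using all scores at once."""
    scale = 1 / math.sqrt(len(q))
    s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(s)
    p = [math.exp(x - m) for x in s]
    l = sum(p)
    return [sum(p[j] * V[j][c] for j in range(len(V))) / l
            for c in range(len(V[0]))]


def attention_streaming(q, K, V, block_size=2):
    """Same result, visiting keys/values block by block with running m, l, o."""
    scale = 1 / math.sqrt(len(q))
    m = float("-inf")        # running maximum score
    l = 0.0                  # running softmax denominator
    o = [0.0] * len(V[0])    # running unnormalized output
    for start in range(0, len(K), block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
        m_new = max(m, max(s))
        alpha = math.exp(m - m_new)  # rescales old statistics; 0.0 on the first block
        p = [math.exp(x - m_new) for x in s]
        l = alpha * l + sum(p)
        o = [alpha * o_c + sum(p[j] * Vb[j][c] for j in range(len(Vb)))
             for c, o_c in enumerate(o)]
        m = m_new
    return [o_c / l for o_c in o]


q = [1.0, 0.5]
K = [[0.1, 0.2], [1.0, -1.0], [2.0, 0.3], [3.0, 1.0], [-0.5, 4.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
reference = attention_reference(q, K, V)
streamed = attention_streaming(q, K, V, block_size=2)
print(reference, streamed)
```

<p>The keys are ordered so that the running maximum changes between blocks, which exercises the rescaling step; the two results agree up to floating-point rounding.</p>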

<p><img src="/images/fine-tune.014.png" alt="alt text" /></p>

<h2 id="understand-react-and-cross-attention">Understand ReAct and cross-attention</h2>

<p>Why are ReAct agents effective at reasoning and acting? What is behind “Thought, Action, Observation”?</p>

<p>ReAct is one of the important LLM agent techniques (<a href="https://lnkd.in/gU4jB6FA">https://lnkd.in/gU4jB6FA</a>). Its effectiveness comes from its three major steps (Reasoning, Acting, and Observation) being tightly coupled through cross-attention mechanisms. The Reasoning step generates abstract thought representations in the transformer’s embedding space, where self-attention helps form coherent reasoning chains. These thought embeddings then flow into cross-attention layers that map them to concrete action embeddings, effectively translating abstract reasoning into executable actions. The Action step’s outputs generate observations, which are processed through another set of cross-attention layers that integrate these results back into the model’s understanding.</p>

<p>The key to ReAct’s effectiveness lies in how cross-attention serves as neural bridges between these steps: it creates learnable mappings between abstract thought space and concrete action space (Thought→Action), between actions and their outcomes (Action→Observation), and between observations and updated reasoning (Observation→Thought). This creates a continuous feedback loop where each step informs the next through focused attention weights, allowing the model to learn from experience and adapt its strategies. The cross-attention mechanisms also enable the model to maintain relevant context throughout the entire process, as attention weights highlight important information from previous steps while suppressing irrelevant details. This architecture naturally implements a form of working memory and metacognition, where the model can reflect on its own reasoning and actions through the attention patterns, leading to more effective problem-solving strategies. It is one of the effective ways to extend the LLM runtime for more “smartness”.</p>

<p><img src="/images/fine-tune.013.png" alt="alt text" /></p>

<h2 id="understand-constrained-decoder-and-json-mode">Understand constrained decoder and JSON mode</h2>

<p>How does GPT guarantee JSON output in its JSON mode? How is it implemented in other solutions like .txt and XGrammar?</p>

<p>One of the key techniques is called constrained decoding. It bridges neural language models with formal grammar constraints by modifying the model’s output distribution during generation. At each autoregressive step, instead of directly sampling from the LLM’s logits across its vocabulary (e.g., 128k tokens for Llama 3), the approach applies a mask derived from a context-free grammar (CFG) to ensure structural validity. Technically, this is implemented by setting logits of invalid tokens to -∞ before the softmax operation, effectively zeroing their sampling probabilities while preserving the relative probabilities among valid tokens. The grammar state is tracked using a pushdown automaton (PDA) that maintains a stack for nested structures. Modern implementations like XGrammar <a href="https://lnkd.in/gVyHKhp3">https://lnkd.in/gVyHKhp3</a> optimize this process by classifying tokens into context-independent ones (validity determined by current state only) and context-dependent ones (requiring full stack context), enabling efficient preprocessing and caching.</p>
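<p>The masking step itself is small enough to sketch. The example below is a toy under stated assumptions: the six-token vocabulary, the <code class="language-plaintext highlighter-rouge">EXPECTED</code> table, and both function names are made up for illustration, and the “grammar” is a fixed sequence of allowed token sets rather than a real pushdown automaton over a CFG. It only shows how invalid logits are set to -∞ before softmax and how the valid tokens keep their relative probabilities.</p>

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has ~100k entries.
VOCAB = ["{", "}", '"key"', ":", "1", "hello"]

# Toy grammar for objects shaped like {"key": VALUE}, where VALUE is 1 or "key".
# Each entry lists the tokens allowed at that position in the output.
EXPECTED = [{"{"}, {'"key"'}, {":"}, {"1", '"key"'}, {"}"}]

def allowed_token_ids(prefix):
    """Return the ids of tokens the toy grammar allows after `prefix`."""
    if len(prefix) >= len(EXPECTED):
        return set()
    return {i for i, tok in enumerate(VOCAB) if tok in EXPECTED[len(prefix)]}

def masked_softmax(logits, allowed):
    """Set logits of disallowed tokens to -inf, then apply softmax."""
    if not allowed:
        raise ValueError("no valid token in this grammar state")
    masked = [x if i in allowed else float("-inf") for i, x in enumerate(logits)]
    top = max(masked)
    exps = [math.exp(x - top) for x in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

<p>Even if the raw logits strongly prefer an invalid token such as <code class="language-plaintext highlighter-rouge">hello</code>, its probability after masking is exactly zero, so sampling can only continue along a structurally valid path.</p>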

<p>Of course, constrained decoding with a context-free grammar is not the only way to implement JSON mode. Structured generation is a broader research field that covers JSON along with other structures. It is a cornerstone of agent frameworks, because it lets agents communicate with and understand each other through JSON.</p>

<p><img src="/images/fine-tune.012.png" alt="alt text" /></p>

<h2 id="understand-rope-and-lost-in-the-middle">Understand RoPE and lost-in-the-middle</h2>

<p>Why can an LLM handle long context using Rotary Positional Embedding (RoPE)? And why does “lost-in-the-middle” come with it?</p>

<p>Rotary Positional Embedding (RoPE) is a simple and great idea: attention is about dot products of vectors, so why not use polar coordinates in multiple dimensions? In RoPE, the attention score depends only on relative positions, i.e. it is a sum of cosine terms, one per dimension pair, so any context can rotate and stack up while the attention is preserved. But a problem follows: the cosine function oscillates heavily when <code class="language-plaintext highlighter-rouge">|m-n|</code> becomes large, and without a good reference point the relative position simply gets lost. The higher the embedding dimension, the worse the attention decay.</p>
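<p>The relative-position property is easy to check numerically. The sketch below assumes the common RoPE convention of rotating consecutive dimension pairs by angles <code class="language-plaintext highlighter-rouge">pos * base ** (-i / d)</code>; the function names and the default <code class="language-plaintext highlighter-rouge">base</code> are choices made for this example rather than any specific model’s implementation.</p>

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions by angles pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # rotation frequency of this dimension pair
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))
```

<p>With these definitions, <code class="language-plaintext highlighter-rouge">dot(rope(q, 5), rope(k, 2))</code> and <code class="language-plaintext highlighter-rouge">dot(rope(q, 103), rope(k, 100))</code> should agree up to floating-point error, since both pairs are three positions apart.</p>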

<p>Let’s think of RoPE like a spiral staircase in a tall tower: as you go higher (higher dimensions), you rotate faster, but the fundamental structure (relative positions) stays consistent. This allows you to keep track of where you are relative to other positions, even in a very tall tower (long context). And the “lost-in-the-middle” problem is like trying to remember specific floors in the middle of the tower: you easily remember the ground floor (start) and top floor (end), but floors in the middle blur together because they lack these distinctive reference points and each middle floor looks similar to its neighbors.</p>

<p><img src="/images/fine-tune.011.png" alt="alt text" /></p>

<h2 id="understand-speculative-decoding">Understand speculative decoding</h2>

<p>What is “speculative decoding”? Why can it speed up LLM generation?</p>

<p>In my last post of LLM inference time <a href="https://lnkd.in/gu78UWtH">https://lnkd.in/gu78UWtH</a> I mentioned a few alternatives to “next token generation” in sequence, and speculative decoding is one of them. It accelerates LLM inference by using a small, fast “draft” model to predict multiple tokens ahead, e.g. “mat” “and” “sleep” for “The cat sits on the ___”, while letting the main model verify these predictions in parallel through a single forward pass, accepting correct predictions and falling back only when necessary - essentially trading some extra compute from a lightweight model to reduce the number of expensive forward passes in the large model.</p>

<p>This process is reminiscent of a modern CPU’s branch predictor: when a CPU sees an “if” statement, it guesses which way the branch will go before knowing the result, so the instruction flow keeps moving without much waiting. Speculative decoding shortens the total execution time by replacing N sequential forward passes of the large model with one round of drafting plus a single verification forward pass.</p>
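<p>A toy version of the draft-then-verify loop is sketched below. It uses the simple greedy-match variant (accept draft tokens while they equal the target model’s greedy choice), not the rejection-sampling scheme used with sampled outputs. The callables <code class="language-plaintext highlighter-rouge">target_next</code> and <code class="language-plaintext highlighter-rouge">draft_next</code> are hypothetical stand-ins for the two models, and the single parallel verification pass is only simulated by counting it once.</p>

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=3):
    """Greedy speculative decoding with toy next-token functions.

    target_next / draft_next: hypothetical callables mapping a token list
    to the next token. Returns (tokens, number_of_target_passes).
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. One verification pass of the large model. A real model scores
        #    all positions in parallel; here we call it per position but
        #    count it as a single pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # fall back to the target's token
                break
        else:
            # Every draft token matched: the same pass yields one more token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes
```

<p>The output is identical to what the large model would produce on its own with greedy decoding; the saving comes only from needing fewer large-model passes when the draft guesses well.</p>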

<p><img src="/images/fine-tune.010.png" alt="alt text" /></p>

<h2 id="understand-llm-inference-time">Understand LLM inference time</h2>

<p>From the first input token to the last output token, what exactly happens inside the LLM, and why does it take so long?</p>

<p>The total inference time breaks down as follows:</p>

<p>Total time = Position embeddings + Number of layers × (Self-attention computation + Feed-forward network computation + Layer norm operations) + Final layer norm + Output projection</p>

<p>Self-attention and the FFN take most of the computing time, and we have to run them 32 times if an LLM like Llama 8B has 32 layers. This also explains why an LLM has significantly different input and output speeds: the input sequence is fed in and goes through all 32 layers once (warming up the KV cache), while the output tokens go through the generation loop one by one. Each output token passes through all 32 layers, is appended to the sequence because generation is autoregressive, and only then can the next token be produced. There is research on more advanced token generation that goes beyond one-by-one output.</p>

<p>We can also understand how quantization speeds things up: attention and the FFN take most of the computing time, and the total time is roughly proportional to the number of generated tokens. If we use FP16 instead of FP32, the attention and FFN computing time can be cut in half, and the total computing time can drop by ~40% (layer norm time doesn’t change much with precision). With INT8, we can cut roughly another 30%, at an increased risk of precision loss.</p>
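<p>The ~40% figure can be read as simple Amdahl-style arithmetic. The helper below is a back-of-the-envelope sketch: the function name is made up, and the 80% share for attention plus FFN is an assumption chosen because it reproduces the ~40% reduction quoted above, not a measured number.</p>

```python
def remaining_time_fraction(accelerated_share, speedup):
    """Fraction of the original time left when `accelerated_share` of the
    work becomes `speedup` times faster and the rest is unchanged."""
    return (1.0 - accelerated_share) + accelerated_share / speedup

# Assumed: attention + FFN are ~80% of the time, and FP16 halves that part.
fp16_fraction = remaining_time_fraction(0.8, 2.0)
print(f"time left after FP16: {fp16_fraction:.0%}")
```

<p>Under that assumption the total time falls to about 60% of the original, which is why halving the dominant cost does not halve the whole.</p>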

<p><img src="/images/fine-tune.009.png" alt="alt text" /></p>

<h2 id="understand-lora-ranks">Understand LoRA ranks</h2>

<p>Why does rank matter in LoRA fine-tuning? Why does more knowledge adoption always come with a risk of overfitting in my LLM?</p>

<p>We love LoRA for its efficiency and low memory cost. LoRA fine-tuning is a low-rank decomposition of the update to a weight matrix, and a lower rank gives thinner matrices A and B. For example, if LoRA tunes the attention layers, a low rank only modifies a few attention patterns simultaneously, so it is less likely to break existing patterns and less likely to disrupt critical cross-attention mechanisms. We usually follow these rules of thumb:</p>

<ul>
  <li>Knowledge injection: lower ranks (4-8) often sufficient</li>
  <li>Domain adaptation: medium ranks (16-32) usually better</li>
  <li>Complex reasoning changes: might need higher ranks (64+)</li>
</ul>

<p>To understand the effect, consider that each row in the matrix is an update to one dimension, and that the ratio of the matrix’s nuclear norm to its Frobenius norm measures how widely the information spreads across the dimensions of the singular value matrix. The upper limit of this spread is set by the rank. This explains much of the fine-tuning effect: a low rank concentrates new knowledge in the first few dimensions, while a high rank can update more dimensions, so new knowledge reaches deeper but brings more risk of overfitting. Of course, the rank is only an upper limit on information spread; it doesn’t promise that the new information actually reaches that far.</p>

<p>You might wonder why it is “less than or equal” instead of “equal”. It is because of the Cauchy–Schwarz inequality for vectors <a href="https://lnkd.in/gKmjMKK6">https://lnkd.in/gKmjMKK6</a>, which can also describe proper time measurement in relativity, a.k.a. “you move fast, your clock runs slower”. There is always physics!</p>
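<p>The inequality can be checked directly on singular values. Applying Cauchy–Schwarz to the vector of singular values and a vector of ones gives: nuclear norm ≤ √rank × Frobenius norm, so the rank caps how far the ratio can grow. The sketch below works on a list of singular values instead of a full matrix to stay dependency-free; the function names and the tolerance are choices made for this example.</p>

```python
import math

def spread_ratio(singular_values):
    """Nuclear norm divided by Frobenius norm, from the singular values."""
    nuclear = sum(singular_values)                              # sum of sigma_i
    frobenius = math.sqrt(sum(s * s for s in singular_values))  # sqrt of sum sigma_i^2
    return nuclear / frobenius

def numerical_rank(singular_values, tol=1e-12):
    """Number of singular values that are meaningfully above zero."""
    return sum(1 for s in singular_values if s > tol)

# All energy in one direction: ratio 1, no spread at all.
print(spread_ratio([3.0, 0.0, 0.0, 0.0]))
# Energy spread evenly over four directions: ratio hits the bound sqrt(4).
print(spread_ratio([1.0, 1.0, 1.0, 1.0]))
```

<p>Equality holds only when all nonzero singular values are equal. A typical update with decaying singular values sits well below the bound, which matches the remark above that the rank is a ceiling, not a promise.</p>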

<p><img src="/images/fine-tune.008.png" alt="alt text" /></p>

<h2 id="understand-chain-of-thoughts-cot">Understand chain-of-thoughts (CoT)</h2>

<p>Why can an LLM do chain-of-thought? What exactly happens when an LLM receives a “think step by step” instruction?</p>

<p>In practice, CoT uses attention as working memory: each reasoning step is a computing cycle that evolves the hidden states in the neural network. When each new hidden state from a later reasoning step can query and update from the previous memory, the result is “step-by-step” reasoning. The key is the memory of previous states!</p>

<p>This helps us understand why CoT sometimes works well and sometimes doesn’t: if a problem only needs its previous state and a piece of memory, CoT works well; otherwise, we need more complex reasoning models like OpenAI o1, since humans can keep a long memory with branches and trial and error. Don’t forget humans can also think P and ~P!</p>

<p>It also gives a good hint for designing a memory package if we want to extend this memory mechanism with longer or external memory.</p>

<p><img src="/images/fine-tune.007.png" alt="alt text" /></p>

<h2 id="understand-tool-using">Understand tool using</h2>

<p>What is the magic behind an LLM’s tool use, as in Apple Intelligence? In other words, why can some language models understand and call an API while others cannot?</p>

<p>Tool use, like “function calling” in OpenAI <a href="https://lnkd.in/gjMwbmaM">https://lnkd.in/gjMwbmaM</a>, opens the door to driving intelligent agents with LLMs. Beyond just letting an LLM generate JSON to call an API, the model is trained and tuned to align tasks with tool capabilities using attention over the context and the tools. When we use such functionality, the LLM simply maximizes the conditional probability of a tool given the current context by comparing the context with each tool description. That is the root reason why one must describe a tool concisely and accurately in any tool-calling interface. It also explains why tool calling doesn’t need large models: it only needs attention alignment with tools, so small on-device LLMs like Phi3 or Llama 3.2 1B can do tool calling well if they are instruction-tuned well. Yes, it is part of Apple Intelligence LLM’s secret recipe.</p>

<p><img src="/images/fine-tune.006.png" alt="alt text" /></p>

<h2 id="understand-prompting">Understand prompting</h2>

<p>What exactly happens when an LLM receives a prompt, and why does “prompting” seem to work like magic?</p>

<p>Most “prompting” work today is about discrete prompts, e.g. a sentence of command. Prompting introduces a task-specific bias to the model’s output distribution through its activation patterns, effectively aligning the target task with the LLM’s pre-trained task manifold. With this short definition, we can easily see that prompts don’t change the LLM; instead, they activate certain parts of the neural network by breaking down the target task and aligning it with similar trained tasks inside the LLM. That is also why an LLM can’t really “reason” but rather simulates the reasoning process when parts of that process were trained in familiar ways. Smaller tasks with agents usually work better than one long, complex prompt because an LLM aligns small, simple tasks more easily, so we either define the task process ourselves or let another agent break down the complex task.</p>

<p>In short, prompting is about biasing the model and aligning it to the task.</p>

<p><img src="/images/fine-tune.005.png" alt="alt text" /></p>

<h2 id="understand-boltzmann-distribution-and-neural-networks">Understand Boltzmann distribution and neural networks</h2>

<p>What exactly did Geoffrey Hinton bring to neural networks and AI? A statistical meaning for neural networks!</p>

<p>Hinton and other researchers bridged the gap between statistical physics and neural networks by interpreting neural network input as probabilities instead of numbers, so that the optimization and generalization of neural networks make sense in terms of the Boltzmann distribution. Such energy-based models are the reason why we run gradient descent on log(P) and why a “temperature” parameter is used to control your LLM’s creativity. Read further on Wikipedia <a href="https://lnkd.in/geDtyTFK">https://lnkd.in/geDtyTFK</a>, and congratulations to John Hopfield and Geoffrey Hinton on winning the Nobel Prize in Physics!</p>

<p><img src="/images/fine-tune.004.png" alt="alt text" /></p>

<h2 id="understand-top_p-in-llm">Understand top_p in LLM</h2>

<p>What does “top P” mean in an LLM?</p>

<p>A quick follow-up to the last post on understanding high/low temperature in LLMs (<a href="https://lnkd.in/gcMDpSj4">https://lnkd.in/gcMDpSj4</a>): why can “top_p” also control the next token choice?</p>

<p>Top P is a threshold on cumulative probability mass: tokens are ranked from most to least likely, and only the top tokens whose cumulative probability reaches the threshold stay eligible for sampling. For a given probability distribution, a higher top P value admits more long-tail tokens. It offers more flexibility than a simple top K threshold because it adapts to different contexts and differently shaped token probability distributions. In most cases, a combination of temperature and top_p settings is good enough to control an LLM’s behavior.</p>
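<p>A minimal sketch of the filtering step, assuming the common nucleus-sampling formulation (sort by probability, keep the smallest prefix whose mass reaches the threshold, renormalize). The function name and return shape are choices made for this example; inference libraries differ in details such as tie handling.</p>

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches
    top_p, then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.75))  # keeps the two most likely tokens
print(top_p_filter(probs, 0.9))   # a higher top_p admits a third, rarer token
```

<p>Note how the number of surviving tokens depends on the shape of the distribution: a peaked distribution may keep a single token, while a flat one keeps many, which is exactly the flexibility a fixed top K lacks.</p>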

<p><img src="/images/fine-tune.003.png" alt="alt text" /></p>

<h2 id="understand-temperature-in-llm">Understand temperature in LLM</h2>

<p>What does “temperature” mean in an LLM?</p>

<p>Some friends recently asked me: why does a high temperature give more creative results but a higher risk of hallucination? Why does a low temperature lead to dumb results? How should we understand this magic parameter, and how should we use it?</p>

<p>Here is a single picture to understand it: it is the “T” in softmax with temperature from Hinton et al., “Distilling the Knowledge in a Neural Network” <a href="https://lnkd.in/ghCdXgWx">https://lnkd.in/ghCdXgWx</a>. With top-k/top-p token selection for the LLM’s next-token prediction, a higher temperature gives a “flatter” probability distribution, so long-tail tokens have a better chance of being chosen, hence more creativity. This is the root cause of the high/low temperature behaviors.</p>
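<p>The flattening effect is easy to see numerically. Below is a minimal sketch of softmax with temperature, where each logit is divided by T before the usual softmax; the function name and the example logits are made up for illustration.</p>

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / T. Higher T flattens, lower T sharpens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

<p>As T grows, probability moves from the top token toward the tail, so rarer tokens are sampled more often; as T shrinks toward zero, the distribution approaches always picking the single most likely token.</p>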

<p><img src="/images/fine-tune.002.png" alt="alt text" /></p>

<h2 id="fine-tune-a-llm-how-much-memory-do-i-need">Fine-tune an LLM: how much memory do I need?</h2>

<p>Assume you bought an RTX 4090 to play “Black Myth: Wukong” and you also want to use it for fine-tuning an LLM. Can your gaming power handle the task?</p>

<p>Let’s break it down:
🧠 Model: 2B parameters, FP16 
🎮 RTX 4090: 24GB VRAM</p>

<p>Memory cost:</p>
<ul>
  <li>Model weights: 4GB</li>
  <li>Gradients: 4GB</li>
  <li>Optimizer states: 8GB</li>
  <li>Activations: ~6GB</li>
</ul>

<p>Total: ~22GB</p>

<p>Good news! Your RTX 4090 can handle this with a little room to spare. You could even bump up the model size or batch size slightly for better performance.</p>

<p>Remember, actual usage may vary based on specific architectures and frameworks. But this gives you a solid starting point for understanding LLM fine-tuning memory requirements.</p>
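<p>The estimate above can be reproduced with a small calculator. This sketch rests on stated assumptions: 1GB is treated as 10^9 bytes, gradients are stored at the same precision as the weights, the optimizer keeps two states per parameter (as Adam does) at that precision, and the activation figure is simply passed in because it depends on batch size and sequence length. The function name and defaults are hypothetical.</p>

```python
def fine_tune_memory_gb(params_billion, bytes_per_param=2,
                        optimizer_states=2, activations_gb=6.0):
    """Rough full fine-tuning memory estimate in GB (assumptions above)."""
    weights = params_billion * bytes_per_param    # FP16: 2 bytes per parameter
    gradients = params_billion * bytes_per_param  # one gradient per parameter
    optimizer = params_billion * bytes_per_param * optimizer_states
    total = weights + gradients + optimizer + activations_gb
    return {"weights": weights, "gradients": gradients,
            "optimizer": optimizer, "activations": activations_gb,
            "total": total}

estimate = fine_tune_memory_gb(2.0)
print(estimate)
print("fits in 24GB:", estimate["total"] <= 24.0)
```

<p>Changing <code class="language-plaintext highlighter-rouge">params_billion</code> or <code class="language-plaintext highlighter-rouge">bytes_per_param</code> shows how quickly the budget is exhausted, for example when optimizer states are kept in FP32 instead of FP16.</p>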

<p>Of course, there are other approaches under Parameter-Efficient Fine-Tuning (PEFT), like LoRA or QLoRA, each with its own drawbacks and limits. Let’s talk about them next time.</p>

<p><img src="/images/fine-tune.001.png" alt="alt text" /></p>]]></content><author><name></name></author><summary type="html"><![CDATA[It is a live blog post of some knowledge snippets of AI to bridge the gap among text books, papers, other blog posts. Most content has been posted on my Linkedin.]]></summary></entry><entry><title type="html">Building a Lightweight Financial Agent: A Flexible Approach to Tool Use and Orchestration</title><link href="https://toooold.com/2024/09/15/tiny-financial-agent.html" rel="alternate" type="text/html" title="Building a Lightweight Financial Agent: A Flexible Approach to Tool Use and Orchestration" /><published>2024-09-15T00:00:00+00:00</published><updated>2024-09-15T00:00:00+00:00</updated><id>https://toooold.com/2024/09/15/tiny-financial-agent</id><content type="html" xml:base="https://toooold.com/2024/09/15/tiny-financial-agent.html"><![CDATA[<p>In the rapidly evolving field of AI agents, there’s a growing trend towards complex frameworks and libraries. However, for many practical applications, a simpler, more flexible approach can be just as effective. This blog post introduces a lightweight financial agent framework that demonstrates how powerful tool use and orchestration can be achieved without relying on heavy libraries like LangChain or LlamaIndex or CrewAI etc.</p>

<p>For most cases, one only needs “tool using” and “orchestration”, so why so complex? Check out the code <a href="https://github.com/phunterlau/tiny-financial-agent">here</a>.</p>

<p><img src="/images/fin-analysis.jpeg" alt="Red Panda financial analysis" /></p>

<h2 id="the-power-of-simplicity-and-flexibility">The Power of Simplicity and Flexibility</h2>

<p>Our framework consists of just three main components: a driver, tools, and orchestration functions. This simplicity, inspired by functional programming, offers several advantages:</p>

<ol>
  <li>Easy to understand and modify</li>
  <li>Not dependent on external packages beyond basic Python libraries and an API for language model interactions</li>
  <li>Flexible enough to handle both simple and complex queries</li>
  <li>Full control over the agent’s behavior, making it easier to adapt to specific use cases</li>
</ol>

<p>The core of our framework lies in its atomic tools and orchestration functions. Let’s explore how these components work together to create a flexible and powerful financial analysis agent. In this approach, humans define a few orchestration patterns and how each pattern calls tools, and the LLM maps each question to one or more patterns to solve the problem. Here is a sector analysis example where the user asks a complex question, “Considering the current economic climate, analyze the banking sector trends for the next 2 years and provide a comparative strategic investment analysis for JPMorgan Chase (JPM) and Bank of America (BAC).” The agent understands it, maps it to a sector analysis orchestration flow, picks the right tools, and summarizes the results:</p>

<p><img src="/images/sector-analysis.png" alt="Sector analysis example" /></p>

<h2 id="atomic-tools-the-building-blocks">Atomic Tools: The Building Blocks</h2>

<p>Atomic tools are the fundamental operations our agent can perform. In our financial agent, these include functions like <code class="language-plaintext highlighter-rouge">get_stock_price</code>, <code class="language-plaintext highlighter-rouge">get_company_financials</code>, and <code class="language-plaintext highlighter-rouge">get_income_statement</code>. Here’s an example of how an atomic tool might be implemented:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_stock_price</span><span class="p">(</span><span class="n">symbol</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">FinancialData</span><span class="p">:</span>
    <span class="n">url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://financialmodelingprep.com/api/v3/quote-order/</span><span class="si">{</span><span class="n">symbol</span><span class="si">}</span><span class="s">?apikey=</span><span class="si">{</span><span class="n">API_KEY</span><span class="si">}</span><span class="s">"</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">FinancialData</span><span class="p">(</span><span class="o">**</span><span class="n">data</span><span class="p">)</span>
</code></pre></div></div>

<p>This function makes an API call to retrieve stock price data and returns it in a structured format. The simplicity of these atomic tools makes them easy to test, maintain, and extend.</p>

<h2 id="orchestration-connecting-the-dots">Orchestration: Connecting the Dots</h2>

<p>While atomic tools are powerful, they often need to be combined in complex ways to perform meaningful analyses. This is where orchestration functions come in. Orchestration allows us to dynamically connect tools using chain-of-thought (CoT) reasoning, enabling more sophisticated analyses.</p>

<p>Let’s look at two orchestration functions to illustrate the range of complexity possible within this framework:</p>

<ol>
  <li>A simple orchestration function: SectorAnalysis</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SectorAnalysis</span><span class="p">(</span><span class="n">OrchestrationFunction</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">gather_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sector</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">top_n</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
        <span class="n">companies</span> <span class="o">=</span> <span class="n">get_top_companies</span><span class="p">(</span><span class="n">sector</span><span class="p">,</span> <span class="n">top_n</span><span class="p">)</span>
        <span class="n">sector_data</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">company</span> <span class="ow">in</span> <span class="n">companies</span><span class="p">:</span>
            <span class="n">financials</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">use_atomic_function</span><span class="p">(</span><span class="s">'get_company_financials'</span><span class="p">,</span> <span class="n">company</span><span class="p">[</span><span class="s">'symbol'</span><span class="p">])</span>
            <span class="n">income</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">use_atomic_function</span><span class="p">(</span><span class="s">'get_income_statement'</span><span class="p">,</span> <span class="n">company</span><span class="p">[</span><span class="s">'symbol'</span><span class="p">])</span>
            <span class="n">stock_price</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">use_atomic_function</span><span class="p">(</span><span class="s">'get_stock_price'</span><span class="p">,</span> <span class="n">company</span><span class="p">[</span><span class="s">'symbol'</span><span class="p">])</span>
            <span class="n">sector_data</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
                <span class="s">"symbol"</span><span class="p">:</span> <span class="n">company</span><span class="p">[</span><span class="s">'symbol'</span><span class="p">],</span>
                <span class="s">"name"</span><span class="p">:</span> <span class="n">financials</span><span class="p">.</span><span class="n">companyName</span><span class="p">,</span>
                <span class="s">"market_cap"</span><span class="p">:</span> <span class="n">financials</span><span class="p">.</span><span class="n">marketCap</span><span class="p">,</span>
                <span class="s">"revenue"</span><span class="p">:</span> <span class="n">income</span><span class="p">.</span><span class="n">revenue</span><span class="p">,</span>
                <span class="s">"net_income"</span><span class="p">:</span> <span class="n">income</span><span class="p">.</span><span class="n">net_income</span><span class="p">,</span>
                <span class="s">"pe_ratio"</span><span class="p">:</span> <span class="n">stock_price</span><span class="p">.</span><span class="n">PE</span>
            <span class="p">})</span>
        <span class="k">return</span> <span class="p">{</span><span class="s">"sector"</span><span class="p">:</span> <span class="n">sector</span><span class="p">,</span> <span class="s">"top_n"</span><span class="p">:</span> <span class="n">top_n</span><span class="p">,</span> <span class="s">"companies"</span><span class="p">:</span> <span class="n">sector_data</span><span class="p">}</span>
</code></pre></div></div>

<p>This function performs a straightforward analysis of top companies in a given sector. It uses atomic functions in a predetermined sequence to gather and structure data.</p>
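<p>How <code class="language-plaintext highlighter-rouge">use_atomic_function</code> finds a tool by name is not shown in this post. One plausible way to wire it is a plain dictionary registry, sketched below. Everything here is hypothetical and written for illustration: the constructor argument, the <code class="language-plaintext highlighter-rouge">PriceSnapshot</code> class, and the stubbed <code class="language-plaintext highlighter-rouge">fake_get_stock_price</code> are assumptions, and the actual repository may do this differently.</p>

```python
from typing import Any, Callable, Dict, List

class OrchestrationFunction:
    """Hypothetical base class: looks up atomic tools by name in a registry."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self.tools = tools

    def use_atomic_function(self, name: str, *args: Any, **kwargs: Any) -> Any:
        if name not in self.tools:
            raise KeyError(f"unknown atomic tool: {name}")
        return self.tools[name](*args, **kwargs)

def fake_get_stock_price(symbol: str) -> Dict[str, Any]:
    # Stub standing in for a real API call so the sketch runs offline.
    return {"symbol": symbol, "price": 100.0}

class PriceSnapshot(OrchestrationFunction):
    """Hypothetical orchestration: fetch a price for each symbol."""

    def gather_data(self, symbols: List[str]) -> Dict[str, Any]:
        return {s: self.use_atomic_function("get_stock_price", s)
                for s in symbols}

snapshot = PriceSnapshot({"get_stock_price": fake_get_stock_price})
print(snapshot.gather_data(["JPM", "BAC"]))
```

<p>Passing the registry in explicitly keeps orchestration functions easy to test: swapping a real API tool for a stub is a one-line change, which fits the functional style described earlier.</p>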

<ol start="2">
  <li>A complex orchestration function: CompanyComparativeAnalysis</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CompanyComparativeAnalysis</span><span class="p">(</span><span class="n">OrchestrationFunction</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">gather_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">symbol1</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">symbol2</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">time_horizon</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
        <span class="n">company1_data</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_gather_company_data</span><span class="p">(</span><span class="n">symbol1</span><span class="p">)</span>
        <span class="n">company2_data</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_gather_company_data</span><span class="p">(</span><span class="n">symbol2</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"company1"</span><span class="p">:</span> <span class="n">company1_data</span><span class="p">,</span>
            <span class="s">"company2"</span><span class="p">:</span> <span class="n">company2_data</span><span class="p">,</span>
            <span class="s">"time_horizon"</span><span class="p">:</span> <span class="n">time_horizon</span>
        <span class="p">}</span>
    
    <span class="k">def</span> <span class="nf">_gather_company_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">symbol</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
        <span class="n">financials</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">use_atomic_function</span><span class="p">(</span><span class="s">'get_company_financials'</span><span class="p">,</span> <span class="n">symbol</span><span class="p">)</span>
        <span class="n">income</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">use_atomic_function</span><span class="p">(</span><span class="s">'get_income_statement'</span><span class="p">,</span> <span class="n">symbol</span><span class="p">)</span>
        <span class="n">stock_price</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">use_atomic_function</span><span class="p">(</span><span class="s">'get_stock_price'</span><span class="p">,</span> <span class="n">symbol</span><span class="p">)</span>
        <span class="n">historical_data</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">use_atomic_function</span><span class="p">(</span><span class="s">'get_historical_price_data'</span><span class="p">,</span> <span class="n">symbol</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"symbol"</span><span class="p">:</span> <span class="n">symbol</span><span class="p">,</span>
            <span class="s">"financials"</span><span class="p">:</span> <span class="n">financials</span><span class="p">,</span>
            <span class="s">"income"</span><span class="p">:</span> <span class="n">income</span><span class="p">,</span>
            <span class="s">"stock_price"</span><span class="p">:</span> <span class="n">stock_price</span><span class="p">,</span>
            <span class="s">"historical_data"</span><span class="p">:</span> <span class="n">historical_data</span>
        <span class="p">}</span>

    <span class="k">def</span> <span class="nf">prepare_prompt</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"""
        Perform a comparative analysis of </span><span class="si">{</span><span class="n">data</span><span class="p">[</span><span class="s">'company1'</span><span class="p">][</span><span class="s">'symbol'</span><span class="p">]</span><span class="si">}</span><span class="s"> and </span><span class="si">{</span><span class="n">data</span><span class="p">[</span><span class="s">'company2'</span><span class="p">][</span><span class="s">'symbol'</span><span class="p">]</span><span class="si">}</span><span class="s"> over a </span><span class="si">{</span><span class="n">data</span><span class="p">[</span><span class="s">'time_horizon'</span><span class="p">]</span><span class="si">}</span><span class="s"> time horizon.
        Include a competitive analysis and assessment of investment potential for both companies.
        
        Company 1 (</span><span class="si">{</span><span class="n">data</span><span class="p">[</span><span class="s">'company1'</span><span class="p">][</span><span class="s">'symbol'</span><span class="p">]</span><span class="si">}</span><span class="s">) Data:
        </span><span class="si">{</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'company1'</span><span class="p">],</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s">
        
        Company 2 (</span><span class="si">{</span><span class="n">data</span><span class="p">[</span><span class="s">'company2'</span><span class="p">][</span><span class="s">'symbol'</span><span class="p">]</span><span class="si">}</span><span class="s">) Data:
        </span><span class="si">{</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'company2'</span><span class="p">],</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s">
        
        Provide a comprehensive analysis covering:
        1. Competitive position of both companies
        2. Financial performance comparison
        3. Growth prospects over the </span><span class="si">{</span><span class="n">data</span><span class="p">[</span><span class="s">'time_horizon'</span><span class="p">]</span><span class="si">}</span><span class="s"> time horizon
        4. Potential risks and opportunities
        5. Overall investment potential comparison
        """</span>
</code></pre></div></div>

<p>This more complex function gathers four data sets per company through atomic functions (company financials, income statement, stock price, and historical price data) and merges both companies’ data, plus the time horizon, into a single prompt. The orchestration function itself decides which tools to call and how to combine their outputs.</p>

<h2 id="the-power-of-orchestration-in-action">The Power of Orchestration in Action</h2>

<p>To see how this works in practice, let’s trace how a complex query triggers the appropriate orchestration function:</p>

<p>Query: “Compare the investment potential of Microsoft (MSFT) and Google (GOOGL) over the next 3 years, including a competitive analysis of both companies.”</p>

<p>This query would activate the <code class="language-plaintext highlighter-rouge">CompanyComparativeAnalysis</code> orchestration function:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># CompanyComparativeAnalysis execution
</span><span class="n">company1_data</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_gather_company_data</span><span class="p">(</span><span class="s">'MSFT'</span><span class="p">)</span>
<span class="n">company2_data</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_gather_company_data</span><span class="p">(</span><span class="s">'GOOGL'</span><span class="p">)</span>

<span class="c1"># For each company, the following atomic functions are called:
# - get_company_financials
# - get_income_statement
# - get_stock_price
# - get_historical_price_data
</span>
<span class="c1"># The gathered data is then used to prepare a comprehensive prompt for the language model
</span></code></pre></div></div>

<p>This example shows how our framework handles a complex query inside a single orchestration function: eight atomic calls (four per company) feed one prompt, and the language model then produces the comparative analysis, covering competitive positioning and investment potential for both companies over the specified time horizon.</p>
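<p>The data flow in this example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the framework’s actual code: <code class="language-plaintext highlighter-rouge">gather_company_data</code> is a hypothetical stub that returns placeholder values where the real atomic functions would fetch data, and <code class="language-plaintext highlighter-rouge">build_comparison_input</code> is an assumed helper name. Only the dictionary shape (<code class="language-plaintext highlighter-rouge">company1</code>, <code class="language-plaintext highlighter-rouge">company2</code>, <code class="language-plaintext highlighter-rouge">time_horizon</code>) is taken from <code class="language-plaintext highlighter-rouge">prepare_prompt</code>.</p>

```python
from typing import Any, Dict


def gather_company_data(symbol: str) -> Dict[str, Any]:
    """Hypothetical stub: the real version would call the four atomic functions."""
    return {
        "symbol": symbol,
        "financials": {"note": "placeholder"},
        "income": {"note": "placeholder"},
        "stock_price": {"note": "placeholder"},
        "historical_data": {"note": "placeholder"},
    }


def build_comparison_input(symbol1: str, symbol2: str, time_horizon: str) -> Dict[str, Any]:
    """Assemble the dictionary shape that prepare_prompt reads (assumed helper)."""
    return {
        "company1": gather_company_data(symbol1),
        "company2": gather_company_data(symbol2),
        "time_horizon": time_horizon,
    }


data = build_comparison_input("MSFT", "GOOGL", "3 years")
print(sorted(data.keys()))
```

<p>Keeping the per-company gathering in one helper means the comparison logic only has to know about the combined dictionary, which is the same separation the orchestration function relies on.</p>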

<h2 id="flexibility-in-action-the-functioncallingagent">Flexibility in Action: The FunctionCallingAgent</h2>

<p>The heart of our framework’s flexibility lies in the <code class="language-plaintext highlighter-rouge">FunctionCallingAgent</code> class. It passes the available orchestration functions to the model as function schemas, lets the model choose which one to call for the user’s query, and then executes that choice. Here’s a simplified version of its <code class="language-plaintext highlighter-rouge">chat</code> method:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">chat</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">memory</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">query</span><span class="p">})</span>
    
    <span class="n">response</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">llm</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="s">"gpt-4o-mini"</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">memory</span><span class="p">,</span>
        <span class="n">functions</span><span class="o">=</span><span class="p">[</span><span class="n">tool</span><span class="p">.</span><span class="n">model_dump</span><span class="p">()</span> <span class="k">for</span> <span class="n">tool</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tools</span><span class="p">],</span>
        <span class="n">function_call</span><span class="o">=</span><span class="s">"auto"</span>
    <span class="p">)</span>
    
    <span class="c1"># If the model requested a function call, run the matching orchestration function.
</span>    <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">function_call</span><span class="p">:</span>
        <span class="n">function_call</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">function_call</span>
        <span class="n">function_name</span> <span class="o">=</span> <span class="n">function_call</span><span class="p">.</span><span class="n">name</span>
        <span class="n">function_args</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">function_call</span><span class="p">.</span><span class="n">arguments</span><span class="p">)</span>
        
        <span class="n">result</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">orchestration_functions</span><span class="p">[</span><span class="n">function_name</span><span class="p">].</span><span class="n">execute</span><span class="p">(</span><span class="o">**</span><span class="n">function_args</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">memory</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"function"</span><span class="p">,</span> <span class="s">"name"</span><span class="p">:</span> <span class="n">function_name</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">result</span><span class="p">)})</span>
    
    <span class="c1"># Ask the model for the final answer, using everything stored in memory so far.
</span>    <span class="n">final_response</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">llm</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="s">"gpt-4o-mini"</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">memory</span>
    <span class="p">)</span>
    
    <span class="bp">self</span><span class="p">.</span><span class="n">memory</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"assistant"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">final_response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span><span class="p">})</span>
    <span class="k">return</span> <span class="n">final_response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>
</code></pre></div></div>

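<p>With this design, the model selects the orchestration function that best matches the query, and the agent executes it and feeds the result back for the final answer. If the model requests no function call, the agent skips execution and answers directly from the conversation history.</p>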
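<p>The simplified <code class="language-plaintext highlighter-rouge">chat</code> method assumes the model always returns a known function name and valid JSON arguments. One way to harden the dispatch step is sketched below. The <code class="language-plaintext highlighter-rouge">dispatch</code> helper and its error messages are hypothetical and not part of the original agent; the idea is to return an error string that can be stored in memory as the function result rather than letting an exception end the conversation.</p>

```python
import json
from typing import Any, Callable, Dict


def dispatch(registry: Dict[str, Callable[..., Any]], function_name: str, raw_arguments: str) -> str:
    """Run the named function and return its result as text, or an error string."""
    handler = registry.get(function_name)
    if handler is None:
        return f"Error: unknown function '{function_name}'"
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return f"Error: arguments were not valid JSON ({exc.msg})"
    if not isinstance(arguments, dict):
        return "Error: arguments must be a JSON object"
    try:
        return str(handler(**arguments))
    except TypeError as exc:
        # Usually means missing or unexpected keyword arguments; this also
        # catches a TypeError raised inside the handler itself.
        return f"Error: bad arguments for '{function_name}': {exc}"


# Usage with a stand-in orchestration function (hypothetical).
registry = {"compare": lambda symbol1, symbol2: f"{symbol1} vs {symbol2}"}
print(dispatch(registry, "compare", '{"symbol1": "MSFT", "symbol2": "GOOGL"}'))
print(dispatch(registry, "missing", "{}"))
```

<p>Returning the error as text gives the model a chance to correct itself on the next turn, at the cost of hiding failures from the caller. Whether that trade-off fits depends on how the agent is used.</p>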

<h2 id="conclusion-a-flexible-design-pattern">Conclusion: A Flexible Design Pattern</h2>

<p>The agentic framework presented here is not just a collection of tools, but a design pattern for approaching complex problems. By separating atomic tools from orchestration functions and employing a flexible function-calling agent, we create a system that can easily adapt to new scenarios or be extended with new capabilities.</p>
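<p>As a rough illustration of that separation, the sketch below adds a new orchestration function without changing the atomic tools or the agent. The names <code class="language-plaintext highlighter-rouge">ATOMIC_TOOLS</code>, <code class="language-plaintext highlighter-rouge">OrchestrationFunction</code>, and <code class="language-plaintext highlighter-rouge">PriceSnapshot</code> are assumptions made for this example, and the tools return placeholder values; only <code class="language-plaintext highlighter-rouge">use_atomic_function</code> and <code class="language-plaintext highlighter-rouge">execute</code> mirror method names used in this post.</p>

```python
from typing import Any, Callable, Dict

# Hypothetical atomic tools: each does one thing and knows nothing about orchestration.
ATOMIC_TOOLS: Dict[str, Callable[[str], Any]] = {
    "get_stock_price": lambda symbol: {"symbol": symbol, "price": None},
    "get_income_statement": lambda symbol: {"symbol": symbol, "revenue": None},
}


class OrchestrationFunction:
    """Base class: subclasses decide which atomic tools to call and how to combine them."""

    def __init__(self, atomic_tools: Dict[str, Callable[[str], Any]]) -> None:
        self.atomic_tools = atomic_tools

    def use_atomic_function(self, name: str, symbol: str) -> Any:
        return self.atomic_tools[name](symbol)

    def execute(self, **kwargs: Any) -> Dict[str, Any]:
        raise NotImplementedError


class PriceSnapshot(OrchestrationFunction):
    """A new capability built only from existing atomic tools."""

    def execute(self, symbol: str) -> Dict[str, Any]:
        return {
            "symbol": symbol,
            "stock_price": self.use_atomic_function("get_stock_price", symbol),
        }


# Registering the new function is a single dictionary entry.
orchestration_functions = {"PriceSnapshot": PriceSnapshot(ATOMIC_TOOLS)}
result = orchestration_functions["PriceSnapshot"].execute(symbol="MSFT")
print(result)
```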

<p>This approach also positions us well for future developments in AI. As more advanced chain-of-thought models become available, we can easily adapt our framework. We could use smaller, more efficient models for atomic tool use, reserving the more powerful CoT models for complex orchestration tasks.</p>
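<p>One simple way to express that split is a routing table keyed by task type. The sketch below is only an assumption about how such routing might look: <code class="language-plaintext highlighter-rouge">pick_model</code> and the task labels are hypothetical, <code class="language-plaintext highlighter-rouge">gpt-4o-mini</code> is the model used elsewhere in this post, and <code class="language-plaintext highlighter-rouge">your-reasoning-model</code> is a placeholder rather than a real model name.</p>

```python
from typing import Dict

# Placeholder mapping from task kind to model name (assumed, not from the framework).
MODEL_BY_TASK: Dict[str, str] = {
    "atomic": "gpt-4o-mini",
    "orchestration": "your-reasoning-model",
}


def pick_model(task_kind: str) -> str:
    """Return the model for a task kind, falling back to the small model."""
    return MODEL_BY_TASK.get(task_kind, MODEL_BY_TASK["atomic"])


print(pick_model("orchestration"))
print(pick_model("atomic"))
```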

<p>While there’s certainly a place for comprehensive agent frameworks, there’s also value in understanding how to build lightweight, customizable agents from the ground up. This approach gives developers more control, a better understanding of their agents’ behavior, and the flexibility to adapt to new developments in AI technology.</p>

<p>The complete code for this financial agent example, along with additional documentation, can be found at <a href="https://github.com/phunterlau/tiny-financial-agent">GitHub link</a>. We encourage you to explore, adapt, and build upon this framework for your own projects.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In the rapidly evolving field of AI agents, there’s a growing trend towards complex frameworks and libraries. However, for many practical applications, a simpler, more flexible approach can be just as effective. This blog post introduces a lightweight financial agent framework that demonstrates how powerful tool use and orchestration can be achieved without relying on heavy libraries like LangChain or LlamaIndex or CrewAI etc.]]></summary></entry></feed>