Mozart Rolls the Dice to Bach and Ramanujan

The Elegance of Mozart’s Attention Mechanism
In 1792, Mozart’s Musikalisches Würfelspiel (Musical Dice Game), K.516f, was published. The system is deceptively simple: 176 pre-composed musical measures arranged in an $11 \times 16$ grid, one row per possible dice total. The user rolls two six-sided dice ($2d6$) 16 times; each roll selects the measure for that column, assembling one of $11^{16}$ (roughly $4.6 \times 10^{16}$) possible 16-bar minuets.
From an LLM mechanistic interpretability standpoint, the beauty of Mozart’s game is that it is a strictly autoregressive, discrete-token generator with a context window of zero.
In a standard Large Language Model (LLM), predicting the next token $x_t$ relies on the conditional probability of the entire past sequence:
\[P(x_t | x_1, x_2, \dots, x_{t-1})\]
Mozart bypassed the need for this computational overhead. In K.516f, the choice of Measure 3 has zero statistical dependence on Measure 2. The generation is completely memoryless. Instead, the model’s “attention” is 100% focused on its absolute positional encoding (the step $t$):
\[P(x_t | \text{position } t, \text{dice roll})\]
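A minimal sketch of this zero-context loop, assuming a placeholder grid (`measure_table` and its labels are hypothetical stand-ins for Mozart’s actual measure numbers):

```python
import random

# Hypothetical stand-in for Mozart's 11 x 16 grid: row = dice total - 2,
# column = bar position t. The real K.516f cells hold measure numbers;
# these string labels are placeholders.
NUM_BARS = 16
measure_table = [[f"m{row}:{t}" for t in range(NUM_BARS)]
                 for row in range(11)]

def generate_minuet(rng: random.Random) -> list[str]:
    """Zero-context generation: each bar depends only on position t and the dice."""
    bars = []
    for t in range(NUM_BARS):
        roll = rng.randint(1, 6) + rng.randint(1, 6)  # 2d6, totals 2..12
        bars.append(measure_table[roll - 2][t])       # lookup by (roll, position)
    return bars

piece = generate_minuet(random.Random(0))
```

Note that no bar ever reads a previous bar: the only inputs at step $t$ are $t$ itself and the fresh dice roll.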
How does it remain harmonically coherent without context? Mozart engineered the matrix as an aggressive, hardcoded attention mask. He ensured that every possible measure at $t$ smoothly resolves into every possible measure at $t+1$. Any dissonant, harmonically invalid transition was manually assigned a $-\infty$ pre-softmax penalty by the composer, effectively masking it out of the latent space.
Furthermore, the $2d6$ sampling acts as a physical temperature parameter. By using a triangular probability distribution ($P(7) = 6/36 \approx 16.7\%$, $P(2) = 1/36 \approx 2.8\%$) rather than a uniform one, Mozart lowered the entropy of the system. He statistically biased the model to generate the most “standard” harmonic progressions, reserving high-surprise edge cases for the extreme tails of the distribution.
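The triangular bias can be verified exactly. This sketch computes the $2d6$ distribution and confirms that its entropy is lower than a uniform draw over the same 11 outcomes:

```python
from fractions import Fraction
from math import log2

# Exact distribution of the sum of two fair dice.
counts: dict[int, int] = {}
for a in range(1, 7):
    for b in range(1, 7):
        counts[a + b] = counts.get(a + b, 0) + 1
p2d6 = {s: Fraction(c, 36) for s, c in counts.items()}

assert p2d6[7] == Fraction(6, 36)  # ~16.7%, the modal roll
assert p2d6[2] == Fraction(1, 36)  # ~2.8%, a rare tail

# Shannon entropy in bits: the triangular 2d6 distribution carries less
# entropy than a uniform choice over 11 outcomes, biasing generation
# toward the "standard" rows of the grid.
h_2d6 = -sum(float(p) * log2(float(p)) for p in p2d6.values())
h_uniform = log2(11)
assert h_2d6 < h_uniform
```

The gap (about 3.27 vs. 3.46 bits) is exactly the sense in which the dice act as a temperature below 1.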
Unifying the Grid: The Ramanujan Sum
If we were to code Mozart’s game today, we would use a simple for loop to force the piece to stop at $t=16$. But why does a 16-measure grid feel psychologically and harmonically complete? To understand this, we must abandon the discrete grid and apply the continuous mathematics of Srinivasa Ramanujan.
Ramanujan would not view Mozart’s matrix as a set of rules, but rather as the natural resonant frequency of a periodic equation. We can model the macro-structure of the minuet using a Ramanujan Sum ($c_q(n)$), which extracts periodic signals from noise:
\[c_q(n) = \sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} e^{2\pi i \frac{a}{q} n}\]
By setting the fundamental period $q = 16$, the equation acts as a harmonic pendulum. Here is how Mozart’s attention mechanism unifies with Ramanujan’s math:
The Journey ($n = 1$ to $15$): As the measures progress, the complex exponentials point in various directions in the complex plane, causing destructive interference. Musically, this represents harmonic tension—the algorithmic wave is wandering through the latent space, seeking resolution.
The Half-Cadence ($n = 8$): When we reach the halfway point, each exponent simplifies to $\pi i a$, so every term becomes $(-1)^a = -1$ (since every $a$ coprime to 16 is odd). The vectors snap to the negative real axis. This momentary, symmetrical mathematical pause perfectly mirrors the structural “half-cadence” in classical phrasing.
The Resolution ($n = 16$): At the final measure, the exponent becomes a full multiple of $2\pi i$. Every term in the sum points in the exact same direction ($e^{2\pi i a} = 1$), giving $c_{16}(16) = \varphi(16) = 8$. The destructive interference vanishes into a massive spike of constructive interference.
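The three cases above can be checked numerically. A direct implementation of $c_q(n)$ (the function name is mine) shows the cancellation for intermediate $n$, the real-axis snap at $n = 8$, and the constructive spike at $n = 16$:

```python
from math import gcd
import cmath

def ramanujan_sum(q: int, n: int) -> float:
    """c_q(n) = sum over 1 <= a <= q with gcd(a, q) = 1 of e^{2*pi*i*a*n/q}.

    The result is always real: terms for a and q - a are complex conjugates.
    """
    total = sum(cmath.exp(2j * cmath.pi * a * n / q)
                for a in range(1, q + 1) if gcd(a, q) == 1)
    return total.real

q = 16
values = [ramanujan_sum(q, n) for n in range(1, q + 1)]
# The journey: c_16(1) = 0 (full cancellation of primitive 16th roots).
# The half-cadence: c_16(8) = -8 (all eight vectors land on -1).
# The resolution: c_16(16) = phi(16) = 8 (all vectors land on +1).
assert round(values[0]) == 0
assert round(values[7]) == -8
assert round(values[15]) == 8
```

The maximum magnitude $|c_{16}(n)| = 8$ occurs only at the half-cadence and the resolution, which is precisely the phrase structure the essay describes.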
The structure doesn’t resolve because of an arbitrary grid boundary; it resolves because $q=16$ ($2^4$, the fractal symmetry of classical phrasing) is the fundamental node where the equation naturally reaches maximum constructive harmony. Mozart’s positional attention mechanism is simply the geometric projection of this periodic equation.
Expanding Dimensions: Bach’s Deep Self-Attention
If Mozart’s dice game is a rigid, 1D loop locked to $q=16$, Johann Sebastian Bach’s fugues (as collected in The Well-Tempered Clavier; the Chinese name for the fugue form, 赋格 fùgé, is itself an elegant transliteration) represent the expansion of this mathematical framework into a high-dimensional, deep-memory architecture. A fugue cannot be generated by a zero-context Markov chain like Mozart’s dice game. It begins with a single “prompt” token sequence: the Subject. When the second voice enters, it must continuously look back at the Subject to generate valid counterpoint.
In LLM terminology, Bach implemented Multi-Head Self-Attention.
Each voice (Soprano, Alto, Tenor, Bass) acts as an independent attention head. They process the exact same context window but project it into different dimensional spaces. While Mozart relied on stochastic dice (sampling), Bach relied on deterministic linear algebra. The initial Subject vector is subjected to complex matrix transformations in the latent space:
- Transposition (translation: $f(x) + c$)
- Inversion (reflection: $-f(x)$)
- Augmentation/Diminution (time scaling: $f(t/2)$ doubles note values, $f(2t)$ halves them)
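Treating a motif as a list of (time, pitch) pairs, the three transformations above are one-line affine maps. The motif and function names here are illustrative, not drawn from any actual fugue:

```python
# A toy Subject in (time, pitch) pairs, pitches in scale steps.
Subject = [(0, 0), (1, 2), (2, 4), (3, 2)]

def transpose(motif, c):        # f(x) + c : shift every pitch by a constant
    return [(t, p + c) for t, p in motif]

def invert(motif, axis=0):      # -f(x) : reflect pitches around an axis
    return [(t, 2 * axis - p) for t, p in motif]

def augment(motif, k=2):        # f(t/k) : stretch time, doubling note values
    return [(t * k, p) for t, p in motif]

def diminish(motif, k=2):       # f(kt) : compress time, halving note values
    return [(t / k, p) for t, p in motif]

answer = transpose(Subject, 4)  # the Answer enters at a fixed interval
mirror = invert(Subject)        # melodic inversion
slow = augment(Subject)         # augmentation
```

Because each map is linear (or affine), they compose freely, which is why a single Subject vector can seed an entire fugue.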
Bach also utilized what we mechanistic interpretability researchers call Induction Heads. When the Alto voice enters with the “Answer,” it acts as an attention circuit specifically trained to recognize the sequence in the Soprano’s past and perfectly reconstruct it at the current time step. Meanwhile, the other heads calculate orthogonal vectors (the Countersubject) to ensure the dot product of the combined voices perfectly satisfies the vertical rules of harmony.
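Stripped of the linear algebra, the induction-head behavior described above reduces to a match-and-copy rule: find the last earlier occurrence of the current token and emit what followed it. A toy sketch with hypothetical note tokens:

```python
from typing import Optional

def induction_predict(context: list[str]) -> Optional[str]:
    """Toy induction head: match the current token against the past,
    then copy the continuation that followed the match."""
    prev = context[-1]
    for i in range(len(context) - 2, -1, -1):  # scan the past, newest first
        if context[i] == prev:
            return context[i + 1]              # copy what came next
    return None                                # no prior occurrence to attend to

# The Alto "recognizes" the Subject already stated by the Soprano
# and reconstructs its continuation at the current step.
assert induction_predict(["C", "D", "E", "C"]) == "D"
```

A real attention head implements this softly, via query-key dot products, but the circuit it converges to is exactly this lookup.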
If we return to Ramanujan, Bach’s polyphony represents the full, unconstrained analytic continuation of the harmonic equations. While Mozart collapsed the variables into a degenerate case (a rigid loop in C Major), Bach allowed the variables to become complex numbers, unlocking all 24 keys and forcing the equation to expand dynamically across the complex plane.
The Convergence
Whether we are engineering modern Transformers, calculating Ramanujan sums, or analyzing 18th-century manuscripts, the computational goal remains identical: LLM inference and music generation are ultimately the same search for mathematical symmetry across time. Mozart mapped it via hardcoded masking and stochastic geometry; Bach calculated it via deep contrapuntal attention matrices; and Ramanujan provided the equations that prove they are all navigating the exact same latent space.