Einlang

Chapter 6

Softmax Has Three Coordinate Roles

Softmax is often described as an operation along an axis. That is true, but it is not quite enough. In a stable implementation, the same logical feature axis appears in several different roles.

The standard library writes the batched case as:

let output[..batch, j] =
    exp(x[..batch, j] - max[q](x[..batch, q]))
    / sum[k](exp(x[..batch, k] - max[q](x[..batch, q])));

The formula is longer than softmax(x, axis=-1), but it exposes the structure that the compact call hides.
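The indexed formula transcribes directly into array code. The following is a NumPy sketch, not Einlang; the name stable_softmax is illustrative. The comments map each line back to the three coordinates.

```python
import numpy as np

def stable_softmax(x):
    # j is the feature being returned, q scans for the row maximum,
    # k scans for the denominator; all three range over axis -1.
    m = np.max(x, axis=-1, keepdims=True)         # max[q](x[..batch, q])
    e = np.exp(x - m)                             # exp(x[..batch, j] - m)
    return e / np.sum(e, axis=-1, keepdims=True)  # ... / sum[k](e[..batch, k])

x = np.array([[2.0, 1.0, 0.0]])
p = stable_softmax(x)
# each row of p sums to 1
```

The keepdims=True arguments are what implement the "broadcast across the feature being returned" part of the formula: the maximum and the denominator keep a size-1 feature axis so they can be reused by every j.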

The Design Fork: One Axis Name or Several Scopes

A tempting notation would use one feature name everywhere. After all, softmax normalizes over one axis. But the expression has several local jobs: one coordinate names the feature being returned, another scans the features in the denominator, and, in a stable implementation, a third scans the features for the maximum used for numerical stability.

Using one name for all of those jobs makes the formula look simpler while making the scopes harder to see. Using distinct local names makes the formula slightly longer, but it reveals which occurrence is an address of the result and which occurrences are local scans.

Einlang’s choice is to make scope visible. A coordinate name is not only a label for an axis; it is also a statement about where that axis is live.

Compute just one probability by hand. To produce output[b, 7], the formula needs the score at feature 7, the maximum over all features in the same row, and the denominator over all features in the same row. Calling all of those positions “the softmax axis” is true but too coarse. The single output value already contains three scopes.

The Three Roles

The coordinate j is the feature being returned:

output[..batch, j]

The coordinate k is the feature coordinate scanned by the denominator:

sum[k](...)

The coordinate q is the feature coordinate scanned by the maximum:

max[q](...)

All three range over the same conceptual axis, but they have different scopes. That is why they deserve different names. If a formula reused one letter for all three jobs, the reader would have to infer the scopes from punctuation alone.

Why Stability Adds a Role

The maximum is not there to change the mathematical result. It is there to keep the exponentials numerically well behaved:

exp(x[j] - max[q](x[q]))

Subtracting the maximum shifts every feature by the same amount for a fixed batch item. The softmax probabilities stay the same, but the largest exponent becomes exp(0) instead of possibly overflowing.
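The overflow is easy to trigger on purpose. This NumPy sketch uses deliberately large illustrative logits; the errstate context only silences the expected floating-point warnings.

```python
import numpy as np

x = np.array([1000.0, 999.0, 998.0])  # large logits

with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(x) / np.sum(np.exp(x))  # exp(1000) overflows to inf; inf/inf is nan

shifted = x - np.max(x)                              # largest entry becomes 0.0
stable = np.exp(shifted) / np.sum(np.exp(shifted))   # same probabilities, all finite

# naive is full of nan; stable equals softmax of [0, -1, -2]
```

The stable result is identical, in exact arithmetic, to what the naive formula would have produced: the shift cancels between numerator and denominator.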

That numerical trick introduces another local coordinate. The coordinate q does not choose the output feature. It scans all features to find the shift used by every output feature in the same row. In compact API code, this role is usually implicit. In the indexed expression, it is visible.

This is a recurring theme: practical numerical code often has extra structure that the mathematical slogan omits. “Softmax over the last axis” is the slogan. The stable formula has an output coordinate, a denominator coordinate, and a maximum coordinate.

That extra structure has a visible consequence. The stabilizing maximum is shared by every returned feature in the same batch row:

output[b, 0] uses max[q](x[b, q])
output[b, 7] uses max[q](x[b, q])
output[b, 12] uses max[q](x[b, q])

The maximum does not depend on j. It is broadcast across the feature being returned, and the notation shows why.

Coordinate Reading

For one batch item and one feature:

output[b, 7]

the numerator uses x[b, 7]. The denominator scans every x[b, k]. The stabilizing maximum scans every x[b, q]. The batch coordinate is fixed throughout.

The operation is not “broadcasting plus a reduction plus some exponentials” in the abstract. It is a set of coordinate relationships:

j  choose the feature being produced
k  collect all features for the denominator
q  collect all features for the stabilizing maximum

Why This Helps

Softmax bugs often come from normalizing over the wrong axis. In anonymous shape code, the difference between “over features” and “over batch” may be one integer argument.

With visible coordinates, the wrong program looks different:

let bad[b, j] =
    exp(x[b, j]) / sum[bb](exp(x[bb, j]))

This normalizes across batch for each feature. Maybe a program wants that. Most classifiers do not. The coordinate bb makes the mistake visible.
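The two normalizations can be placed side by side. In this NumPy sketch the axis argument plays the role of the coordinate name: axis=1 is sum[k] over features, axis=0 is the mistaken sum[bb] over batch.

```python
import numpy as np

x = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 2.0]])

e = np.exp(x - np.max(x))

good = e / e.sum(axis=1, keepdims=True)  # sum[k]: features compete within an example
bad  = e / e.sum(axis=0, keepdims=True)  # sum[bb]: examples compete per feature

# good rows sum to 1; bad columns sum to 1 -- same shape, different meaning
```

Both results have shape (2, 3), so a shape check passes for both. Only the row sums distinguish them.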

The Shape of the Result

Softmax preserves the feature coordinate. It normalizes values along that coordinate, but it does not remove it:

input   x[..batch, j]
output  output[..batch, j]

This distinguishes softmax from a reduction such as:

let total[..batch] = sum[j](x[..batch, j])

Both expressions scan features. Only one keeps the feature coordinate in the result. In softmax, each feature receives a normalized value. In sum, all features contribute to one value and then disappear.

That distinction is easy to blur when both operations are described as acting “over an axis.” The names make the difference mechanical: if j appears on the left, j survives. If j appears only inside sum[j], j leaves.
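The survive-or-leave rule shows up directly in the result shapes. A NumPy sketch, with an arbitrary 4-by-10 input:

```python
import numpy as np

x = np.random.default_rng(0).random((4, 10))  # x[..batch, j]

e = np.exp(x - x.max(axis=-1, keepdims=True))
soft = e / e.sum(axis=-1, keepdims=True)  # j appears on the left: j survives
total = x.sum(axis=-1)                    # j appears only inside sum[j]: j leaves

# soft.shape == (4, 10); total.shape == (4,)
```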

Connection to Gradients

This chapter also prepares the gradient chapters. Softmax is not elementwise in the feature coordinate. The value at output[j] depends on all x[k] in the same row through the denominator and the maximum. A derivative request must respect that coupling.

The visible coordinates warn the reader before calculus begins. The expression itself says that one output feature reaches across the whole feature row. That is why a softmax Jacobian has interactions among features rather than a purely diagonal elementwise shape.

Which Coordinates Stay Fixed?

The real softmax line turns on one question: which coordinates remain fixed while the denominator is computed? The answer is ..batch. The feature coordinate is being scanned locally. That one distinction is the difference between a per-example normalization and a cross-example one, and it is exactly the kind of distinction a shape tuple cannot explain by itself.

Why Softmax Belongs Here

Softmax belongs here because it combines the previous two ideas. It broadcasts the stabilizing maximum across the feature coordinate being returned, and it reduces the denominator across that same feature role. The coordinate is conceptually one axis, but operationally it appears in several scopes.

That makes softmax a good stress test for the reading method. A simple slogan does not carry enough detail:

softmax over features

The real expression asks the reader to distinguish the feature being produced, the features being summed, and the features being scanned for the maximum. If those roles are confused, the formula may still have plausible shapes while doing the wrong normalization.

A production tensor language needs to handle this kind of ordinary complexity. Softmax is important here because one mathematical axis plays multiple local roles inside a stable numerical formula.

Carry this forward: when one dimension appears more than once in an explanation, ask whether those appearances have the same scope. If they do not, give the scopes distinct names. That small act prevents many shape-correct but meaning-wrong programs.

Softmax also shows why visible dimensions cannot stop at shape checking. The input and output have the same shape, so a pure shape story would say almost nothing happened. But a great deal happened: every output feature was coupled to every input feature in the same row, and the row was normalized into a distribution.

The coordinate names reveal that hidden work. j survives, but it survives after being compared against a denominator built from k and a stabilizing maximum built from q. Same shape, different dependency structure. That is a pattern the later chapters will reuse for gradients and attention.

Pressure Test: Same Shape, Different Dependency Graph

Use one batch row with three logits:

x[b, 0] = 2.0
x[b, 1] = 1.0
x[b, 2] = 0.0

A stable softmax can be read as three coordinate stages:

let m[b] = max[q](x[b, q])
let e[b, k] = exp(x[b, k] - m[b])
let z[b] = sum[k](e[b, k])
let y[b, j] = e[b, j] / z[b]

The maximum uses q because it scans the row to find a stabilizing reference. For this row, m[b] = 2.0. The exponentials are then:

e[b, 0] = exp(0.0)
e[b, 1] = exp(-1.0)
e[b, 2] = exp(-2.0)

The denominator uses k because it scans the row again to build the normalizer:

z[b] = exp(0.0) + exp(-1.0) + exp(-2.0)

Finally j names the output feature being returned:

y[b, 0] = e[b, 0] / z[b]
y[b, 1] = e[b, 1] / z[b]
y[b, 2] = e[b, 2] / z[b]

The three output cells are different, but they share the same m[b] and z[b]. That is the coupling shape of softmax. The result still has coordinate [b, j], yet each y[b, j] depends on every input x[b, k] through the denominator and on every input x[b, q] through the maximum. The hard part is not computing a three-element softmax; it is remembering that same shape does not imply same dependency graph.
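The four stages can be run verbatim for this row. A NumPy sketch of the staged formulas above, for the single batch row [2.0, 1.0, 0.0]:

```python
import numpy as np

x = np.array([2.0, 1.0, 0.0])  # one batch row, three logits

m = x.max()        # m[b]    = max[q](x[b, q]) = 2.0
e = np.exp(x - m)  # e[b, k] = exp(x[b, k] - m[b])
z = e.sum()        # z[b]    = sum[k](e[b, k])
y = e / z          # y[b, j] = e[b, j] / z[b]

# m == 2.0; e == [exp(0), exp(-1), exp(-2)]; y sums to 1
```

Note that m and z are computed once and then shared by all three divisions; that sharing is the broadcast the chapter keeps pointing at.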

This is exactly why reusing one name for every role is misleading. If the formula wrote j for the output coordinate, the denominator coordinate, and the maximum coordinate, the scopes would blur. The expression would look as if one coordinate were doing one job. In reality, one logical feature axis is visited in several local scopes. The names q, k, and j do not describe different physical axes; they describe different binding sites over the same range.

A useful bug check is to ask which coordinates remain fixed while the row is normalized. Batch b is fixed. The distribution is over features. If the formula accidentally scans b instead:

let z[j] = sum[b](e[b, j])

then the program normalizes each feature across examples. That can produce the same output shape [b, j], but it answers a different question. It makes examples compete with one another instead of making features compete inside an example. The coordinate names make that difference visible at the denominator, where the mistake is born.

The derivative story also begins here. Because each y[b, j] depends on all x[b, k] in the same row, the pullback through softmax is not purely elementwise. A change to x[b, 1] can change y[b, 0], y[b, 1], and y[b, 2]. The output shape hides that coupling; the coordinate reading exposes it.
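The coupling can be checked numerically: perturb one logit and watch every output move. A NumPy sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([2.0, 1.0, 0.0])
y = softmax(x)

x2 = x.copy()
x2[1] += 0.1          # perturb only x[b, 1]
y2 = softmax(x2)

changed = ~np.isclose(y, y2)
# all three entries change, not just index 1
```

An elementwise operation would change only index 1. Softmax changes all three because the perturbation flows through z[b] and, for large enough perturbations, through m[b].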

What Same Shape Hides

Softmax is a useful stress test because it preserves rank and extents:

input   x[b, j]
output  y[b, j]

A pure shape story might describe this as an elementwise operation from a matrix to a matrix. That would be false. The coordinate j survives into the answer, but the value at one j was computed using all feature positions in the row. The dependency structure is row-global even though the output address is local.

That distinction matters to optimization and differentiation. A backend may fuse the maximum, exponentials, sum, and division into one stable kernel, but it must preserve the fact that the normalization scope is per b and over feature positions. A differentiator must preserve the same coupling when sensitivities flow backward. A test that checks only output shape would miss the central contract.

So the chapter’s practical advice is: whenever an operation returns the same shape it received, do not assume it was coordinatewise. Ask which other coordinates each output cell inspected before it returned.

A final concrete check is to compare softmax with a true elementwise sigmoid:

let s[b, j] = 1 / (1 + exp(-x[b, j]))

Here s[b, j] reads only x[b, j]. No k or q scans the row. The output shape matches the input shape, just as softmax does, but the dependency structure is different. A change to x[b, 1] changes s[b, 1] only; it can change every y[b, j] in a softmax row. Same shape, different graph. That is why this chapter spent so much time on scopes instead of only on extents.
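The contrast is checkable with the same perturbation trick. In this NumPy sketch, only position 1 of the input changes; sigmoid moves one output, softmax moves all of them.

```python
import numpy as np

def softmax_row(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x  = np.array([2.0, 1.0, 0.0])
x2 = np.array([2.0, 1.5, 0.0])  # change only position 1

soft_moved = ~np.isclose(softmax_row(x), softmax_row(x2))
sig_moved  = ~np.isclose(sigmoid(x), sigmoid(x2))

# softmax: all three entries move; sigmoid: only entry 1 moves
```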

Once the scopes are clear, the stable formula no longer looks like an implementation trick. It becomes a sequence of coordinate claims: choose a row maximum, exponentiate each feature against it, sum the row, and return one normalized coordinate.