Chapter 4
What Does Broadcasting Hide?
Broadcasting is one of the great conveniences of array programming:
a = torch.randn(16, 1, 64)
b = torch.randn(1, 32, 64)
c = a + b
The result has shape (16, 32, 64). The second dimension of a expands. The
first dimension of b expands. Most tensor programmers can read that after a
moment.
But the pause is interesting. For c[3, 5, :], which part of a is used?
Which part of b?
a[3, 0, :]
b[0, 5, :]
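The same addressing rule can be checked directly. This sketch uses NumPy in place of torch (the broadcasting convention is the same); the shapes are taken from the example above, and the random seed is arbitrary:

```python
import numpy as np

# Shapes from the chapter's example: a is (16, 1, 64), b is (1, 32, 64).
rng = np.random.default_rng(0)
a = rng.standard_normal((16, 1, 64))
b = rng.standard_normal((1, 32, 64))

c = a + b  # broadcasts to shape (16, 32, 64)

# The convention recovers the coordinates: singleton axes are pinned to 0.
assert c.shape == (16, 32, 64)
assert np.allclose(c[3, 5, :], a[3, 0, :] + b[0, 5, :])
```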
The answer is not difficult. That is part of the charm of broadcasting. But it
is not written in the expression a + b; it is recovered from a convention
about singleton dimensions and shape alignment. Broadcasting keeps the code
short by moving the coordinate story into a rule the reader already knows.
That trade is often worth it. It is also an excellent place for a quiet mistake to hide.
The Design Fork: Implicit Expansion or Visible Absence
Broadcasting offers a tempting design: let singleton dimensions expand silently, and trust shape rules to reject impossible cases. This is concise, and for many programs it is exactly what the programmer wants. The weakness is that the source does not say which coordinate was absent by design.
A stricter design could forbid broadcasting entirely. That would make every reuse explicit, and it would also make ordinary tensor formulas unpleasantly heavy. The useful middle ground is to make absence visible in the coordinate story without forcing every expanded element to be written.
Einlang’s notation follows that middle path. A term that lacks j is not an
accident of layout; it is a value that does not depend on j. The design
principle is the same as before: if the missing coordinate matters to later
checking or differentiation, the source should not hide it inside a singleton
dimension.
Every singleton axis should raise a small question: is this value truly independent of that coordinate, or did we merely arrange memory so the library would accept it? A feature bias is independent of batch. A per-example offset is not. Both can be stored as vectors. The coordinate expression is where the distinction becomes hard to miss.
Missing Coordinates
An index-oriented version can make the absence explicit:
let a[i, ~, k] = ...
let b[~, j, k] = ...
let c[i, j, k] = a[i, k] + b[j, k]
The marker ~ denotes an intentionally missing coordinate. It says that a
host layout may contain a singleton dimension, but the value’s meaning does not
depend on that coordinate.
In the expression for c, the term a[i, k] depends on i and k, but not
on j. The term b[j, k] depends on j and k, but not on i. So for one
concrete output coordinate:
c[3, 5, k] = a[3, k] + b[5, k]
There is no hidden singleton axis to remember. Broadcasting becomes a property
of the expression’s free coordinates. If a term does not mention j, it cannot
possibly care which j you picked.
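One plausible NumPy rendering of the indexed form makes the same point: the arrays themselves carry no singleton axes, and the absent coordinate is reintroduced explicitly at the point of use. The variable names here mirror the Einlang lines and are chosen for this sketch:

```python
import numpy as np

I, J, K = 16, 32, 64
rng = np.random.default_rng(0)
a = rng.standard_normal((I, K))  # a[i, k] -- no j anywhere in the value
b = rng.standard_normal((J, K))  # b[j, k] -- no i anywhere in the value

# c[i, j, k] = a[i, k] + b[j, k], with the missing axes inserted visibly
c = a[:, None, :] + b[None, :, :]

assert c.shape == (I, J, K)
assert np.allclose(c[3, 5], a[3] + b[5])
```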
Bias Has a Job
Broadcasting bugs are usually not rank bugs. They are role bugs. A tensor may have a shape that permits expansion while still expanding along the wrong meaning.
Consider:
y = x + bias
If x has logical shape [batch, feature], then a feature bias is:
let y[b, f] = x[b, f] + bias[f]
A batch bias is a different program:
let y[b, f] = x[b, f] + bias[b]
Both can produce a two-dimensional result. Only one matches the intended model. The indexed form forces the decision to appear where the addition is written.
Read the two concrete points:
y[3, 5] = x[3, 5] + bias[5] feature bias
y[3, 5] = x[3, 5] + bias[3] batch bias
A bias vector is more than “a vector.” It is a vector with a job. The index tells you what job it is doing.
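The two jobs can be written out in NumPy. Both lines typecheck; they disagree numerically, which is the point (shapes and names are illustrative):

```python
import numpy as np

B, F = 8, 6
rng = np.random.default_rng(1)
x = rng.standard_normal((B, F))
feature_bias = rng.standard_normal(F)  # one offset per feature
batch_bias = rng.standard_normal(B)    # one offset per example

y_feature = x + feature_bias           # y[b, f] = x[b, f] + bias[f]
y_batch = x + batch_bias[:, None]      # y[b, f] = x[b, f] + bias[b]

assert np.allclose(y_feature[3, 5], x[3, 5] + feature_bias[5])
assert np.allclose(y_batch[3, 5], x[3, 5] + batch_bias[3])
```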
This is also why broadcasting is a semantic choice, not a rank trick. If
bias has length 64, it could match a feature coordinate of
length 64, a time coordinate of length 64, or a class coordinate of length
64. The size alone cannot choose among those stories:
x[b, feature] + bias[feature]
x[b, time] + bias[time]
x[b, class] + bias[class]
The notation asks the addition to say which reuse it means.
This separates logical absence from physical layout. A backend may still store
bias[feature] as a tensor with a singleton batch dimension because that
layout is convenient for a kernel. The source-level claim is different:
bias does not depend on b. Treating those as separate facts gives the
compiler freedom to choose a layout without pretending that a singleton axis
has semantic content.
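A small sketch of that separation: two physical layouts for the same logical fact "bias does not depend on b." Numerically they are indistinguishable, which is exactly why the singleton axis carries no semantic content of its own:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 64))
bias_flat = rng.standard_normal(64)        # shape (64,)
bias_singleton = bias_flat.reshape(1, 64)  # shape (1, 64), same values

# Both layouts express "constant along b"; the results are identical.
assert np.array_equal(x + bias_flat, x + bias_singleton)
```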
Standard Library Sighting: Reduction
The reduction operators in stdlib/ml/reduction_ops.ein show the same
discipline in a small production example:
let result[..batch] = sum[j](x[..batch, j]);
This is reduce_sum. The ordinary description would be “reduce over the last
axis.” The indexed description says more. The rest pattern ..batch survives.
The local coordinate j is introduced by sum and consumed inside the
reduction body. The output keeps the batch-shaped part and loses the feature
coordinate.
For a concrete row:
result[b] = x[b, 0] + x[b, 1] + ... + x[b, n - 1]
The coordinate j walks across the row, fills the sum, and leaves. Only the
batch label remains on the result.
This is also the first place where reduction and broadcasting meet. A value
that no longer has j may later be used in an expression that does have j.
If that happens, it will be invariant along j because j was consumed.
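In NumPy terms, the `..batch` rest pattern corresponds to "all leading axes," and consuming `j` is a sum over the last axis, whatever the batch rank happens to be. A sketch, with arbitrary small shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
x2 = rng.standard_normal((5, 7))     # one batch axis, feature length 7
x3 = rng.standard_normal((4, 5, 7))  # two batch axes, same feature length

r2 = x2.sum(axis=-1)  # result[..batch] keeps the batch tail, loses j
r3 = x3.sum(axis=-1)

assert r2.shape == (5,) and r3.shape == (4, 5)
assert np.allclose(r2[1], sum(x2[1, j] for j in range(7)))
```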
Standard Library Sighting: Softmax
softmax in stdlib/ml/activations.ein is a denser example:
let output[..batch, j] =
exp(x[..batch, j] - max[q](x[..batch, q]))
/ sum[k](exp(x[..batch, k] - max[q](x[..batch, q])));
There are three coordinate roles in this one expression. The coordinate j is
the feature being returned. The coordinate k is local to the denominator. The
coordinate q is local to the maximum used for numerical stability. The batch
tail stays untouched through all of it.
That separation matters. If all three roles were described only as “the last axis,” the reader would have to keep their scopes in memory. Here the scopes are written into the formula:
j the feature we are computing
k the feature axis scanned by the denominator
q the feature axis scanned by the maximum
The names are small, but they make the normalization legible. Anonymous axes invite accidental swaps; named local coordinates make the swaps easier to see.
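The same three roles appear in a NumPy rendering of the formula. This is a sketch of the standard numerically stable softmax, not the stdlib implementation itself; the comments mark where each local coordinate lives:

```python
import numpy as np

def softmax(x, axis=-1):
    # q's role: the maximum scanned for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    # k's role: the feature axis scanned by the denominator.
    # j's role: the feature axis of the returned value.
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
x = rng.standard_normal((3, 5))
p = softmax(x)
assert np.allclose(p.sum(axis=-1), 1.0)  # each row normalizes to 1
assert np.all(p > 0)
```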
What Does a Missing Coordinate Mean?
The softmax expression is a useful place to notice how much is carried by
absence. Its three appearances of x mention different local coordinates:
x[..batch, j]
x[..batch, k]
x[..batch, q]
The whole chapter is in that difference. Broadcasting is not a separate mechanism. It is what happens when a value does not depend on a coordinate that the surrounding expression still has. The important question is not only which coordinate is written, but which coordinate a term is allowed to ignore.
The Pair With Reduction
Broadcasting and reduction are paired ideas. Broadcasting is what happens when a coordinate is absent from a term but present in the surrounding result. Reduction is what happens when a coordinate is present locally but absent from the surrounding result.
Put them side by side:
let y[b, f] = x[b, f] + bias[f]
let total[b] = sum[f](x[b, f])
In the first line, bias[f] omits b, so it is reused for every batch item.
In the second line, f is introduced by sum and consumed, so it does not
survive into total.
This pairing matters in the gradient chapters. A value broadcast in the forward pass usually receives a summed gradient in the backward pass. A coordinate consumed in the forward pass often reappears in the shape of a parameter gradient. The source notation makes those relationships easier to predict because it shows which coordinates were absent and which were local.
Every expression can now be read with two questions instead of one. Which coordinates are present? Which coordinates are deliberately absent? The second question is what broadcasting usually hides, and it is often the question that finds the bug.
Pressure Test: The Wrong Shared Parameter
Take a matrix of activations and two possible biases:
let x[b, f] = ...
let feature_bias[f] = ...
let batch_bias[b] = ...
Adding the feature bias is:
let y[b, f] = x[b, f] + feature_bias[f]
The term feature_bias[f] omits b, so the same feature bias is used for
every example. Read two cells:
y[0, 3] = x[0, 3] + feature_bias[3]
y[7, 3] = x[7, 3] + feature_bias[3]
The coordinate b changed, but the bias address did not. That is the precise
meaning of the broadcast.
Adding the batch bias is different:
let y[b, f] = x[b, f] + batch_bias[b]
Now batch_bias[b] omits f, so every feature in the same example receives
the same offset:
y[7, 0] = x[7, 0] + batch_bias[7]
y[7, 3] = x[7, 3] + batch_bias[7]
Both programs produce a tensor addressed by [b, f]. Both are valid if the
model wants that behavior. They are not interchangeable. A positional
broadcasting system may encode the difference as a vector of shape [F]
versus a vector reshaped to [B, 1]. That works, but the role is implicit in
where the singleton dimension was inserted. The named form says the role at
the use site.
This is where the example stops being a reminder about broadcasting and starts
being a real failure mode. If batch_count and feature_count are both 128,
the wrong vector can have the right length. A shape-only error message may not
appear at all. A visible-coordinate reading still catches the mistake because
the suspicious line is local:
feature_bias[b]
That address says the feature bias is being indexed by examples. Even if the lengths match, the role does not.
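The failure mode is easy to reproduce. With `batch_count == feature_count`, the misindexed bias has the right length, so no shape error appears; only the values differ (names and sizes here are illustrative):

```python
import numpy as np

B = F = 128  # the dangerous case: both extents are 128
rng = np.random.default_rng(5)
x = rng.standard_normal((B, F))
feature_bias = rng.standard_normal(F)

y_right = x + feature_bias           # bias indexed by feature
y_wrong = x + feature_bias[:, None]  # same vector indexed by example

# Both produce a (128, 128) result; only comparing values reveals the bug.
assert y_right.shape == y_wrong.shape == (B, F)
assert not np.allclose(y_right, y_wrong)
```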
The same audit explains the reverse direction in gradients. If a scalar loss depends on:
let y[b, f] = x[b, f] + feature_bias[f]
then the gradient of feature_bias[f] must collect all batch routes:
let dfeature_bias[f] = sum[b](dy[b, f])
The missing coordinate in the forward expression becomes a reduced coordinate
in the backward expression. That is not a special trick of biases. It is the
coordinate accounting forced by reuse. A value reused along b receives
sensitivity from every b.
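That accounting can be verified numerically. For a loss of the form `sum(w * y)` (an arbitrary weighting chosen for this sketch), `dy` is simply `w`, and the analytic gradient `sum[b](dy[b, f])` matches a finite-difference check:

```python
import numpy as np

B, F = 8, 6
rng = np.random.default_rng(6)
x = rng.standard_normal((B, F))
feature_bias = rng.standard_normal(F)
w = rng.standard_normal((B, F))  # arbitrary sensitivity per output cell

def loss(bias):
    return float((w * (x + bias)).sum())

dy = w                     # dL/dy for this loss
analytic = dy.sum(axis=0)  # dfeature_bias[f] = sum over b of dy[b, f]

# Central finite differences, one feature coordinate at a time.
eps = 1e-6
numeric = np.array([
    (loss(feature_bias + eps * np.eye(F)[f])
     - loss(feature_bias - eps * np.eye(F)[f])) / (2 * eps)
    for f in range(F)
])
assert np.allclose(analytic, numeric, atol=1e-4)
```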
So the concrete question for any broadcast is not “did the library expand a singleton dimension?” The better question is:
Which coordinate does this term not mention?
That question tells you where the value is constant, where mistakes can hide, and which coordinate will later be collected if a derivative flows backward.
What Static Checking Can Actually See
In the explicit form, broadcasting is not an afterthought. The compiler sees which indices a term uses:
x[b, f] uses b and f
feature_bias[f] uses f
batch_bias[b] uses b
That lets the checker separate three questions. First, are the addressed values rectangular and indexable? Second, do the index ranges agree with the array extents? Third, which result coordinates are absent from each term?
The third question is the semantic one. If feature_bias[f] appears in a
result addressed by [b, f], the absent coordinate is b; the value is
constant along batch. If batch_bias[b] appears, the absent coordinate is
f; the value is constant along features. This is exactly the fact a
positional broadcasting rule reconstructs from singleton dimensions.
Making absence visible also improves explanations. A good error or review
comment can say “this term does not depend on b” rather than “axis 0 was
broadcast.” The first sentence names the model role. The second names a layout
position. Both may be true, but the role sentence is the one that helps find a
semantic bug.
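A toy version of that audit fits in a few lines. The term syntax and function names below are invented for this sketch; the point is only that absent coordinates are computable by set difference once index use is visible:

```python
import re

def indices(term):
    # Parse a term like "feature_bias[f]" into (name, {indices}).
    name, inside = re.fullmatch(r"(\w+)\[([^\]]*)\]", term).groups()
    return name, {i.strip() for i in inside.split(",")}

def absent_coords(result_indices, term):
    # Report which result coordinates this term does not mention.
    name, used = indices(term)
    return name, sorted(set(result_indices) - used)

result = {"b", "f"}
print(absent_coords(result, "x[b, f]"))          # ('x', [])
print(absent_coords(result, "feature_bias[f]"))  # ('feature_bias', ['b'])
print(absent_coords(result, "batch_bias[b]"))    # ('batch_bias', ['f'])
```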
Explicit absence deserves a place beside explicit presence. A coordinate can matter precisely because one term does not mention it.
Where Clauses as Classified Facts
Broadcasting is not the only place where a coordinate may be constrained
without becoming part of the result. A where clause can introduce a local
binding or a guard:
let output[i, j] = activated
where z = sum[k](input[i, k] * weight[k, j]) + bias[j],
activated = if z > 0.0 { z } else { 0.0 };
Here z and activated are not output coordinates. They are local facts used
while computing each [i, j] cell. A different where clause can act as a
filter:
let upper[i, j] = matrix[i, j] where i <= j;
That line still defines a rectangular family addressed by [i, j], but the
guard controls which cells receive the expression value and which cells receive
the default.
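One way a backend might realize the guarded definition: evaluate the expression everywhere, then let the guard select between the value and a default (zero is assumed here; the source would determine the actual default):

```python
import numpy as np

n = 4
rng = np.random.default_rng(7)
matrix = rng.standard_normal((n, n))

i = np.arange(n)[:, None]  # row coordinate i
j = np.arange(n)[None, :]  # column coordinate j
upper = np.where(i <= j, matrix, 0.0)  # guard: i <= j

assert upper[0, 3] == matrix[0, 3]  # guard holds: cell keeps its value
assert upper[3, 0] == 0.0           # guard fails: cell gets the default
```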
The compiler has to distinguish those cases before range and shape reasoning
can remain sound. Constraint classification separates binding-like constraints
from guard-like constraints and orders bindings by dependency. A binding such
as activated = ... z ... depends on the earlier binding z; a guard such as
i <= j does not introduce a reusable value, but it does affect coverage and
execution.
Here the indexed style grows beyond axis names. The source can also classify
local facts: derived values, predicates, and guards. The compiler can then keep
output shape, local computation, and filtered coverage in separate buckets
instead of treating every where item as a string of condition text.