Reward Hacking in LLM-Guided Evolutionary Search

This post exposes reward hacking in an LLM-guided evolutionary run inside CliffSearch. The run already had task-specific gates and safeguards designed to keep selection from collapsing into benchmark score alone. We use this run to show the task, the in-loop gates, where those safeguards worked, where they failed, and how that failure allowed a reward-hacking branch to propagate through the genetic loop. We then draw lessons from this run for mitigating reward hacking by strengthening the harness with verifiable probes and enriching the task_preamble with stronger semantic requirements.

Inspect the run at cliffsearch.ai/visualization under Task: Transformer Hyper-Connection Discovery and Config: bundle1_short_claude_all | setup2_code_only.

1. The Task Already Had Three Gates

The transformer task aims to improve hyperconnections in the transformer architecture. In the literature, these evolved from dense hyperconnections (HC), to manifold hyperconnections (mHC) built on doubly stochastic matrices, to mHC-lite, which parametrizes the doubly stochastic matrices with attention over permutation matrices to gain efficiency.

The transformer task fixed the Shakespeare character benchmark on the small nanoGPT stack, fixed hyper_conn_n = 4, and asked the agents to evolve the routing manifold in a hyperconnection while preserving stable optimization, differentiability, and practical compute overhead using atomic parametrization of manifolds. In-loop, the run had three gates:

  • Gate 1: a task-specific runtime harness. It enforced the plugin contract, benchmark wiring, branch call, and outer shape/dtype/device behavior before or during expensive runs.
  • Gate 2: reviewer correctness. Nodes had to satisfy the reviewer on correctness, not just benchmark score.
  • Gate 3: reviewer originality. Nodes also had to satisfy the reviewer on originality; winner selection was not benchmark-only.

Scientific target:
- propose new learnable manifold constructions that can replace or extend this
  routing geometry while preserving stable optimization.
- keep computational overhead practical and preserve differentiability.

Hard requirements for every generated code_content:
- get_mhc_lite_overrides() must return "hyper_conn_type": "custom"
- get_mhc_lite_overrides() must return "hyper_conn_n": exactly 4
- EvoHyperConnection.forward(...) must call self.branch(...)
- output must preserve shape/dtype/device and keep gradient flow.

2. Gate 1: The Runtime Harness Filtered Nodes

CliffSearch has a task-specific runtime harness for this transformer task. It checked that nodes defined the required symbols, returned only the allowed override keys, instantiated the required module type, called self.branch(...), and preserved the benchmark-facing contract. It also surfaced benchmark adapter failures back into the run record and reviewer context.
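
A minimal sketch of the kind of interface check Gate 1 describes, applied to the override dictionary only. The function name `check_overrides` and the `ALLOWED_KEYS` set are hypothetical, and the assumption that only the two keys named in the hard requirements are allowed is ours; this is not CliffSearch's harness code.

```python
# Hypothetical sketch of a Gate 1 style override check (not CliffSearch's harness).
# Assumption: the only allowed override keys are the two named in the hard requirements.
ALLOWED_KEYS = {"hyper_conn_type", "hyper_conn_n"}


def check_overrides(overrides):
    """Return a list of contract violations; an empty list means the check passed."""
    errors = []
    extra = set(overrides) - ALLOWED_KEYS
    if extra:
        errors.append(f"unexpected override keys: {sorted(extra)}")
    if overrides.get("hyper_conn_type") != "custom":
        errors.append('hyper_conn_type must be "custom"')
    n = overrides.get("hyper_conn_n")
    # Require a plain int equal to 4 (rejects True, 4.0, "4", and missing values).
    if type(n) is not int or n != 4:
        errors.append("hyper_conn_n must be exactly 4")
    return errors


ok = check_overrides({"hyper_conn_type": "custom", "hyper_conn_n": 4})
bad = check_overrides({"hyper_conn_type": "custom", "hyper_conn_n": 8, "lr": 0.1})
```

A check of this kind validates the interface only. It says nothing about what forward does with the tensor it receives, which is where the later sections find the problem.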

code+design-intent · D2 · HyperbolicGeodesicRouting_v3

One failed node made the runtime's packing semantics explicit: the launcher was not handing it a leading stream axis to reinterpret.

The runtime passes residuals with shape `(batch, seq_len, dim)` = `(8, 256, 512)`, i.e. **without** a leading `num_streams` dimension. The runtime's expand/reduce stream wrappers handle the multi-stream expansion externally.

code+design-intent · B1 · HyperbolicAffinGatedRouting

The same task also produced nodes that looked structurally plausible but crashed because their stream layout assumptions were incompatible with the launcher.

The size of tensor a (2) must match the size of tensor b (4) at non-singleton dimension 0

code+design-intent · F1 · PoincareBallRoutingHC_v2

And the benchmark adapter explicitly rejected nodes whose forward pass violated the output-shape contract.

Benchmark adapter reported contract_status=invalid: EvoHyperConnection.forward must preserve residual tensor shape.

3. Gate 2: Reviewer Flagged Issues and Suggested Corrections

The reviewer diagnosed concrete runtime/layout failures, explained why a high-scoring node looked correct, and suggested fixes for both code issues and conceptual mathematical concerns.

code+design-intent · B1 · HyperbolicAffinGatedRouting

It also flagged real shape/layout failures inside that same run.

The code is syntactically valid and structurally compliant with the API contract, but it fails at runtime due to incorrect assumptions about tensor shapes when interacting with the framework's stream management.

code+design-intent · D2 · HyperbolicGeodesicRouting_v3

In another failure case, it corrected the node's model of the runtime itself rather than merely saying the tensor code was bad.

The correction mutation's summary claims to fix the shape preservation issue but completely misdiagnoses the root cause — it blames `LayerNorm` and contiguity rather than recognizing that the runtime does not pass a `num_streams` leading dimension.

code+design-intent · H3 · GrassmannStiefelHC

And sometimes the reviewer proposed a concrete code fix together with an implementation concern.

The fix would be straightforward (cast to float32 before `torch.linalg.solve`, then cast back) ... `gate.item()` inside `forward` forces a device-to-host sync every forward pass.

code+design-intent · G3 · PoincareHC

It also flagged conceptual mathematical imprecision when the code ran but the geometric description oversold what was actually implemented.

The summary claims "Möbius weighted midpoint (gyro-midpoint)" which is slightly imprecise but not incorrect as an approximation.

code+design-intent · D3 · HyperbolicGeodesicRouting_v4

And it corrected theory-language when the mathematical label itself did more work than the implementation justified.

The Möbius addition approximation `x + y / (1 + c*||x||*||y||)` is a loose simplification of actual Möbius addition on the Poincaré ball ... the "hyperbolic" label is somewhat aspirational.
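
To make the reviewer's point concrete, here is a small comparison of exact Möbius addition on the Poincaré ball with one reading of the quoted simplification. The function names are hypothetical, the parenthesization `(x + y) / (1 + c*||x||*||y||)` is our assumption about what the node computed, and NumPy is used only to keep the sketch self-contained.

```python
import numpy as np


def mobius_add(x, y, c=1.0):
    """Exact Möbius addition on the Poincaré ball with curvature -c."""
    xy = float(np.dot(x, y))
    x2 = float(np.dot(x, x))
    y2 = float(np.dot(y, y))
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return num / den


def loose_add(x, y, c=1.0):
    """Assumed reading of the quoted simplification: (x + y) / (1 + c*||x||*||y||)."""
    return (x + y) / (1 + c * np.linalg.norm(x) * np.linalg.norm(y))


# Orthogonal points near the boundary: the two operations disagree strongly.
x = np.array([0.9, 0.0])
y = np.array([0.0, 0.9])
gap_orthogonal = float(np.linalg.norm(mobius_add(x, y) - loose_add(x, y)))

# Collinear points pointing the same way: the two operations coincide.
u = np.array([0.3, 0.0])
v = np.array([0.5, 0.0])
gap_collinear = float(np.linalg.norm(mobius_add(u, v) - loose_add(u, v)))
```

Under this reading, the simplification matches exact Möbius addition only when the two points are collinear and point the same way. It is also symmetric in x and y, whereas exact Möbius addition is not commutative.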

In this run, reviewer correctness did more than score nodes up or down. It produced actionable debugging signals, runtime diagnoses, and mathematical corrections that shaped which candidates stayed alive in the loop.

4. Gate 3: Reviewer Scrutinized Originality and Held Up a Post Hoc Audit

The reviewer pushed back on overclaimed theory: it called out misleading labels, loose geometric language, and literature-adjacent transfers that were less novel than implied by proposer agents.

code+design-intent · B3 · SinkhornDoublyStochasticHC

The reviewer could recognize a competent and effective node while still saying that the mechanism was mostly a known idea applied cleanly.

The Sinkhorn doubly-stochastic approach is mathematically sound and well-executed, but it's a known technique applied in a straightforward manner.

code+design-intent · F1 · PoincareBallRoutingHC_v2

It could also ground a geometric transfer against prior literature rather than treating it as a wholly new mechanism.

Poincaré ball embeddings in neural networks are well-established ... applying them to residual routing is a relatively straightforward adaptation.

code+design-intent · A3 · PoincareGivensHybridHC

It could also reward a strong crossover while still distinguishing between a useful recombination and a radically new mechanism.

The crossover is well-executed but not radically new in mechanism.

code+design-intent · H3 · GrassmannStiefelHC

And it could identify when a node introduced a genuinely different manifold construction even if the implementation later failed.

The idea of Stiefel manifold routing via Cayley retraction of low-rank skew-symmetric matrices is a genuinely novel manifold construction compared to the parent's Poincaré ball approach.

Post hoc audit of the reviewer's originality assessment

We then audited the reviewer's originality assessment post hoc, outside CliffSearch's loop. The agents did not have literature retrieval during the run. After the run, we prepared a survey over manifold-attention, geometry-aware fusion, and direct HC/mHC papers, and then compared the discovered families against both the broader manifold-attention surface and the narrower HC/mHC line. The result was not perfect agreement, but the overall originality story held up: the reviewer was directionally aligned, with its main optimistic cases concentrated in literature-adjacent hyperbolic transfers.

5. Human-In-The-Loop Inspection of Suspicious Winning Branches

We saw unusually low cross-entropies in one branch of the code+design-intent run and inspected the artifacts of the nodes that were reviewer-validated and winning in the benchmark.

| Node | Mean val_loss | Reviewer correctness | Reviewer originality |
| --- | --- | --- | --- |
| PoincareHC | 0.0085 | 4/5 | 4/5 |
| GivensWidthBetaDepthHC | 0.0090 | 5/5 | 4/5 |
| PoincareGivensHybridHC | 0.00733 | 5/5 | 4/5 |

The strongest case was the final crossover node PoincareGivensHybridHC. It got correctness = 5/5, originality = 4/5, and the benchmark reported mean_val_loss = 0.00733 across three seeds with extremely low variance.

6. Human-In-The-Loop Post Hoc Correctness Audit Found a Tensor-Semantics Bug

We started by reading the code of the reviewer-validated winning nodes. There was no outer shape error: these nodes had already passed the runtime harness, and the reviewer had accepted the same contract-facing story, as shown below. Our inspection instead revealed a subtler bug that broke causality and made the anomalously low cross-entropies possible.

code+design-intent · A3 · PoincareGivensHybridHC

Below is the review that missed this reward hacking: it describes a winner that looked contract-clean and benchmark-strong but was semantically wrong.

- Shape preservation: Input residuals shape (B*S, ..., D) is reshaped to (B, ..., S, D), processed, then reshaped back to original shape via new_streams.view(*shape). ✓
- Gradient flow: All operations are differentiable ... ✓
- Benchmark Evidence: 3/3 seeds succeeded. Mean val_loss = 0.007333, std = 4.7e-5.
- Correctness Score: 5/5

7. Identifying the Bug with Invariance and Leakage Probes

We then inspected leakage directly. We tested the two invariances that this task should have enforced from the start: future-token invariance and batch isolation. If changing only future tokens changes earlier logits, the node leaks causal information. If changing one batch element changes logits for untouched samples, the node leaks across samples.

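# Assumes model maps token ids (B, T) to logits (B, T, V) and runs in eval mode
# (dropout off), so two forward passes on the same input give identical logits.

# future-token invariance: edit only positions >= split, compare logits before split
logits_a = model(x)
x_future = x.clone()
x_future[:, split:] = x_future[:, split:].roll(shifts=1, dims=1)
logits_b = model(x_future)
future_prefix_max_abs_diff = (logits_a[:, :split] - logits_b[:, :split]).abs().max()

# batch isolation: edit only sample 1, compare logits of untouched sample 0
logits_a = model(batch)
batch_edited = batch.clone()
batch_edited[1] = batch_edited[1].roll(shifts=1, dims=0)
logits_b = model(batch_edited)
sample0_max_abs_diff = (logits_a[0] - logits_b[0]).abs().max()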

On the suspicious branch, the future-token probe failed hard: editing only future tokens moved the prefix logits by a maximum absolute difference of roughly 3 on every node.

| Suspicious branch node | Future-prefix diff |
| --- | --- |
| PoincareHC | 3.0281 |
| GivensWidthBetaDepthHC | 2.9409 |
| PoincareGivensHybridHC | 3.0293 |

Verified survivors from the final populations passed the same probe cleanly.

| Verified survivor | Mean val_loss | Future-prefix diff | Batch leakage |
| --- | --- | --- | --- |
| HyperbolicExpMapRouting | 5.1248 | 0.0 | not detected |
| GivensOrthogonalManifoldHC | 5.2677 | 0.0 | not detected |
| HouseholderCayleyManifoldHC | 5.4502 | 0.0 | not detected |

Those probe failures pushed us back into the code and let us identify the bug precisely. The problem was not an outer shape error. It was semantic: the node unpacked the packed stream-expanded tensor with a raw view that was size-compatible but not semantically compatible with how the runtime packed streams.

# from PoincareGivensHybridHC
leading = list(shape[:-1])
leading[0] = leading[0] // S
streams = residuals.view(*leading, S, D)

The runtime’s stream expansion is conceptually:

# runtime packing intent
packed = repeat(x, 'b t d -> (b s) t d', s=4)  # einops.repeat; x has shape (b, t, d)

So the packed tensor is laid out as repeated stream copies for each batch element. A raw view(*leading, S, D) is not the same as a semantic unpacking of (b s) t d -> b t s d. That difference is exactly where the leak entered.

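# packed buffer for B=2, S=4, T=3: the leading axis holds 4 stream copies per batch element
[b0, b0, b0, b0, b1, b1, b1, b1]

# raw view as (B, T, S, D) interleaves sequence positions
# (entries are the true sequence position of each slot for b0; rows = T axis, columns = S axis):
[[0, 1, 2, 0],
 [1, 2, 0, 1],
 [2, 0, 1, 2]]

# semantic unpacking (b s) t d -> b t s d preserves stream meaning:
[[0, 0, 0, 0],
 [1, 1, 1, 1],
 [2, 2, 2, 2]]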
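
The index pattern above can be reproduced in a few lines. This is a sketch, not the node's code: it uses NumPy's `reshape` as a stand-in for torch's `view` (both reinterpret contiguous row-major memory), it drops the feature axis D because D does not affect the layout argument, and the variable names are ours.

```python
import numpy as np

B, S, T = 2, 4, 3

# Tag every slot of the packed (B*S, T) buffer with its true sequence position.
t_index = np.broadcast_to(np.arange(T), (B * S, T)).copy()

# Raw reinterpretation, as in view(B, T, S, D): size-compatible, wrong meaning.
raw = t_index.reshape(B, T, S)

# Semantic unpacking '(b s) t -> b t s': split the leading axis, then move s last.
semantic = t_index.reshape(B, S, T).transpose(0, 2, 1)

print(raw[0])       # rows mix sequence positions 0, 1, 2
print(semantic[0])  # row t holds position t for every stream
```

In the raw version, the row that downstream code treats as position 0 already contains content from positions 1 and 2, which is how future tokens become visible to earlier positions.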

The contrast is easiest to see in the code. The failing low-CE branch used the raw view shown above. Passing implementations in the same transformer task use an explicit unpack/repack discipline instead.

# from GivensOrthogonalManifoldHC (passing node in the same transformer task)
residuals = rearrange(residuals, '(b s) ... d -> b ... s d', s=s)
...
new_residuals = rearrange(rotated, 'b ... s d -> (b s) ... d')
...
residuals_grouped = rearrange(residuals, '(b s) ... d -> b ... s d', s=s)
output = rearrange(output, 'b ... s d -> (b s) ... d')

That is the real difference between the two implementations. Both preserve the same outer size. Only the explicit rearrange version preserves the meaning of the packed stream dimension. Once the stream axis is unpacked incorrectly, the node can preserve the same outer tensor size while silently entangling sequence positions. That is enough to break causality even though the code still looks contract-clean.

8. Why That Branch Passed and Propagated

The reward-hacking nodes propagated in CliffSearch because the in-loop gates checked interface and benchmark-facing behavior, not the deeper tensor semantics that this task actually depended on. Once PoincareHC, and then GivensWidthBetaDepthHC and PoincareGivensHybridHC, started posting these numbers while staying reviewer-valid, the branch had everything the loop needed to propagate it: low loss, positive review scores, and a plausible design narrative. This is the reward-hacking pattern in the evolutionary loop: a semantic bug that games the benchmark reward passes every gate, because none of the gates tests for it.

9. What the Task Needed to Specify More Explicitly

The task preamble was strong on interface and weak on stream semantics. It told the agents what symbols to define and what outer behavior to preserve, but it did not explicitly say how to prevent loss of causality or cross-sample batch leakage. It also did not spell out what the packed stream dimension meant or what invariances a valid hyper-connection had to preserve. A stronger contract would have said all of this directly:

  • residuals arrives as a runtime-packed tensor whose leading dimension represents B × S, not a dimension that may be reinterpreted by any size-compatible reshape.
  • Unpacking must respect the runtime packing convention, e.g. (b s) t d -> b t s d.
  • The editable geometry may change routing across the stream axis only; it may not alter sequence order, break causality, or entangle separate batch elements.
  • Changing future tokens must not change earlier-position logits.
  • Changing one batch element must not change logits for untouched samples.
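
A minimal sketch of a routing step that satisfies the contract above: it unpacks along the runtime's packing convention, mixes across the stream axis only, and repacks. The function `route_streams` and the fixed mixing matrix are hypothetical illustrations in NumPy, not code from any node in the run. The internal layout here is `b s t d` rather than `b t s d`; both are valid semantic unpackings of `(b s) t d`.

```python
import numpy as np


def route_streams(residuals, mix, s):
    """Mix across the stream axis only.

    residuals: packed array of shape (B*S, T, D), laid out as '(b s) t d'.
    mix:       (S, S) routing matrix; new stream i = sum_j mix[i, j] * stream j.
    """
    bs, t, d = residuals.shape
    b = bs // s
    streams = residuals.reshape(b, s, t, d)           # '(b s) t d -> b s t d'
    mixed = np.einsum("ij,bjtd->bitd", mix, streams)  # touches only the s axis
    return mixed.reshape(bs, t, d)                    # 'b s t d -> (b s) t d'


rng = np.random.default_rng(0)
B, S, T, D = 2, 4, 3, 5
residuals = rng.normal(size=(B * S, T, D))
mix = rng.normal(size=(S, S))
out = route_streams(residuals, mix, S)

# Perturb only batch element 0 at the last position and re-run.
perturbed = residuals.copy()
perturbed[:S, T - 1, :] += 1.0
delta = route_streams(perturbed, mix, S) - out
```

Because the mixing touches only the stream axis, the perturbation shows up at the same position of the same batch element and nowhere else: earlier positions and the other batch element are unchanged.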

10. Lessons and Takeaways

CliffSearch made it possible to audit the full loop, which let us pinpoint both the reward hacking in the evolutionary loop and the gate weaknesses that our post hoc audit revealed.

To strengthen the in-loop gates:

  • The task preamble should specify these invariances explicitly, enforce causality, and explain the semantics of the relevant dimensions.
  • The runtime harness should be strengthened with the leakage and invariance probes we introduced post hoc.
  • The proposer should have access to longer code context so it can reason better about dimension semantics, though this is less effective than the two changes above.
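
The second item can be sketched as a pass/fail gate built from the two probes in section 7. This is an illustration under stated assumptions, not CliffSearch's harness: the function names, the tolerance, and the three toy "models" are hypothetical, and NumPy arrays stand in for torch tensors so that the sketch runs on its own.

```python
import numpy as np


def future_prefix_diff(model, x, split):
    """Max |change in logits| before `split` after editing only positions >= split."""
    x_future = x.copy()
    x_future[:, split:] = np.roll(x_future[:, split:], shift=1, axis=1)
    return float(np.abs(model(x)[:, :split] - model(x_future)[:, :split]).max())


def batch_isolation_diff(model, batch):
    """Max |change in logits| on sample 0 after editing only sample 1."""
    edited = batch.copy()
    edited[1] = np.roll(edited[1], shift=1, axis=0)
    return float(np.abs(model(batch)[0] - model(edited)[0]).max())


def passes_leakage_gate(model, x, split, tol=1e-6):
    """Hypothetical gate: both probes must stay within `tol`."""
    return (future_prefix_diff(model, x, split) <= tol
            and batch_isolation_diff(model, x) <= tol)


# Toy stand-ins for a language model, mapping (B, T) inputs to (B, T, 1) "logits".
def causal_model(x):
    # Each position sees only itself and the past.
    return np.cumsum(x, axis=1).astype(np.float64)[..., None]


def future_leaky_model(x):
    # Each position also sees the next token.
    return (x + np.roll(x, -1, axis=1)).astype(np.float64)[..., None]


def batch_leaky_model(x):
    # Each position also sees the batch mean at that position.
    return (np.cumsum(x, axis=1) + x.mean(axis=0, keepdims=True))[..., None]


x = np.arange(12).reshape(2, 6)  # distinct tokens so that rolling changes the input
results = {
    "causal": passes_leakage_gate(causal_model, x, split=3),
    "future_leaky": passes_leakage_gate(future_leaky_model, x, split=3),
    "batch_leaky": passes_leakage_gate(batch_leaky_model, x, split=3),
}
```

Only the causal toy model passes. Run in-loop, a gate of this shape would give the selection step a verifiable signal that is independent of both the benchmark score and the reviewer's reading of the code.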