Task Contract and Runtime Harness
Every run starts from a task contract: what artifact is being evolved, which benchmark adapter executes
train/eval, and which primary benchmark metric defines quality. The loop itself remains generic; task
behavior is injected through task_type, task_preamble, and
the runtime harness, which validates node shape and task grounding before expensive execution.
In practice this makes CliffSearch task-agnostic. To run a custom task, you provide a benchmark that
plugs into the search runtime, executes the candidate artifact, and returns the primary metric in the
form the contract expects. The runtime harness is where task-specific checks live, so malformed artifacts
fail early, before a full benchmark run.
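The split between cheap harness checks and the expensive benchmark can be sketched as follows. This is a minimal illustration, not the CliffSearch API: the names harness_precheck, toy_benchmark, and run_candidate are hypothetical, the candidate is assumed to be Python source defining a function f, and the only fact taken from the contract is that a benchmark returns (primary_metric, higher_is_better).

```python
from typing import Callable, Dict, List, Tuple

# Assumed shape of a benchmark: candidate source in, (primary_metric, higher_is_better) out.
BenchmarkFn = Callable[[str], Tuple[float, bool]]


def harness_precheck(code_content: str) -> List[str]:
    """Cheap, task-specific checks that run before the benchmark (hypothetical)."""
    problems: List[str] = []
    if not code_content.strip():
        problems.append("code_content is empty")
        return problems
    try:
        # Compiling catches syntax errors without executing anything.
        compile(code_content, "<candidate>", "exec")
    except SyntaxError as exc:
        problems.append(f"syntax error: {exc.msg}")
    return problems


def toy_benchmark(code_content: str) -> Tuple[float, bool]:
    """Toy stand-in for an expensive benchmark: squared error of f(3) against 9."""
    namespace: Dict[str, object] = {}
    exec(code_content, namespace)  # toy only; a real benchmark would sandbox this
    loss = (namespace["f"](3) - 9) ** 2
    return float(loss), False  # a loss, so lower is better


def run_candidate(code_content: str, benchmark: BenchmarkFn) -> dict:
    """Fail early on malformed artifacts; otherwise run the benchmark."""
    problems = harness_precheck(code_content)
    if problems:
        return {"ok": False, "errors": problems}
    metric, higher_is_better = benchmark(code_content)
    return {"ok": True, "primary_metric": metric, "higher_is_better": higher_is_better}
```

A candidate with a syntax error is rejected by run_candidate without toy_benchmark ever being called, which is the point of placing the checks in the harness.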
Node Schema
The canonical node payload consists of summary_md,
code_content, and theory_content. In
the config mode code_only, which this website refers to as
code+design-intent, the formal theory_content is
normalized to empty, but summary_md still carries the design and ideation
principles. The mode therefore remains theory-bearing in that structured sense, and it keeps the same storage and
visualizer schema for compatibility.
This JSON node is the main evolutionary object: it is what is passed across pairing, crossover, mutation,
benchmark, review, persistence, and visualization, with benchmark and review fields attached back onto the
same node after evaluation.
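As a rough illustration of the payload and the code_only normalization, the node can be modeled as a small dataclass. The class name Node and the helper normalize_node are hypothetical; only the three field names and the two mode names (code_only, code_and_theory) come from the description above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Node:
    summary_md: str
    code_content: str
    theory_content: str = ""


def normalize_node(node: Node, mode: str) -> Node:
    """Return a node whose theory_content matches the config mode."""
    if mode == "code_only":
        # Formal theory is blanked; summary_md still carries the design intent.
        return Node(node.summary_md, node.code_content, "")
    if mode == "code_and_theory":
        return node
    raise ValueError(f"unknown mode: {mode!r}")


node = normalize_node(
    Node("Use momentum to damp oscillation.", "def step(): ...", "Lemma 1 ..."),
    "code_only",
)
# The serialized shape is the same in both modes, so storage and the
# visualizer do not need a separate schema.
serialized = json.dumps(asdict(node))
```

Because theory_content is emptied rather than dropped, the JSON keys are identical across modes.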
Agent Roles
The pair selector operates on winner summaries only; the crossover and mutation agents consume full parent context
and emit strict child JSON; the reviewer consumes artifacts, the benchmark payload, and lineage metadata (including
parent context when available) and emits correctness and originality scores.
Operationally, the agent interface is JSON-in / JSON-out: agents receive structured JSON context from the
runtime and are expected to return strict JSON outputs that are validated before they are admitted into the
loop.
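A validation gate for agent output might look like the sketch below. It assumes that "strict" means exactly the three node fields, all strings, with no extra keys; that interpretation, and the function name parse_child, are assumptions rather than the documented behavior.

```python
import json

REQUIRED_FIELDS = ("summary_md", "code_content", "theory_content")


def parse_child(raw: str) -> dict:
    """Parse an agent reply and validate it; raise ValueError if it is not admissible."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("top-level JSON value must be an object")
    missing = [k for k in REQUIRED_FIELDS if k not in payload]
    extra = [k for k in payload if k not in REQUIRED_FIELDS]  # assumed: extras are rejected
    wrong_type = [
        k for k in REQUIRED_FIELDS if k in payload and not isinstance(payload[k], str)
    ]
    if missing or extra or wrong_type:
        raise ValueError(
            f"schema violation: missing={missing} extra={extra} wrong_type={wrong_type}"
        )
    return payload
```

Raising a single exception type keeps the caller simple: any ValueError means the output never enters the loop.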
Winner Rule and Score Direction
The benchmark returns (primary_metric, higher_is_better). The runtime converts this to a
directional score that is always higher-is-better:
score = metric if higher-is-better, otherwise
score = -metric. Winners must be correct, original, and above the generation
median score.
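The score conversion and winner rule can be written out as a short sketch. Two things here are assumptions: the reviewer's correctness and originality scores are taken to be already reduced to booleans (correct, original), and the median is computed over every node in the generation with a strict "above" comparison.

```python
from statistics import median
from typing import List


def directional_score(metric: float, higher_is_better: bool) -> float:
    """Map any primary metric onto a scale where larger is always better."""
    return metric if higher_is_better else -metric


def select_winners(nodes: List[dict]) -> List[dict]:
    """Winners are correct, original, and strictly above the generation median."""
    scores = [
        directional_score(n["primary_metric"], n["higher_is_better"]) for n in nodes
    ]
    cutoff = median(scores)
    return [
        n
        for n, s in zip(nodes, scores)
        if n["correct"] and n["original"] and s > cutoff
    ]


# Example generation scored by a loss (lower is better).
generation = [
    {"id": "a", "primary_metric": 1.0, "higher_is_better": False, "correct": True, "original": True},
    {"id": "b", "primary_metric": 2.0, "higher_is_better": False, "correct": True, "original": False},
    {"id": "c", "primary_metric": 3.0, "higher_is_better": False, "correct": True, "original": True},
    {"id": "d", "primary_metric": 4.0, "higher_is_better": False, "correct": False, "original": True},
]
winner_ids = [n["id"] for n in select_winners(generation)]
```

Here the directional scores are -1, -2, -3, -4 with median -2.5, so only nodes a and b clear the cutoff, and b is excluded for being non-original.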
Mutation System: Why Two Mutations
CliffSearch does not use a single generic mutation. It separates mutation into two operators because
discovery and repair are different search goals. Exploration mutation is used to increase novelty while
keeping task validity; correction mutation is used to recover from mathematical/runtime weaknesses and improve
reliability.
The routing rule is deterministic once benchmark and review have run:
correct & non-original -> exploration mutation,
otherwise correction mutation. This prevents “creative” operators from dominating
nodes that primarily need repair, and prevents conservative repair from collapsing diversity.
| Mutation Type | Trigger | Primary Goal | Typical Behavior |
| --- | --- | --- | --- |
| Exploration Mutation | Correct but non-original nodes | Novel mechanism search with valid contracts | Adjacent-domain transfer, broader redesign, new algorithmic hypotheses |
| Correction Mutation | Incorrect nodes or weak-score nodes | Correctness and robustness recovery | Targeted edits, conservative claims updates, minimal-risk fixes |
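Expressed as code, the routing rule for a non-winner reduces to a single branch. The function name route_mutation and the boolean inputs are illustrative assumptions; the rule itself is the one stated above.

```python
def route_mutation(correct: bool, original: bool) -> str:
    """Pick the mutation operator for a non-winner node."""
    if correct and not original:
        return "exploration"  # valid but derivative: push for novelty
    return "correction"       # incorrect, or correct and original but weak: repair


# All four review outcomes for a non-winner.
routes = {
    (correct, original): route_mutation(correct, original)
    for correct in (True, False)
    for original in (True, False)
}
```

A node that is both correct and original only reaches this function when its score fell at or below the generation median, which is why it lands in the correction branch as a weak-score node.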
Both mutation agents receive full parent context: summary, code, theory (in
code_and_theory mode), benchmark summary and details (including errors when present), and lineage
metadata. Both must emit strict node JSON
(summary_md, code_content,
theory_content). Outputs are schema-validated before expensive benchmark execution.
If the mutation output is invalid or the SDK call fails, the runtime constructs a child through a deterministic
fallback so that population closure remains guaranteed. A mutation failure therefore degrades search quality for
that node but does not break the generation loop.
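The source does not specify how the fallback child is built, so the sketch below makes one up for illustration: it carries the parent artifact forward unchanged and appends a note to summary_md. The names validate_node, fallback_child, and mutate_with_fallback are hypothetical, and mutate stands in for whatever agent call the runtime makes.

```python
from typing import Callable, Tuple

NODE_FIELDS = ("summary_md", "code_content", "theory_content")


def validate_node(node: dict) -> None:
    """Raise ValueError unless the node has exactly the expected string fields."""
    if not isinstance(node, dict) or set(node) != set(NODE_FIELDS):
        raise ValueError("node must contain exactly the three node fields")
    if not all(isinstance(node[k], str) for k in NODE_FIELDS):
        raise ValueError("all node fields must be strings")


def fallback_child(parent: dict) -> dict:
    """Assumed fallback: same input always yields the same child, no randomness."""
    child = {k: parent[k] for k in NODE_FIELDS}
    child["summary_md"] += "\n\n(fallback: mutation failed; parent artifact carried forward)"
    return child


def mutate_with_fallback(
    parent: dict, mutate: Callable[[dict], dict]
) -> Tuple[dict, bool]:
    """Return (child, used_fallback); never raises for agent or schema failures."""
    try:
        child = mutate(parent)
        validate_node(child)
        return child, False
    except Exception:
        # Broad on purpose: an SDK error and a schema error get the same treatment.
        return fallback_child(parent), True
```

Returning the used_fallback flag lets the runtime record that this slot was filled without a real mutation, so the loss in search quality stays visible in the lineage.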
Generation Composition and Closure
Non-winners are routed by review and score signals: correct but non-original nodes go to exploration mutation;
incorrect or weak nodes go to correction mutation. The next population is composed with quota budgets
(elite + crossover + mutation + fill) to guarantee exact fixed-size closure every
generation, with deterministic fallback paths when agent outputs fail validation.
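One way to picture quota-based closure is the sketch below. The function compose_population, the shape of the quotas dict, and the policy of cycling through a fill pool are all assumptions made for illustration; the source states only that quotas for elite, crossover, mutation, and fill produce an exact fixed-size population.

```python
from itertools import cycle, islice
from typing import Dict, List


def compose_population(
    size: int,
    elite: List[str],
    crossover: List[str],
    mutation: List[str],
    fill_pool: List[str],
    quotas: Dict[str, int],
) -> List[str]:
    """Build the next generation with exactly `size` members."""
    if quotas["elite"] + quotas["crossover"] + quotas["mutation"] > size:
        raise ValueError("quotas exceed population size")
    nxt: List[str] = []
    nxt += elite[: quotas["elite"]]
    nxt += crossover[: quotas["crossover"]]
    nxt += mutation[: quotas["mutation"]]
    # Any budget a source could not meet is covered by the fill pool.
    shortfall = size - len(nxt)
    if shortfall > 0:
        if not fill_pool:
            raise ValueError("fill pool is empty; cannot close the population")
        nxt += list(islice(cycle(fill_pool), shortfall))
    return nxt


next_gen = compose_population(
    size=6,
    elite=["e1", "e2", "e3"],
    crossover=["c1"],  # one short of its quota
    mutation=["m1", "m2"],
    fill_pool=["f1", "f2"],
    quotas={"elite": 2, "crossover": 2, "mutation": 1},
)
```

In this example crossover delivers one child against a quota of two, so two fill slots are used to reach the fixed size of six.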