Task Contract
Every run starts from a task contract: what artifact is being evolved, which benchmark adapter executes
train/eval, and which benchmark primary metric defines quality. The loop remains generic, while task
behavior is injected by task_type, task_preamble, and
runtime grounding checks.
In practice this makes CliffSearch task-agnostic. To run a custom task, you provide a benchmark that
interfaces with the search runtime, executes the candidate artifact, and returns the primary metric in the
form the contract expects.
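A benchmark for a custom task could be sketched as below. The names `BenchmarkResult`, `BenchmarkAdapter`, `run`, and `LineCountBenchmark` are hypothetical and are not CliffSearch's actual interface; only the returned pair of primary metric and direction is taken from the contract described on this page.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class BenchmarkResult:
    # The two values the runtime is described as expecting from a benchmark.
    primary_metric: float
    higher_is_better: bool


class BenchmarkAdapter(Protocol):
    # Hypothetical interface: execute one candidate artifact and report its metric.
    def run(self, code_content: str) -> BenchmarkResult: ...


class LineCountBenchmark:
    """Toy adapter: scores an artifact by its non-empty line count (lower is better)."""

    def run(self, code_content: str) -> BenchmarkResult:
        lines = [ln for ln in code_content.splitlines() if ln.strip()]
        return BenchmarkResult(primary_metric=float(len(lines)), higher_is_better=False)


result = LineCountBenchmark().run("x = 1\n\ny = 2\n")
```

A real adapter would run training and evaluation instead of counting lines; the point of the sketch is only the shape of what comes back.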
Node Schema
The canonical node payload is summary_md, code_content, and theory_content. In
code_only, theory_content is normalized to empty, while the storage and
visualizer schema stay identical for compatibility.
This JSON node is the main evolutionary object: it is what is passed across pairing, crossover, mutation,
benchmark, review, persistence, and visualization, with benchmark and review fields attached back onto the
same node after evaluation.
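As a minimal sketch of the payload, the three fields can be modelled as a dataclass and serialized to JSON. The `Node` class and `normalize` helper are hypothetical, and treating `code_only` as a value of `task_type` is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Node:
    summary_md: str
    code_content: str
    theory_content: str = ""


def normalize(node: Node, task_type: str) -> Node:
    # Assumption: code_only is a task_type value. Theory is emptied, but the
    # field itself stays so storage and visualizer see the same schema.
    if task_type == "code_only":
        return Node(node.summary_md, node.code_content, "")
    return node


node = normalize(Node("Sort variant", "def f(): pass", "Claim: O(n log n)"), "code_only")
payload = json.dumps(asdict(node))
```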
Agent Roles
The pair selector operates on winner summaries only. Crossover and mutation agents consume full parent
context and emit strict child JSON. The reviewer consumes artifacts, the benchmark payload, and lineage
metadata (including parent context when available) and emits correctness and originality scores.
Operationally, the agent interface is JSON-in / JSON-out: agents receive structured JSON context from the
runtime and are expected to return strict JSON outputs that are validated before they are admitted into the
loop.
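The validation step can be illustrated with a small parser that rejects anything other than a JSON object with exactly the three node fields. The function name `parse_child` is hypothetical, and rejecting extra keys is an assumption about what "strict" means here.

```python
import json
from typing import Optional

REQUIRED_FIELDS = ("summary_md", "code_content", "theory_content")


def parse_child(raw: str) -> Optional[dict]:
    """Return the parsed node if the agent output is strict node JSON, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Assumption: "strict" means exactly these keys, all holding strings.
    if set(data) != set(REQUIRED_FIELDS):
        return None
    if not all(isinstance(data[key], str) for key in REQUIRED_FIELDS):
        return None
    return data


good = parse_child('{"summary_md": "s", "code_content": "c", "theory_content": ""}')
bad_json = parse_child("not json at all")
missing_keys = parse_child('{"summary_md": "s"}')
```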
Winner Rule and Score Direction
The benchmark returns (primary_metric, higher_is_better). The runtime converts this to a
directional score that is always higher-is-better:
score = primary_metric if higher_is_better is true, otherwise
score = -primary_metric. Winners must be correct, original, and above the generation
median score.
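The score conversion and the winner rule can be sketched in a few lines. The function names and the dict-based node layout are hypothetical; computing the median over every node in the generation and requiring a score strictly above it are assumptions.

```python
from statistics import median


def directional_score(primary_metric: float, higher_is_better: bool) -> float:
    """Flip the sign of lower-is-better metrics so larger always means better."""
    return primary_metric if higher_is_better else -primary_metric


def select_winners(nodes: list[dict]) -> list[dict]:
    # Assumption: the median is taken over all nodes in the generation.
    cutoff = median(n["score"] for n in nodes)
    return [n for n in nodes if n["correct"] and n["original"] and n["score"] > cutoff]


# A lower-is-better metric (for example a loss), so scores are negated.
generation = [
    {"name": "a", "metric": 0.2, "correct": True, "original": True},
    {"name": "b", "metric": 0.5, "correct": True, "original": True},
    {"name": "c", "metric": 0.9, "correct": True, "original": False},
]
for n in generation:
    n["score"] = directional_score(n["metric"], higher_is_better=False)

winner_names = [n["name"] for n in select_winners(generation)]
```

Here only node `a` wins: `b` sits exactly on the median rather than above it, and `c` is both non-original and below the median.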
Mutation System: Why Two Mutations
CliffSearch does not use a single generic mutation. It separates mutation into two operators because
discovery and repair are different search goals. Exploration mutation is used to increase novelty while
keeping task validity; correction mutation is used to recover from mathematical/runtime weaknesses and improve
reliability.
The routing rule for non-winners is deterministic after benchmark + review:
correct & non-original -> exploration mutation,
otherwise correction mutation. This prevents “creative” operators from dominating
nodes that primarily need repair, and prevents conservative repair from collapsing diversity.
| Mutation Type | Trigger | Primary Goal | Typical Behavior |
| --- | --- | --- | --- |
| Exploration Mutation | Correct but non-original nodes | Novel mechanism search with valid contracts | Adjacent-domain transfer, broader redesign, new algorithmic hypotheses |
| Correction Mutation | Incorrect nodes or weak-score nodes | Correctness and robustness recovery | Targeted edits, conservative claims updates, minimal-risk fixes |
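The routing rule and the table above reduce to a small pure function. The `Route` enum and `route` function are hypothetical names; the branch order follows the rule as stated, with winners set aside first.

```python
from enum import Enum


class Route(Enum):
    WINNER = "winner"
    EXPLORATION = "exploration_mutation"
    CORRECTION = "correction_mutation"


def route(is_winner: bool, correct: bool, original: bool) -> Route:
    """Deterministic routing from review flags; no randomness involved."""
    if is_winner:
        return Route.WINNER
    if correct and not original:
        return Route.EXPLORATION
    # Everything else: incorrect nodes, and correct, original nodes with weak scores.
    return Route.CORRECTION
```

Note that a node which is correct and original but not a winner falls through to correction mutation, which matches the "weak-score nodes" trigger in the table.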
Both mutation agents receive full parent context: summary, code, theory (in
code_and_theory), benchmark summary and details (including errors when present), and lineage
metadata. Both must emit strict node JSON
(summary_md, code_content,
theory_content). Outputs are schema-validated before expensive benchmark execution.
If mutation output is invalid or the SDK call fails, runtime uses deterministic fallback child construction so
population closure remains guaranteed. This means mutation failure degrades search quality for that node, but
does not break the generation loop.
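The fallback path might be sketched as follows. How CliffSearch actually constructs the fallback child is not stated here; copying the parent's three fields unchanged is purely an assumption, and `mutate_with_fallback` and `is_valid_node` are hypothetical names.

```python
from typing import Callable

FIELDS = ("summary_md", "code_content", "theory_content")


def is_valid_node(candidate: object) -> bool:
    return (
        isinstance(candidate, dict)
        and set(candidate) == set(FIELDS)
        and all(isinstance(candidate[key], str) for key in FIELDS)
    )


def mutate_with_fallback(parent: dict, mutate: Callable[[dict], object]) -> dict:
    """Always return a valid child, even if the mutation agent fails."""
    try:
        candidate = mutate(parent)
    except Exception:
        candidate = None  # covers a failed SDK call
    if is_valid_node(candidate):
        return candidate
    # Assumed fallback: a deterministic copy of the parent's node fields.
    return {key: parent[key] for key in FIELDS}


def failing_agent(parent: dict) -> dict:
    raise RuntimeError("SDK call failed")


parent = {"summary_md": "s", "code_content": "c", "theory_content": "t"}
child = mutate_with_fallback(parent, failing_agent)
```

Whatever the real fallback does, the property that matters is the one shown: the function never returns nothing, so the generation keeps its size.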
Generation Composition and Closure
Non-winners are routed by review/score signals: correct but non-original nodes go to exploration mutation;
incorrect or weak nodes go to correction mutation. The next population is composed with quota budgets
(elite + crossover + mutation + fill) to guarantee exact fixed-size closure every
generation, with deterministic fallback paths when agent outputs fail validation.
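Fixed-size closure under quota budgets can be illustrated as below. The function name, the quota dictionary, and the fill strategy (cycling through a fill pool in order) are all assumptions; the source only states that the four budgets together yield an exact population size.

```python
def compose_population(
    elite: list, crossover: list, mutation: list,
    fill_pool: list, quotas: dict, size: int,
) -> list:
    """Take up to each quota, then fill deterministically to exactly `size`."""
    nxt = (
        elite[: quotas["elite"]]
        + crossover[: quotas["crossover"]]
        + mutation[: quotas["mutation"]]
    )[:size]
    if len(nxt) < size and not fill_pool:
        raise ValueError("fill pool is empty but the population is short")
    i = 0
    while len(nxt) < size:
        # Assumed fill rule: cycle through the pool in a fixed order.
        nxt.append(fill_pool[i % len(fill_pool)])
        i += 1
    return nxt


population = compose_population(
    elite=["e1", "e2"], crossover=["c1"], mutation=["m1", "m2", "m3"],
    fill_pool=["f1"], quotas={"elite": 1, "crossover": 2, "mutation": 2}, size=6,
)
```

In this example crossover produced only one child against a quota of two, so the fill budget absorbs the shortfall and the population still has exactly six members.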