Framework

CliffSearch combines evolutionary search with role-specialized LLM operators and strict runtime contracts. The main evolutionary object is a structured JSON node carrying the artifact fields used throughout the cycle; benchmarks provide task metrics, and reviewer outputs gate survival through correctness and originality.

Framework Semantics

Task Contract

Every run starts from a task contract: what artifact is being evolved, which benchmark adapter executes train/eval, and which primary metric defines quality. The loop remains generic; task behavior is injected via task_type, task_preamble, and runtime grounding checks.

In practice this makes CliffSearch task-agnostic. To run a custom task, you provide a benchmark adapter that plugs into the search runtime, executes the candidate artifact, and returns the primary metric in the expected contract.
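As a sketch of what such a contract could look like (the names BenchmarkAdapter, BenchmarkResult, and run are illustrative assumptions, not the framework's actual API):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class BenchmarkResult:
    # Primary metric plus its direction, per the task contract.
    primary_metric: float
    higher_is_better: bool


class BenchmarkAdapter(Protocol):
    """Hypothetical interface a custom benchmark adapter would satisfy."""

    def run(self, node: dict) -> BenchmarkResult:
        """Execute train/eval for the candidate artifact, return the metric."""
        ...


class ToyBenchmark:
    # Illustrative adapter: derives a fake score from the artifact length.
    def run(self, node: dict) -> BenchmarkResult:
        score = min(1.0, len(node.get("code_content", "")) / 100.0)
        return BenchmarkResult(primary_metric=score, higher_is_better=True)
```

Any object satisfying the same call shape could then be swapped in without touching the generic loop.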

Node Schema

The canonical node payload is summary_md, code_content, and theory_content. In code_only mode, theory_content is normalized to an empty string while the storage and visualizer schema remain identical for compatibility.

This JSON node is the main evolutionary object: it is what is passed across pairing, crossover, mutation, benchmark, review, persistence, and visualization, with benchmark and review fields attached back onto the same node after evaluation.
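A minimal sketch of that lifecycle, with toy field values (the benchmark/review key names are assumptions for illustration):

```python
import json

# Hypothetical node payload; the three artifact fields follow the
# canonical schema named above.
node = {
    "summary_md": "## Candidate\nToy summary of the evolved artifact.",
    "code_content": "def train():\n    return 0.0\n",
    "theory_content": "",  # normalized to empty in code_only mode
}

# The same JSON object travels through pairing, crossover, mutation,
# benchmark, review, persistence, and visualization; evaluation results
# are attached back onto it rather than stored separately.
node["benchmark"] = {"primary_metric": 0.42, "higher_is_better": True}
node["review"] = {"correct": True, "original": False}

serialized = json.dumps(node)  # what persistence/visualization would see
```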

Agent Roles

The pair selector operates on winner summaries only; crossover and mutation agents consume full parent context and emit strict child JSON; the reviewer consumes the artifacts, benchmark payload, and lineage metadata (including parent context when available) and emits correctness/originality scores.

Operationally, the agent interface is JSON-in / JSON-out: agents receive structured JSON context from the runtime and are expected to return strict JSON outputs that are validated before they are admitted into the loop.
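A guard of that kind could be sketched as follows (validate_agent_output is a hypothetical helper, not the runtime's actual validator):

```python
import json

REQUIRED_FIELDS = ("summary_md", "code_content", "theory_content")


def validate_agent_output(raw: str):
    """Hypothetical strict-JSON gate: parse the agent's raw output and
    check the child node schema. Returns the payload on success, or
    None on any violation, so invalid outputs never enter the loop."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    if not all(isinstance(payload.get(k), str) for k in REQUIRED_FIELDS):
        return None
    return payload
```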

Winner Rule and Score Direction

The benchmark returns (primary_metric, higher_is_better). The runtime converts this into a directional score that is always higher-is-better: score = metric if higher_is_better, else score = -metric. Winners must be correct, original, and above the generation median score.
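The two rules above can be sketched directly (is_winner is a hypothetical name; the median gate is as stated in the text):

```python
from statistics import median


def directional_score(primary_metric: float, higher_is_better: bool) -> float:
    # Negate when lower is better, so every score is higher-is-better.
    return primary_metric if higher_is_better else -primary_metric


def is_winner(score: float, correct: bool, original: bool,
              generation_scores: list) -> bool:
    # Winner gate: correct, original, and strictly above the generation median.
    return correct and original and score > median(generation_scores)
```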

Mutation System: Why Two Mutations

CliffSearch does not use a single generic mutation. It separates mutation into two operators because discovery and repair are different search goals. Exploration mutation is used to increase novelty while keeping task validity; correction mutation is used to recover from mathematical/runtime weaknesses and improve reliability.

The routing rule is deterministic after benchmark + review: correct and non-original nodes go to exploration mutation; everything else goes to correction mutation. This prevents “creative” operators from dominating nodes that primarily need repair, and prevents conservative repair from collapsing diversity.

| Mutation Type | Trigger | Primary Goal | Typical Behavior |
| --- | --- | --- | --- |
| Exploration Mutation | Correct but non-original nodes | Novel mechanism search with valid contracts | Adjacent-domain transfer, broader redesign, new algorithmic hypotheses |
| Correction Mutation | Incorrect nodes or weak-score nodes | Correctness and robustness recovery | Targeted edits, conservative claims updates, minimal-risk fixes |
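The deterministic routing rule reduces to a two-branch predicate (route_mutation and its return labels are illustrative names):

```python
def route_mutation(correct: bool, original: bool) -> str:
    """Hypothetical router applied after benchmark + review.
    Correct-but-non-original nodes explore; all others get repaired."""
    if correct and not original:
        return "exploration"
    return "correction"
```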

Both mutation agents receive full parent context (summary, code, theory in code_and_theory mode, benchmark summary/details including errors when present, and lineage metadata). Both must emit strict node JSON (summary_md, code_content, theory_content). Outputs are schema-validated before expensive benchmark execution.

If mutation output is invalid or the SDK call fails, the runtime uses deterministic fallback child construction so population closure remains guaranteed. Mutation failure therefore degrades search quality for that node but does not break the generation loop.
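One simple way such a fallback could work, assuming fallback_child and the summary prefix are illustrative inventions: clone the parent deterministically so the slot is always filled.

```python
import copy


def fallback_child(parent: dict, generation: int) -> dict:
    """Hypothetical deterministic fallback: deep-copy the parent so the
    population keeps its fixed size even when a mutation call fails.
    The marker in summary_md is purely illustrative bookkeeping."""
    child = copy.deepcopy(parent)
    child["summary_md"] = f"[fallback g{generation}] " + parent.get("summary_md", "")
    return child
```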

Generation Composition and Closure

Non-winners are routed by review/score signals: correct but non-original nodes go to exploration mutation; incorrect or weak nodes go to correction mutation. The next population is composed with quota budgets (elite + crossover + mutation + fill) to guarantee exact fixed-size closure every generation, with deterministic fallback paths when agent outputs fail validation.
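A quota-based composer could be sketched like this (function and parameter names are assumptions; the actual budget values are set by the runtime):

```python
def compose_next_population(elite, crossover, mutation, fill, size, quotas):
    """Hypothetical composer: each bucket contributes at most its quota,
    in priority order, and the fill pool deterministically tops up any
    shortfall so the next population has exactly `size` nodes."""
    nxt = []
    for bucket, quota in zip((elite, crossover, mutation), quotas):
        nxt.extend(bucket[:quota])
    for node in fill:
        if len(nxt) == size:
            break
        nxt.append(node)
    return nxt[:size]  # exact fixed-size closure
```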

Generation Workflow

flowchart TD
  P["Population P(g)"] --> S["Winner Selection\n(correct & original & benchmark gate)"]
  P --> E["Exploration Bucket\n(correct & non-original)"]
  P --> C["Correction Bucket\n(other nodes)"]
  S --> PS["Pair Selector Agent\n(summary-only)"]
  PS --> X["Crossover Agent"]
  E --> EM["Exploration Mutation Agent"]
  C --> CM["Correction Mutation Agent"]
  X --> N["Compose Next Population\n(quota + elite + fill)"]
  EM --> N
  CM --> N
  N --> B["Benchmark Adapter"]
  B --> R["Reviewer Agent\n(code + theory + benchmark + lineage)"]
  R --> G["Persist Snapshot\nP(g+1)"]
        

Distributed Multi-Island Runtime

flowchart TD
  I1["Island 1\n(1 machine)"] --> O["Shared Orchestration Block"]
  I2["Island 2\n(1 machine)"] --> O
  I3["Island 3\n(1 machine)"] --> O
  I4["Island 4\n(1 machine)"] --> O
  O --> D["Shared Disk\n(outbox/inbox/state)"]
  D --> O
  O --> I1
  O --> I2
  O --> I3
  O --> I4
        

Execution Notes

CPU Side

SDK calls (pairing, crossover, mutation, review) run on CPU worker queues with bounded in-flight limits.
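A minimal sketch of a bounded worker queue, assuming a thread pool stands in for the actual CPU-side queue implementation:

```python
from concurrent.futures import ThreadPoolExecutor


def run_sdk_calls(calls, max_in_flight=4):
    """Hypothetical bounded queue: max_in_flight caps how many SDK
    requests (pairing, crossover, mutation, review) run concurrently.
    Results come back in submission order."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(lambda call: call(), calls))
```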

GPU Side

Benchmark adapters consume node artifacts and execute train/eval on assigned GPU slots. Results are persisted as node benchmark payloads and included in reviewer context.

Persistence and Auditability

Each generation persists node-level artifacts, population.json, and both generation-local and cumulative ga_data.json files. This enables deterministic replay, visualizer snapshots, and post-hoc extraction for best-node reports.
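The snapshot layout could be sketched as follows (persist_generation and the gen_NNNN directory naming are assumptions; only the file names population.json and ga_data.json come from the text above):

```python
import json
import pathlib


def persist_generation(out_dir, generation, population, ga_local, ga_cumulative):
    """Hypothetical snapshot writer: per-generation population.json and
    ga_data.json, plus a cumulative ga_data.json at the run root, so a
    run can be replayed or mined for best-node reports afterwards."""
    root = pathlib.Path(out_dir)
    gdir = root / f"gen_{generation:04d}"
    gdir.mkdir(parents=True, exist_ok=True)
    (gdir / "population.json").write_text(json.dumps(population, indent=2))
    (gdir / "ga_data.json").write_text(json.dumps(ga_local, indent=2))
    (root / "ga_data.json").write_text(json.dumps(ga_cumulative, indent=2))
    return gdir
```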