Introduction

Evolve is a drop-in tool that learns which AI-assistant configuration works best for your codebase, automatically, via champion-vs-challenger A/B tests.

You keep using Claude Code, Cursor, or Aider normally. Every session, Evolve collects implicit signals (tests pass/fail, you typed /clear, you accepted the diff) and occasionally generates a slightly different configuration to try against your current champion. When the Bayesian posterior that the challenger outperforms the champion crosses 95%, Evolve promotes the challenger. Otherwise nothing changes.

Why this exists

You already tweak your CLAUDE.md, .cursorrules, and aider.conf.yml by hand, guessing what works. Evolve turns that guessing into an empirical test anchored to your sessions.

What it is not

Not a rewrite of the tool you use. Your Claude Code / Cursor / Aider setup keeps running exactly as before.
Not a cloud service. Everything runs locally and is stored in a single SQLite file at ~/.evolve/evolve.db. No telemetry, no outbound calls except the occasional mutation prompt (to your configured LLM provider).
Not an opinionated "best prompt" catalog. Evolve only knows what your sessions say works.

Install

cargo install evolve-cli       # or: pip install evolveai / npm install evolveai
evolve init claude-code         # in your project root

That's it. Go back to coding.

Quickstart

Pick your tool: Claude Code, Cursor, or Aider. Each has its own quickstart section.

What happens when you run `evolve init`

Evolve detects the tool's config files in your project.
Installs a session hook so sessions report back on completion.
Writes an initial managed section inside your existing config files (bracketed by evolve:start / evolve:end markers; nothing outside those markers is ever touched).
Records the project + its starting champion config in ~/.evolve/evolve.db.
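
A minimal sketch of how a managed section can be rewritten without touching the rest of the file. This is illustrative Python, not Evolve's Rust code; the function name is hypothetical, and the marker strings are an assumption (the real markers may be wrapped in comment syntax, so the sketch matches any line that contains the marker text).

```python
START = "evolve:start"  # assumed marker text
END = "evolve:end"      # assumed marker text


def replace_managed_section(text: str, new_body: str) -> str:
    """Return `text` with only the lines between the two marker lines replaced."""
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines) if START in line), None)
    end = next((i for i, line in enumerate(lines) if END in line), None)
    if start is None or end is None or end < start:
        raise ValueError("managed section markers not found")
    # Keep everything up to and including the start marker, swap the body,
    # then keep the end marker and everything after it.
    updated = lines[: start + 1] + new_body.splitlines() + lines[end:]
    trailing_newline = "\n" if text.endswith("\n") else ""
    return "\n".join(updated) + trailing_newline
```

Raising an error when the markers are missing, rather than appending a new section, keeps the sketch from writing anywhere outside the managed region.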

Use your tool normally from here. After ~20 sessions (or 7 days, whichever comes first), Evolve will propose its first challenger.

Inspecting progress

evolve status      # one line per project: last session, current champion
evolve list        # all known projects
evolve dashboard   # opens http://localhost:8787 with a live UI

Quickstart: Claude Code

cd ~/projects/my-app
evolve init claude-code

That's the whole setup. Confirmation:

jq '.hooks.Stop' .claude/settings.json
grep -A 20 "evolve:start" CLAUDE.md

You should see:

A Stop hook entry calling evolve record-claude-code <transcript-path>.
A managed section in CLAUDE.md bracketed by evolve:start/evolve:end.

Now just use Claude Code. Every time you finish a session, the hook fires and Evolve captures:

Whether you typed /clear early (counted as negative).
Whether you ran cargo test / pytest / npm test, and what the exit code was.
Whether you said "redo"/"wrong"/"perfect"/"thanks" in any message.
(Optional) Your explicit grade via evolve good / evolve bad.
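
The list above can be sketched as a small extraction step. This is illustrative Python under stated assumptions, not Evolve's implementation: the signal names, the 0/1 scoring, and the exact regexes are all hypothetical; only the keywords come from the list.

```python
import re

# Keywords taken from the list above; word-boundary matching is an assumption.
NEGATIVE = re.compile(r"\b(redo|wrong)\b", re.IGNORECASE)
POSITIVE = re.compile(r"\b(perfect|thanks)\b", re.IGNORECASE)


def implicit_signals(messages, cleared_early, test_exit_code=None):
    """Map one session's observations to (name, score) pairs with score 0.0 or 1.0."""
    signals = [("no_early_clear", 0.0 if cleared_early else 1.0)]
    if test_exit_code is not None:
        # Exit code 0 means the test command succeeded.
        signals.append(("tests_passed", 1.0 if test_exit_code == 0 else 0.0))
    for message in messages:
        if NEGATIVE.search(message):
            signals.append(("negative_feedback", 0.0))
        elif POSITIVE.search(message):
            signals.append(("positive_feedback", 1.0))
    return signals
```

Keyword matching like this is noisy ("thanks, but this is wrong" counts as negative here), which is why an explicit grade is useful as an override.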

Marking a session as good/bad

If Evolve guessed wrong about a session, override it:

evolve good    # marks the most recent session as a win
evolve bad     # ... or a loss

Explicit signals are weighted 5× implicit by default. One explicit grade typically overrides the noise of regex-matched "feedback."

Quickstart: Cursor

cd ~/projects/my-nextjs-app
evolve init cursor

This writes a managed section into .cursorrules. Cursor doesn't expose a session-end hook, so signal capture requires Evolve's proxy:

evolve proxy --for cursor &    # binds http://localhost:7777

Point Cursor's Settings → Models → Custom OpenAI Base URL at http://localhost:7777. Evolve now sees every request/response pair and records a suggestion_accepted or suggestion_rejected signal based on whether you keep the generated text.

Alternative

If you don't want to route through a proxy, just use evolve good / evolve bad after each meaningful session. The explicit signals alone are enough to drive evolution (just slower).

Quickstart: Aider

cd ~/projects/my-python-app
evolve init aider

This:

Writes a managed section into aider.conf.yml (created if missing).
Installs a git post-commit hook that calls evolve record-aider HEAD.

Now every aider commit fires the hook and records an aider_commit_observed signal. Explicit evolve good / evolve bad still applies.

Per-project test commands

If you want richer implicit signals (test pass/fail per commit), set:

# aider.conf.yml
test-cmd: pytest -q
lint-cmd: ruff check .

Aider runs these itself. Evolve does not read their exit codes yet; a future version will read them from the git hook context.

What to expect after install

The honest timeline. Days assume ~5 Claude Code sessions per day on the same project.

Day 0 — install

cargo install evolve-cli
cd ~/projects/my-app
evolve init claude-code
evolve doctor

evolve doctor should show [OK] for everything except experiment running (none yet) and sessions recorded (0 of 20).

Days 1-4 — accumulating

Use Claude Code normally. After every session, the Stop hook fires evolve record-claude-code automatically. You won't notice it.

evolve doctor after a few days shows:

[INFO] sessions recorded                12 (8 more before challenger generation)
[INFO] experiment running               none yet

This is expected. Evolve refuses to make decisions on too little data — that's a feature, not a bug.

Day 5 — first challenger

When session count hits 20, the next evolve record-claude-code call generates a challenger config and starts an experiment:

Recorded session 8e72...
Generated challenger 4f1c... in experiment a93b...

From this point forward, the SessionStart hook on Claude Code re-rolls the deployed config per session at 50/50 traffic share. About half your sessions run on the champion, half on the challenger.
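
The per-session re-roll amounts to a biased coin flip. A minimal sketch in illustrative Python (the function name and the `challenger_share` parameter are hypothetical, not part of Evolve's API):

```python
import random


def pick_arm(rng: random.Random, challenger_share: float = 0.5) -> str:
    """Choose which config a new session runs on."""
    # rng.random() is uniform on [0, 1), so the challenger is chosen
    # with probability `challenger_share`.
    return "challenger" if rng.random() < challenger_share else "champion"


rng = random.Random()
arm = pick_arm(rng)  # "champion" or "challenger"
```

Because each session is assigned independently, the split is only approximately even over a small number of sessions.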

Days 5-12 — A/B testing

evolve doctor now shows:

[OK]   experiment running               started 2026-04-29T10:14:22Z

Open the dashboard:

evolve dashboard

The "Success rate over time" panel shows two stacked bars per day: champion vs. challenger. The longer the bars, the more sessions on that day; the brighter the green, the better the average aggregate score.

After ~40 sessions split between the arms, every new session-end runs the Bayesian decision check. It will say one of:

experiment needs more data (default debug log) — keep going
experiment holding at posterior 0.62 — challenger isn't winning yet
Promoted challenger 4f1c... (posterior 0.96) — done, swap
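
The three outcomes above can be read as a simple decision rule. The sketch below is illustrative Python with a hypothetical function name; it assumes the 0.95 threshold and the minimum of 20 sessions per arm stated elsewhere in this document, and treats a posterior of exactly 0.95 as a promotion.

```python
def decide(posterior, champ_sessions, chall_sessions,
           threshold=0.95, min_per_arm=20):
    """Return the status message for one decision check."""
    if champ_sessions < min_per_arm or chall_sessions < min_per_arm:
        return "experiment needs more data"
    if posterior >= threshold:
        return f"Promoted challenger (posterior {posterior:.2f})"
    return f"experiment holding at posterior {posterior:.2f}"


print(decide(0.62, 20, 20))  # experiment holding at posterior 0.62
print(decide(0.96, 25, 22))  # Promoted challenger (posterior 0.96)
```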

Day ~12 — first promotion

When the posterior crosses 0.95, the challenger gets promoted to champion. The dashboard's "Promotion log" shows it. Your CLAUDE.md managed section is updated to the new champion's prompt prefix.

If the challenger never crosses 0.95, the experiment keeps running until you manually run evolve roll to try a different mutation, or until the two arms' success rates diverge enough for the decision check to fire on its own.

Cadence going forward

After the first promotion, the cycle repeats: 20 more sessions → another challenger → another A/B → another decision.

In practice: 1-3 promotions per month is realistic for a single active project. The compounding effect is small per generation but adds up.

When `evolve doctor` reports problems

Common patterns:

What you see	What it means
`[MISS] Stop hook installed`	`evolve init` didn't finish or was reverted. Re-run `evolve init claude-code`.
`[WARN] LLM available — no Anthropic key + no Ollama`	Mutator runs without LLM rewrite. Set `ANTHROPIC_API_KEY` or run Ollama for richer mutations.
`[INFO] sessions recorded — 0` after using Claude Code	Hook isn't firing. Check `.claude/settings.json` for the Stop hook entry.
`[INFO] experiment running — none yet` after 20+ sessions	Either the LLM is unreachable and the rule-based mutators also failed, or the `should_evolve` threshold has not been met the way you expect. Run `evolve roll` to force generation.

evolve doctor is supposed to give you these answers without you needing to ask anyone. If it doesn't, file a bug — that's the tool's job and a doctor that misses things is a defect.

How Evolution Works

The loop

          +-- (session N occurs) --+
          |                        v
          |                   signals recorded
          |                        |
          |                        v
          |              enough sessions/time?
          |              /                  \
          |            no                  yes
          |            |                    |
          |            |               generate challenger
          |            v                    |
          +---- keep champion            A/B test starts
                                             |
                                             v
                                      P(challenger > champion) >= 0.95?
                                        /                          \
                                      yes                           no
                                       |                             |
                                       v                             v
                                 promote challenger              hold / keep running

Sessions → fitness

Each session's signals are collapsed into one fitness score in [0, 1]:

Implicit signals (tests passed, user accepted, no /clear) each contribute with weight 1.
Explicit signals (evolve good / evolve bad) each contribute with weight 5.
The weighted mean is clamped to [0, 1].
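
A minimal sketch of the weighted mean, in illustrative Python rather than the Rust from evolve-core. It assumes each individual signal is already scored in [0, 1] (for example 1.0 for "tests passed" and 0.0 for an early /clear); the function name is hypothetical.

```python
def fitness(implicit, explicit, implicit_weight=1.0, explicit_weight=5.0):
    """Weighted mean of signal scores, clamped to [0, 1].

    `implicit` and `explicit` are lists of per-signal scores.
    Returns None when the session produced no signals at all.
    """
    total_weight = implicit_weight * len(implicit) + explicit_weight * len(explicit)
    if total_weight == 0:
        return None
    weighted_sum = implicit_weight * sum(implicit) + explicit_weight * sum(explicit)
    return min(1.0, max(0.0, weighted_sum / total_weight))


# Two positive implicit signals and one negative one: 2 / 3.
print(fitness([1.0, 1.0, 0.0], []))     # 0.6666666666666666
# The same session after `evolve bad`: (2 + 5 * 0) / (3 + 5) = 0.25.
print(fitness([1.0, 1.0, 0.0], [0.0]))  # 0.25
```

The second call shows why one explicit grade usually decides the outcome: it moves this session from above the 0.5 mark to well below it.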

Fitness → promotion

Scores are binarized at 0.5 (win / loss) and fed into a Beta-binomial posterior per arm:

Champion ~ Beta(1 + champ_wins, 1 + champ_losses)
Challenger ~ Beta(1 + chall_wins, 1 + chall_losses)

A 10,000-sample Monte Carlo estimates P(challenger > champion). When that exceeds 0.95 and both arms have at least 20 sessions, the challenger is promoted.
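
The estimate can be sketched with Python's standard library. This is an illustration of the technique described above, not the evolve-core code: the function names are hypothetical, and counting a score of exactly 0.5 as a win is an assumption the text does not settle.

```python
import random


def binarize(scores, cutoff=0.5):
    """Split fitness scores into (wins, losses); a score >= cutoff is a win (assumed)."""
    wins = sum(1 for score in scores if score >= cutoff)
    return wins, len(scores) - wins


def prob_challenger_beats_champion(champ_wins, champ_losses,
                                   chall_wins, chall_losses,
                                   samples=10_000, seed=None):
    """Monte Carlo estimate of P(challenger > champion) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Draw one plausible success rate per arm from its posterior.
        champ = rng.betavariate(1 + champ_wins, 1 + champ_losses)
        chall = rng.betavariate(1 + chall_wins, 1 + chall_losses)
        if chall > champ:
            hits += 1
    return hits / samples


# Champion: 10 wins, 10 losses. Challenger: 18 wins, 2 losses.
p = prob_challenger_beats_champion(10, 10, 18, 2, seed=1)
```

With 10,000 samples the estimate has a standard error of at most about 0.005, so a posterior sitting very close to 0.95 can land on either side of the threshold from one check to the next.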

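Why a Bayesian posterior and not a fixed-horizon frequentist A/B test? The posterior can be re-evaluated after every session and always reads as the current probability that the challenger is better, whereas a fixed-sample significance test loses its stated error rate when you peek repeatedly. Peeking is not entirely free here either; the 0.95 threshold and the 20-session minimum per arm keep an early, noisy lead from triggering a promotion.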

Architecture

Evolve is a Cargo workspace of eight small crates:

Crate	Responsibility
`evolve-core`	`AgentConfig` types, schema DSL, mutation operators' trait, Bayesian promotion math. Zero I/O.
`evolve-storage`	SQLite via sqlx. Migrations, repositories for projects/configs/experiments/sessions/signals.
`evolve-llm`	Minimal LLM client (Haiku + Ollama) for occasional challenger generation.
`evolve-mutators`	Five mutation operators + weighted picker.
`evolve-adapters`	Adapter trait + ClaudeCodeAdapter / CursorAdapter / AiderAdapter.
`evolve-proxy`	OpenAI-compat HTTP proxy for Cursor-like tools.
`evolve-dashboard`	Local-only axum server serving a tiny HTML SPA + REST API.
`evolve-cli`	User-facing binary (`evolve`).

Python and TypeScript bindings (under bindings/) re-export the math engine via PyO3 and napi-rs.

Where data lives

~/.evolve/evolve.db — SQLite with everything. Single file, easy to back up or delete.
Adapter-managed files (CLAUDE.md, .cursorrules, aider.conf.yml) — only the managed section between markers is ever touched.
Hooks — .claude/settings.json (Stop hook), .git/hooks/post-commit.

Layering

 evolve-cli
     |
     +-- evolve-adapters -- evolve-core (AgentConfig)
     |
     +-- evolve-storage -- evolve-core (ids)
     |
     +-- evolve-mutators -- evolve-core + evolve-llm
     |
     +-- evolve-proxy
     |
     +-- evolve-dashboard -- evolve-storage

evolve-core is the foundation: zero runtime dependencies, safe for bindings to call synchronously. Everything else builds upward.

Cost & Privacy

Cost

Evolve is local-only. The only outbound calls are the occasional mutation prompts to your configured LLM. At the default cadence (roughly one challenger per project per week), using Haiku:

Per call: ~500 input + 200 output tokens ≈ $0.000375
Per project / month: ~$0.002
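
The figures above can be reproduced with a short calculation. The per-token prices below are an assumption: they are the values that yield the stated per-call cost, so check your provider's current pricing before relying on them.

```python
# Assumed prices in USD per million tokens (chosen to match the per-call figure above).
INPUT_PRICE_PER_MTOK = 0.25
OUTPUT_PRICE_PER_MTOK = 1.25


def call_cost(input_tokens=500, output_tokens=200):
    """Cost in USD of one mutation prompt."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


per_call = call_cost()
per_month = per_call * 52 / 12  # roughly one challenger per week

print(f"{per_call:.6f}")   # 0.000375
print(f"{per_month:.4f}")  # 0.0016
```

The monthly figure rounds to the ~$0.002 quoted above, and it is an upper bound if some challengers come from mutators that make no LLM call.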

Use Ollama instead of Haiku and the cost is $0. The proxy (Cursor fallback) forwards to your existing OpenAI/Anthropic key — it adds no additional cost.

Privacy

Storage: a single local SQLite file. No cloud, no telemetry.
Signals DB: only metadata + numeric scores are persisted. No source code in the signals.payload_json column; the storage layer rejects payloads that look code-like (fn , def , class , SQL statements, etc.).
Prompts: not logged unless you pass --include-prompts on evolve init.
Outbound HTTP: only to the LLM provider you configure (for mutation generation). No analytics, no pings.
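
To make the "code-like" check concrete, here is a rough sketch of such a heuristic in illustrative Python. The actual rules in the storage layer are not documented here, so the patterns and the function name are assumptions based only on the examples named above (fn, def, class, SQL statements).

```python
import json
import re

# Hypothetical patterns: a definition keyword followed by a name, or a common SQL shape.
CODE_LIKE = re.compile(
    r"\b(fn|def|class)\s+\w+"
    r"|\b(select\s.+\sfrom|insert\s+into|update\s.+\sset|delete\s+from)\b",
    re.IGNORECASE | re.DOTALL,
)


def looks_code_like(payload_json: str) -> bool:
    """True if any string value inside the JSON payload matches the heuristic."""
    def strings(node):
        if isinstance(node, str):
            yield node
        elif isinstance(node, dict):
            for value in node.values():
                yield from strings(value)
        elif isinstance(node, list):
            for item in node:
                yield from strings(item)

    return any(CODE_LIKE.search(s) for s in strings(json.loads(payload_json)))
```

A pattern check like this errs on the side of rejecting: ordinary prose such as "first class support" would also match, which is the safer failure mode for a privacy filter.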

Opting out

evolve forget <project-id> removes all Evolve rows and restores the unmanaged versions of your config files.
evolve forget --all wipes everything.
Deleting ~/.evolve/ is always safe.

FAQ

How long before I see my first promotion?

Default thresholds require at least 20 sessions per arm before any decision, and generally ~100 combined sessions before a clear winner emerges. At ~5 sessions per day, expect your first promotion around day 10-20.
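
The lower end of that range follows from simple arithmetic. The sketch below is illustrative Python with a hypothetical function name; it assumes the first challenger appears after 20 sessions, an even split between the arms, and that only sessions recorded during the experiment count toward each arm.

```python
import math


def earliest_promotion_day(sessions_per_day=5,
                           sessions_before_challenger=20,
                           min_sessions_per_arm=20):
    """Return (days until the first challenger, days until the earliest possible promotion)."""
    first_challenger = math.ceil(sessions_before_challenger / sessions_per_day)
    # Both arms need the minimum, and traffic is split evenly between them.
    ab_days = math.ceil(2 * min_sessions_per_arm / sessions_per_day)
    return first_challenger, first_challenger + ab_days


print(earliest_promotion_day())  # (4, 12)
```

This is the best case: it assumes the challenger is clearly ahead as soon as both arms reach the minimum. A closer contest needs more sessions, which is where the upper end of the range comes from.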

What if the challenger is obviously worse?

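It never gets promoted: promotion requires the posterior P(challenger > champion) to reach 0.95, and a worse challenger's posterior falls instead of rising. Experiments whose posterior stays below 0.05 for a long time are candidates for manual abort (evolve abort <experiment-id>).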

Can I force a challenger right now?

Yes: evolve roll. This skips the normal schedule and immediately generates and deploys a challenger with 50% traffic share.

Does Evolve modify files I didn't ask it to?

No. In every config file Evolve writes to, its content is bracketed by evolve:start / evolve:end markers, and Evolve only owns what sits between those markers.

What if I hate the challenger?

Running evolve bad on the most recent session records an explicit loss (weighted 5× by default), which lowers the challenger's posterior when that session ran on the challenger. After a few bad signals the challenger is held indefinitely, and you can abort the experiment from the dashboard.

Why does it take a Haiku API call to generate challengers?

Because prompt variations that sound natural need an LLM. The other four mutators (rules, response style, model pref, permissions) run locally with no LLM calls. By default LlmRewriteMutator is picked with 50% weight, so half of all challengers cost zero.
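
A weighted pick of this kind can be sketched as follows. This is illustrative Python, not the evolve-mutators crate: only the 50% weight for LlmRewriteMutator comes from the text, while the names of the other four entries and their equal 12.5% shares are assumptions.

```python
import random

# Only the 0.5 for LlmRewriteMutator is documented; the rest is assumed.
MUTATOR_WEIGHTS = {
    "LlmRewriteMutator": 0.5,
    "rules": 0.125,
    "response_style": 0.125,
    "model_pref": 0.125,
    "permissions": 0.125,
}


def pick_mutator(rng: random.Random, weights=MUTATOR_WEIGHTS) -> str:
    """Pick one mutator name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]


rng = random.Random()
chosen = pick_mutator(rng)
```

Under these weights about half of the picks select a local mutator, which is what keeps the average cost per challenger below the per-call figure.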

Contributing

The repo is a standard Cargo workspace. Clone, then:

cargo test --workspace
cargo fmt --all -- --check
cargo clippy --workspace --all-targets -- -D warnings
cargo llvm-cov --workspace --summary-only    # coverage gate

Layout

crates/
  evolve-core/         # math, IDs, AgentConfig
  evolve-storage/      # SQLite
  evolve-llm/          # LLM clients (Anthropic + Ollama)
  evolve-mutators/     # mutation operators
  evolve-adapters/     # per-tool integration
  evolve-proxy/        # OpenAI-compat proxy
  evolve-dashboard/    # local web UI
  evolve-cli/          # evolve binary
bindings/
  python/              # PyO3 + maturin
  typescript/          # napi-rs

Plan

The full implementation plan and design decisions live at docs/plans/2026-04-23-evolve-validation-implementation.md.

Commit message style

Conventional-ish: feat(scope): short subject / fix(...) / test(...). See git log for examples.