Introduction
Evolve is a drop-in tool that learns which AI-assistant configuration works best for your codebase, automatically, via champion-vs-challenger A/B tests.
You keep using Claude Code, Cursor, or Aider normally. Every session, Evolve
collects implicit signals (tests pass/fail, you typed /clear, you accepted
the diff) and occasionally generates a slightly different configuration to try
against your current champion. When the Bayesian posterior probability that the
challenger outperforms the champion crosses 95%, Evolve promotes the challenger. Otherwise
nothing changes.
Why this exists
You already tweak your CLAUDE.md, .cursorrules, and aider.conf.yml by
hand, guessing what works. Evolve turns that guessing into an empirical test
anchored to your sessions.
What it is not
- Not a rewrite of the tool you use. Your Claude Code / Cursor / Aider setup keeps running exactly as before.
- Not a cloud service. Everything runs locally in a single SQLite file at ~/.evolve/evolve.db. No telemetry, no outbound calls except the occasional mutation prompt (to your configured LLM provider).
- Not an opinionated "best prompt" catalog. Evolve only knows what your sessions say works.
Install
cargo install evolve-cli # or: pip install evolveai / npm install evolveai
evolve init claude-code # in your project root
That's it. Go back to coding.
Quickstart
Pick your tool:
What happens when you run evolve init
- Evolve detects the tool's config files in your project.
- Installs a session hook so sessions report back on completion.
- Writes an initial managed section inside your existing config files (bracketed by <!-- evolve:start --> / <!-- evolve:end --> markers; nothing outside those markers is ever touched).
- Records the project + its starting champion config in ~/.evolve/evolve.db.
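To make the "nothing outside the markers is touched" guarantee concrete, here is a minimal Python sketch of a marker-bounded rewrite. The function name and the append-when-missing behavior are assumptions for illustration, not Evolve's actual adapter code.

```python
START = "<!-- evolve:start -->"
END = "<!-- evolve:end -->"


def replace_managed_section(text: str, new_body: str) -> str:
    """Return `text` with only the content between the markers replaced.

    If either marker is missing, a fresh managed section is appended
    and the existing content is left as it was (assumed behavior).
    """
    block = f"{START}\n{new_body.strip()}\n{END}"
    start = text.find(START)
    end = text.find(END, start + len(START)) if start != -1 else -1
    if start == -1 or end == -1:
        sep = "" if text.endswith("\n") or not text else "\n"
        return f"{text}{sep}{block}\n"
    # Keep everything before the start marker and after the end marker.
    return text[:start] + block + text[end + len(END):]


original = (
    "# My rules\n\n"
    "<!-- evolve:start -->\nold prefix\n<!-- evolve:end -->\n\n"
    "Keep this.\n"
)
updated = replace_managed_section(original, "new prefix")
print(updated)
```

Text before and after the markers comes back byte-for-byte, which is the property the managed-section convention relies on.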
Use your tool normally from here. After ~100 sessions (or 7 days, whichever comes first), Evolve will propose its first challenger.
Inspecting progress
evolve status # one-line per project: last session, current champion
evolve list # all known projects
evolve dashboard # opens http://localhost:8787 with a live UI
Quickstart: Claude Code
cd ~/projects/my-app
evolve init claude-code
That's the whole setup. Confirmation:
cat .claude/settings.json | jq '.hooks.Stop'
cat CLAUDE.md | grep -A 20 "evolve:start"
You should see:
- A Stop hook entry calling evolve record-claude-code <transcript-path>.
- A managed section in CLAUDE.md bracketed by evolve:start / evolve:end.
Now just use Claude Code. Every time you finish a session, the hook fires and Evolve captures:
- Whether you typed /clear early (counted as negative).
- Whether you ran cargo test / pytest / npm test, and the exit code.
- Whether you said "redo"/"wrong"/"perfect"/"thanks" in any message.
- (Optional) Your explicit grade via evolve good / evolve bad.
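As a rough illustration of how these implicit signals could be derived from a finished session, here is a small Python sketch. The signal names, the regexes, and the 0/1 scoring are assumptions made for the example; the real extraction logic may differ.

```python
import re

# Hypothetical patterns for the feedback words listed above.
NEGATIVE = re.compile(r"\b(redo|wrong)\b", re.IGNORECASE)
POSITIVE = re.compile(r"\b(perfect|thanks)\b", re.IGNORECASE)


def implicit_signals(messages, cleared_early, test_exit_code):
    """Collect (name, score) pairs for one session.

    Scores are 1.0 for a positive signal and 0.0 for a negative one.
    `test_exit_code` is None when no test command was run.
    """
    signals = []
    if cleared_early:
        signals.append(("cleared_early", 0.0))
    if test_exit_code is not None:
        signals.append(("tests", 1.0 if test_exit_code == 0 else 0.0))
    for msg in messages:
        if NEGATIVE.search(msg):
            signals.append(("feedback", 0.0))
        elif POSITIVE.search(msg):
            signals.append(("feedback", 1.0))
    return signals


session = implicit_signals(
    ["That's wrong, redo it", "perfect, thanks"],
    cleared_early=False,
    test_exit_code=0,
)
print(session)
```

Keyword matching like this is noisy by nature, which is why an explicit grade carries more weight.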
Marking a session as good/bad
If Evolve guessed wrong about a session, override it:
evolve good # marks the most recent session as a win
evolve bad # ... or a loss
Explicit signals are weighted 5× implicit by default. One explicit grade typically overrides the noise of regex-matched "feedback."
Quickstart: Cursor
cd ~/projects/my-nextjs-app
evolve init cursor
This writes a managed section into .cursorrules. Cursor doesn't expose a
session-end hook, so signal capture requires Evolve's proxy:
evolve proxy --for cursor & # binds http://localhost:7777
Point Cursor's Settings → Models → Custom OpenAI Base URL at
http://localhost:7777. Evolve now sees every request/response pair and
records a suggestion_accepted or suggestion_rejected signal based on
whether you keep the generated text.
Alternative
If you don't want to route through a proxy, just use evolve good / evolve bad after each meaningful session. The explicit signals alone are enough to
drive evolution (just slower).
Quickstart: Aider
cd ~/projects/my-python-app
evolve init aider
This:
- Writes a managed section into aider.conf.yml (created if missing).
- Installs a git post-commit hook that calls evolve record-aider HEAD.
Now every aider commit fires the hook and records an
aider_commit_observed signal. Explicit evolve good / evolve bad still
applies.
Per-project test commands
If you want richer implicit signals (test pass/fail per commit), set:
# aider.conf.yml
test-cmd: pytest -q
lint-cmd: ruff check .
Aider runs these itself. Evolve does not read their exit codes yet; a future version is planned to pick them up from the git hook context.
What to expect after install
The honest timeline. Days assume ~5 Claude Code sessions per day on the same project.
Day 0 — install
cargo install evolve-cli
cd ~/projects/my-app
evolve init claude-code
evolve doctor
evolve doctor should show [OK] for everything except experiment running (none yet) and sessions recorded (0 of 20).
Days 1-4 — accumulating
Use Claude Code normally. After every session, the Stop hook fires evolve record-claude-code automatically. You won't notice it.
evolve doctor after a few days shows:
[INFO] sessions recorded 12 (8 more before challenger generation)
[INFO] experiment running none yet
This is expected. Evolve refuses to make decisions on too little data — that's a feature, not a bug.
Day 5 — first challenger
When session count hits 20, the next evolve record-claude-code call generates a challenger config and starts an experiment:
Recorded session 8e72...
Generated challenger 4f1c... in experiment a93b...
From this point forward, the SessionStart hook on Claude Code re-rolls the deployed config per session at 50/50 traffic share. About half your sessions run on the champion, half on the challenger.
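The per-session re-roll amounts to a coin flip at session start. A minimal Python sketch, assuming a hypothetical pick_arm helper and a configurable challenger share (0.5 for the 50/50 split described above):

```python
import random


def pick_arm(rng: random.Random, challenger_share: float = 0.5) -> str:
    """Choose which config a new session runs on."""
    return "challenger" if rng.random() < challenger_share else "champion"


rng = random.Random(42)  # seeded only to make the demo repeatable
arms = [pick_arm(rng) for _ in range(1000)]
print(arms.count("champion"), arms.count("challenger"))
```

Because each session is assigned independently, the split is only approximately even over short runs.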
Days 5-12 — A/B testing
evolve doctor now shows:
[OK] experiment running started 2026-04-29T10:14:22Z
Open the dashboard:
evolve dashboard
The "Success rate over time" panel shows two stacked bars per day: champion vs. challenger. The longer the bars, the more sessions on that day; the brighter the green, the better the average aggregate score.
After ~40 sessions split between the arms, every new session-end runs the Bayesian decision check. It will say one of:
- experiment needs more data (default debug log): keep going
- experiment holding at posterior 0.62: challenger isn't winning yet
- Promoted challenger 4f1c... (posterior 0.96): done, swap
Day ~12 — first promotion
When the posterior crosses 0.95, the challenger gets promoted to champion. The dashboard's "Promotion log" shows it. Your CLAUDE.md managed section is updated to the new champion's prompt prefix.
If the challenger never crosses 0.95, the experiment runs indefinitely until you run evolve roll manually to try a different mutation, or until the success rates diverge enough for the decision check to fire on its own.
Cadence going forward
After the first promotion, the cycle repeats: 20 more sessions → another challenger → another A/B → another decision.
In practice: 1-3 promotions per month is realistic for a single active project. The compounding effect is small per generation but adds up.
When evolve doctor shows things wrong
Common patterns:
| What you see | What it means |
|---|---|
| [MISS] Stop hook installed | evolve init didn't finish or was reverted. Re-run evolve init claude-code. |
| [WARN] LLM available: no Anthropic key + no Ollama | Mutator runs without LLM rewrite. Set ANTHROPIC_API_KEY or run Ollama for richer mutations. |
| [INFO] sessions recorded: 0 after using Claude Code | Hook isn't firing. Check .claude/settings.json for the Stop hook entry. |
| [INFO] experiment running: none yet after 20+ sessions | Either the LLM is unreachable AND the rule-based mutators happened to fail, or the should_evolve threshold logic differs. Run evolve roll to force generation. |
evolve doctor is supposed to give you these answers without you needing to ask anyone. If it doesn't, file a bug — that's the tool's job and a doctor that misses things is a defect.
How Evolution Works
The loop
+-- (session N occurs) --+
|                        v
|                 signals recorded
|                        |
|                        v
|              enough sessions/time?
|                 /            \
|               no              yes
|               |                |
|               |        generate challenger
|               v                |
+------ keep champion     A/B test starts
                                 |
                                 v
                 P(challenger > champion) >= 0.95?
                          /            \
                        yes             no
                         |               |
                         v               v
                promote challenger   hold / keep running
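The same loop can be read as a small decision function evaluated at the end of every session. The Python sketch below is illustrative only: State, next_action, and the constant names are hypothetical, the thresholds are the defaults quoted elsewhere in this document (20 sessions before a challenger, 20 sessions per arm, posterior 0.95), and the time-based branch of "enough sessions/time?" is left out.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SESSIONS_FOR_CHALLENGER = 20
MIN_SESSIONS_PER_ARM = 20
PROMOTION_THRESHOLD = 0.95


@dataclass
class State:
    sessions_since_promotion: int
    experiment_running: bool
    champion_sessions: int = 0
    challenger_sessions: int = 0
    posterior: Optional[float] = None  # P(challenger > champion)


def next_action(s: State) -> str:
    """Map the current state to one step of the loop in the diagram."""
    if not s.experiment_running:
        if s.sessions_since_promotion >= MIN_SESSIONS_FOR_CHALLENGER:
            return "generate_challenger"
        return "keep_champion"
    enough = min(s.champion_sessions, s.challenger_sessions) >= MIN_SESSIONS_PER_ARM
    if not enough or s.posterior is None:
        return "needs_more_data"
    if s.posterior >= PROMOTION_THRESHOLD:
        return "promote_challenger"
    return "hold"


print(next_action(State(12, False)))               # keep_champion
print(next_action(State(0, True, 25, 22, 0.96)))   # promote_challenger
```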
Sessions → fitness
Each session's signals are collapsed into one fitness score in [0, 1]:
- Implicit signals (tests passed, user accepted, no /clear) each contribute with weight 1.
- Explicit signals (evolve good / evolve bad) each contribute with weight 5.
- The weighted mean is clamped to [0, 1].
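The weighting rule above works out as follows in a minimal Python sketch. The function name and the convention that each signal is scored 0.0 or 1.0 are assumptions for illustration.

```python
def fitness(implicit, explicit, implicit_weight=1.0, explicit_weight=5.0):
    """Weighted mean of signal scores, clamped to [0, 1].

    `implicit` and `explicit` are lists of per-signal scores.
    Returns None when the session produced no signals at all.
    """
    total_weight = implicit_weight * len(implicit) + explicit_weight * len(explicit)
    if total_weight == 0:
        return None
    weighted_sum = implicit_weight * sum(implicit) + explicit_weight * sum(explicit)
    return min(1.0, max(0.0, weighted_sum / total_weight))


# Tests passed, diff accepted, but /clear was typed, then `evolve bad`:
print(fitness([1.0, 1.0, 0.0], [0.0]))  # 0.25
```

In this example two of three implicit signals are positive, yet the single explicit grade pulls the score to 0.25, below the 0.5 win/loss cut-off.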
Fitness → promotion
Scores are binarized at 0.5 (win / loss) and fed into a Beta-binomial
posterior per arm:
Champion ~ Beta(1 + champ_wins, 1 + champ_losses)
Challenger ~ Beta(1 + chall_wins, 1 + chall_losses)
A 10,000-sample Monte Carlo estimates P(challenger > champion). When that
exceeds 0.95 and both arms have at least 20 sessions, the challenger is
promoted.
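A minimal Python sketch of this estimate, using only the standard library. The function names are hypothetical and the seed is fixed purely for repeatability; the actual engine lives in evolve-core and may be implemented differently.

```python
import random


def prob_challenger_beats_champion(champ_wins, champ_losses,
                                   chall_wins, chall_losses,
                                   samples=10_000, seed=0):
    """Monte Carlo estimate of P(challenger > champion).

    Each arm gets a Beta(1 + wins, 1 + losses) posterior (uniform prior).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        champ = rng.betavariate(1 + champ_wins, 1 + champ_losses)
        chall = rng.betavariate(1 + chall_wins, 1 + chall_losses)
        if chall > champ:
            hits += 1
    return hits / samples


def should_promote(champ_wins, champ_losses, chall_wins, chall_losses,
                   threshold=0.95, min_per_arm=20):
    """Promote only when both arms have enough sessions and P exceeds the threshold."""
    enough = (champ_wins + champ_losses >= min_per_arm
              and chall_wins + chall_losses >= min_per_arm)
    if not enough:
        return False
    p = prob_challenger_beats_champion(champ_wins, champ_losses,
                                       chall_wins, chall_losses)
    return p > threshold


print(prob_challenger_beats_champion(10, 10, 10, 10))  # close to 0.5
print(should_promote(8, 12, 18, 2))
```

With identical records the estimate hovers near 0.5; it only approaches 1 when the challenger's win rate is clearly higher.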
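Why a 0.95 posterior rather than a classical frequentist A/B test? A Bayesian posterior can be re-evaluated after every session and still reads as a direct probability that the challenger is better, whereas repeatedly checking a fixed-sample frequentist test inflates its false-positive rate. The 20-session minimum per arm limits promotions driven by early noise.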
Architecture
Evolve is a Cargo workspace of eight small crates:
| Crate | Responsibility |
|---|---|
| evolve-core | AgentConfig types, schema DSL, mutation operators' trait, Bayesian promotion math. Zero I/O. |
| evolve-storage | SQLite via sqlx. Migrations, repositories for projects/configs/experiments/sessions/signals. |
| evolve-llm | Minimal LLM client (Haiku + Ollama) for occasional challenger generation. |
| evolve-mutators | Five mutation operators + weighted picker. |
| evolve-adapters | Adapter trait + ClaudeCodeAdapter / CursorAdapter / AiderAdapter. |
| evolve-proxy | OpenAI-compat HTTP proxy for Cursor-like tools. |
| evolve-dashboard | Local-only axum server serving a tiny HTML SPA + REST API. |
| evolve-cli | User-facing binary (evolve). |
Python and TypeScript bindings (under bindings/) re-export the math engine
via PyO3 and napi-rs.
Where data lives
- ~/.evolve/evolve.db: SQLite with everything. Single file, easy to back up or delete.
- Adapter-managed files (CLAUDE.md, .cursorrules, aider.conf.yml): only the managed section between markers is ever touched.
- Hooks: .claude/settings.json (Stop hook), .git/hooks/post-commit.
Layering
evolve-cli
|
+-- evolve-adapters -- evolve-core (AgentConfig)
|
+-- evolve-storage -- evolve-core (ids)
|
+-- evolve-mutators -- evolve-core + evolve-llm
|
+-- evolve-proxy
|
+-- evolve-dashboard -- evolve-storage
evolve-core is the foundation: zero runtime dependencies, safe for
bindings to call synchronously. Everything else builds upward.
Cost & Privacy
Cost
Evolve is local-only. The only outbound calls are the occasional mutation prompts to your configured LLM. At the default cadence (roughly one challenger per project per week), using Haiku:
- Per call: ~500 input + 200 output tokens ≈ $0.000375
- Per project / month: ~$0.002
Use Ollama instead of Haiku and the cost is $0. The proxy (Cursor fallback) forwards to your existing OpenAI/Anthropic key — it adds no additional cost.
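The arithmetic behind those figures, as a short Python sketch. The per-token rates ($0.25 per million input tokens, $1.25 per million output tokens) are an assumption chosen because they reproduce the per-call number above; check your provider's current pricing before relying on them.

```python
# Assumed Haiku-class rates in USD per million tokens (not from Evolve itself).
INPUT_USD_PER_MTOK = 0.25
OUTPUT_USD_PER_MTOK = 1.25


def call_cost(input_tokens=500, output_tokens=200):
    """Cost in USD of one mutation prompt."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


per_call = call_cost()
per_month = per_call * 52 / 12  # roughly one challenger per week

print(f"${per_call:.6f} per call")
print(f"${per_month:.4f} per project per month")
```

At roughly 4.3 calls per month this lands a little under $0.002, matching the rounded figure above.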
Privacy
- Storage: a single local SQLite file. No cloud, no telemetry.
- Signals DB: only metadata + numeric scores are persisted. No source code in the signals.payload_json column; the storage layer rejects payloads that look code-like (fn, def, class, SQL statements, etc.).
- Prompts: not logged unless you pass --include-prompts on evolve init.
- Outbound HTTP: only to the LLM provider you configure (for mutation generation). No analytics, no pings.
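To give a feel for what "rejects payloads that look code-like" could mean in practice, here is a hedged Python sketch of such a guard. The patterns and function names are invented for illustration; the real check in evolve-storage is written in Rust and its exact heuristics are not documented here.

```python
import json
import re

# Hypothetical heuristics: language keywords followed by an identifier,
# plus a few common SQL statement shapes.
KEYWORDS = re.compile(r"\b(?:fn|def|class)\s+[A-Za-z_]\w*")
SQL = re.compile(
    r"\b(?:select\s.+?\sfrom|insert\s+into|update\s+\w+\s+set|delete\s+from)\b",
    re.IGNORECASE | re.DOTALL,
)


def looks_code_like(text: str) -> bool:
    return bool(KEYWORDS.search(text) or SQL.search(text))


def _strings(value):
    """Yield every string nested anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from _strings(v)
    elif isinstance(value, (list, tuple)):
        for v in value:
            yield from _strings(v)


def validate_payload(payload: dict) -> str:
    """Serialize a signal payload, refusing anything that resembles code."""
    for s in _strings(payload):
        if looks_code_like(s):
            raise ValueError("payload looks like source code; refusing to store")
    return json.dumps(payload, sort_keys=True)


print(validate_payload({"signal": "tests", "exit_code": 0}))
```

A guard like this errs toward rejection: ordinary prose such as "class of bugs" would also trip it, which is an acceptable trade-off when the goal is to keep source code out of the database.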
Opting out
- evolve forget <project-id> removes all Evolve rows and restores the unmanaged versions of your config files.
- evolve forget --all wipes everything.
- Deleting ~/.evolve/ is always safe.
FAQ
How long before I see my first promotion?
Default thresholds require at least 20 sessions per arm before any decision, and generally ~100 combined sessions before a clear winner emerges. If you use your AI tool ~5 sessions per day, expect your first promotion around day 10-20.
What if the challenger is obviously worse?
The posterior calculation rejects it: a worse challenger never reaches the 0.95 promotion threshold. Experiments whose posterior stays below 0.05
for an extended period are candidates for manual abort (evolve abort <experiment-id>).
Can I force a challenger right now?
Yes: evolve roll. This skips the normal schedule and immediately generates
and deploys a challenger with 5% traffic share.
Does Evolve modify files I didn't ask it to?
No. Every file Evolve writes into is bracketed by evolve:start / evolve:end
markers and only the content between those markers is owned by Evolve.
What if I hate the challenger?
Running evolve bad after a session served by the challenger lowers the challenger's
posterior right away. After a few bad signals the challenger is held
indefinitely; the experiment can be aborted from the dashboard.
Why does it take a Haiku API call to generate challengers?
Because prompt variations that sound natural need an LLM. The other four
mutators (rules, response style, model pref, permissions) run locally with
no LLM calls. By default LlmRewriteMutator is picked with 50% weight, so
half of all challengers cost zero.
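A weighted pick of this kind can be sketched in a few lines of Python. Only the 50% weight for LlmRewriteMutator comes from the text above; the names of the four local mutators and their even 12.5% split are assumptions for the example.

```python
import random

MUTATOR_WEIGHTS = {
    "LlmRewriteMutator": 0.5,  # stated default weight
    # Hypothetical names and an assumed even split for the local mutators:
    "rules": 0.125,
    "response_style": 0.125,
    "model_pref": 0.125,
    "permissions": 0.125,
}


def pick_mutator(rng: random.Random, weights=MUTATOR_WEIGHTS) -> str:
    """Pick one mutator name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(7)  # seeded only to make the demo repeatable
picks = [pick_mutator(rng) for _ in range(10_000)]
llm_share = picks.count("LlmRewriteMutator") / len(picks)
print(f"LLM-backed challengers: {llm_share:.1%}")
```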
Contributing
The repo is a standard Cargo workspace. Clone, then:
cargo test --workspace
cargo fmt --all -- --check
cargo clippy --workspace --all-targets -- -D warnings
cargo llvm-cov --workspace --summary-only # coverage gate
Layout
crates/
  evolve-core/        # math, IDs, AgentConfig
  evolve-storage/     # SQLite
  evolve-llm/         # LLM clients (Anthropic + Ollama)
  evolve-mutators/    # mutation operators
  evolve-adapters/    # per-tool integration
  evolve-proxy/       # OpenAI-compat proxy
  evolve-dashboard/   # local web UI
  evolve-cli/         # evolve binary
bindings/
  python/             # PyO3 + maturin
  typescript/         # napi-rs
Plan
The full implementation plan and design decisions live at
docs/plans/2026-04-23-evolve-validation-implementation.md.
Commit message style
Conventional-ish: feat(scope): short subject / fix(...) / test(...).
See git log for examples.