How Evolution Works
The loop
    +--- (session N occurs) ---+
    |                          v
    |                  signals recorded
    |                          |
    |                          v
    |                enough sessions/time?
    |                   /             \
    |                no                 yes
    |                |                   |
    |                |          generate challenger
    |                v                   |
    +--------- keep champion      A/B test starts
                                         |
                                         v
                         P(challenger > champion) >= 0.95?
                                   /           \
                            yes                     no
                             |                       |
                             v                       v
                    promote challenger      hold / keep running
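The branching in the loop can be sketched as a small decision function. This is a minimal illustration, not the project's actual code: `Action`, `next_action`, and the parameters `test_running` and `enough_data` are hypothetical names, and how "enough sessions/time" is measured is left to the caller because the diagram does not specify it.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP_CHAMPION = "keep champion"
    START_AB_TEST = "generate challenger, start A/B test"
    PROMOTE = "promote challenger"
    HOLD = "hold / keep running"


def next_action(
    test_running: bool,
    enough_data: bool,
    p_challenger_better: Optional[float] = None,
    threshold: float = 0.95,
) -> Action:
    """Pick the next step after a session's signals are recorded.

    Hypothetical helper: the inputs are assumed to be computed elsewhere.
    """
    if not test_running:
        # Upper half of the diagram: wait, or spawn a challenger.
        return Action.START_AB_TEST if enough_data else Action.KEEP_CHAMPION
    # Lower half: an A/B test is in progress.
    if p_challenger_better is not None and p_challenger_better >= threshold:
        return Action.PROMOTE
    return Action.HOLD


print(next_action(test_running=False, enough_data=False).value)
print(next_action(test_running=True, enough_data=True, p_challenger_better=0.97).value)
```

The sketch covers only the probability check shown in the diagram; any additional promotion conditions would be layered on top by the caller.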
Sessions → fitness
Each session's signals are collapsed into one fitness score in [0, 1]:
- Implicit signals (tests passed, user accepted, no /clear) each contribute with weight 1.
- Explicit signals (evolve good / evolve bad) each contribute with weight 5.
- The weighted mean is clamped to [0, 1].
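A minimal sketch of that weighting follows. It assumes each signal is encoded as 1.0 when positive (for example, tests passed or evolve good) and 0.0 when negative (for example, /clear was used or evolve bad); the encoding, the function name `session_fitness`, and the choice to return `None` for a session with no signals are illustrative assumptions, not documented behavior.

```python
from typing import Optional, Sequence

IMPLICIT_WEIGHT = 1.0  # weight per implicit signal, from the text above
EXPLICIT_WEIGHT = 5.0  # weight per explicit signal, from the text above


def session_fitness(
    implicit: Sequence[float], explicit: Sequence[float]
) -> Optional[float]:
    """Weighted mean of signal values, clamped to [0, 1].

    Assumption: each value is 1.0 (positive) or 0.0 (negative).
    Assumption: a session with no signals has no score (None).
    """
    total_weight = IMPLICIT_WEIGHT * len(implicit) + EXPLICIT_WEIGHT * len(explicit)
    if total_weight == 0:
        return None
    weighted_sum = IMPLICIT_WEIGHT * sum(implicit) + EXPLICIT_WEIGHT * sum(explicit)
    mean = weighted_sum / total_weight
    # The clamp only matters if a signal value falls outside [0, 1].
    return min(1.0, max(0.0, mean))


# Tests passed, user accepted, /clear was used, then an explicit "evolve bad":
# weighted sum = 1 + 1 + 0 + 5*0 = 2, total weight = 3*1 + 1*5 = 8.
print(session_fitness([1.0, 1.0, 0.0], [0.0]))  # 0.25
```

With these weights a single explicit signal outweighs up to four implicit ones, which is why one evolve bad pulls this session well below 0.5 despite two positive implicit signals.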
Fitness → promotion
Scores are binarized at 0.5 (win / loss) and fed into a Beta-binomial
posterior per arm:
    Champion   ~ Beta(1 + champ_wins, 1 + champ_losses)
    Challenger ~ Beta(1 + chall_wins, 1 + chall_losses)
A 10,000-sample Monte Carlo simulation estimates P(challenger > champion).
When that estimate is at least 0.95 and both arms have at least 20 sessions,
the challenger is promoted.
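The promotion check can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the project's implementation: the function names are hypothetical, a score of exactly 0.5 is counted as a win (the text does not say which side the cutoff falls on), and the threshold comparison uses `>=` as in the loop diagram.

```python
import random
from typing import Optional, Sequence, Tuple


def binarize(scores: Sequence[float], cutoff: float = 0.5) -> Tuple[int, int]:
    """Return (wins, losses). Assumption: a score equal to the cutoff is a win."""
    wins = sum(1 for s in scores if s >= cutoff)
    return wins, len(scores) - wins


def prob_challenger_better(
    champ: Tuple[int, int],
    chall: Tuple[int, int],
    samples: int = 10_000,
    rng: Optional[random.Random] = None,
) -> float:
    """Monte Carlo estimate of P(challenger > champion) under Beta(1+w, 1+l) posteriors."""
    rng = rng or random.Random()
    champ_wins, champ_losses = champ
    chall_wins, chall_losses = chall
    hits = 0
    for _ in range(samples):
        theta_chall = rng.betavariate(1 + chall_wins, 1 + chall_losses)
        theta_champ = rng.betavariate(1 + champ_wins, 1 + champ_losses)
        hits += theta_chall > theta_champ
    return hits / samples


def should_promote(
    champ_scores: Sequence[float],
    chall_scores: Sequence[float],
    threshold: float = 0.95,
    min_sessions: int = 20,
    rng: Optional[random.Random] = None,
) -> bool:
    """Promote only if both arms have enough sessions and the estimate clears the threshold."""
    if min(len(champ_scores), len(chall_scores)) < min_sessions:
        return False
    p = prob_challenger_better(binarize(champ_scores), binarize(chall_scores), rng=rng)
    return p >= threshold


# Made-up scores: champion wins 8 of 20 sessions, challenger wins 18 of 20.
champion = [0.8] * 8 + [0.2] * 12
challenger = [0.8] * 18 + [0.2] * 2
print(should_promote(champion, challenger, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the estimate reproducible; with 10,000 samples the Monte Carlo error near 0.95 is roughly 0.002, so borderline cases can still flip between runs.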
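Why a 0.95 posterior threshold and not a straight frequentist A/B test? Bayesian posteriors let us peek: the posterior can be rechecked after every session, whereas a fixed-sample frequentist test inflates its false-positive rate each time it is checked early.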