Skip to content

reflexion -- Iterative Self-Improvement

Shinn et al. 2023, arxiv:2303.11366 (Northeastern + MIT + Princeton)

A loop: attempt -> critique -> retry. Different from CoVe (one-shot plan + verify + revise), Reflexion iterates until a critic accepts the answer or a max-iterations cap is hit. The critique becomes part of the next prompt so the model learns from its mistakes within the same conversation.

from lemmas import reflexion, programmatic_critic

def run_tests(candidate: str) -> tuple[bool, str]:
    # Save the candidate code, run pytest, return (passed, feedback).
    ...

r = reflexion(complete, query="Write fizzbuzz in Python.",
               critic=programmatic_critic(run_tests),
               max_iterations=4)
print(r.final)
print("iterations:", r.iterations, "passed:", r.passed)

Critic factories

Factory Use case
llm_critic(complete, rubric) LLM-as-judge. Passes when verdict contains "PASS".
programmatic_critic(fn) Wrap your own (candidate) -> (passed, feedback).
json_schema_critic(schema) Validate the candidate is JSON matching a schema.

When Reflexion beats best_of_n

When you have a verifiable signal: unit tests, exact-match against a known answer, a JSON schema, a deterministic checker. In those cases Reflexion is strictly stronger because the critic's feedback actually guides the retry -- best-of-N samples blindly.

Cost

Up to max_iterations attempts + critic calls. Stops early on first pass.

Schema-validated structured output

from lemmas import reflexion, json_schema_critic

schema = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
}
r = reflexion(complete,
               query="Generate JSON for a person named Alice, 32 years old.",
               critic=json_schema_critic(schema),
               max_iterations=3)