Skip to content

lemmas

reflexion

NORTHTEKDevs/lemmas

`reflexion` -- Iterative Self-Improvement¶

Shinn et al. 2023, arxiv:2303.11366 (Northeastern + MIT + Princeton)

A loop: attempt -> critique -> retry. Different from CoVe (one-shot plan + verify + revise), Reflexion iterates until a critic accepts the answer or a max-iterations cap is hit. The critique becomes part of the next prompt so the model learns from its mistakes within the same conversation.

from lemmas import reflexion, programmatic_critic

def run_tests(candidate: str) -> tuple[bool, str]:
    # Save the candidate code, run pytest, return (passed, feedback).
    ...

r = reflexion(complete, query="Write fizzbuzz in Python.",
               critic=programmatic_critic(run_tests),
               max_iterations=4)
print(r.final)
print("iterations:", r.iterations, "passed:", r.passed)

Critic factories¶

Factory	Use case
`llm_critic(complete, rubric)`	LLM-as-judge. Passes when verdict contains "PASS".
`programmatic_critic(fn)`	Wrap your own `(candidate) -> (passed, feedback)`.
`json_schema_critic(schema)`	Validate the candidate is JSON matching a schema.

When Reflexion beats `best_of_n`¶

When you have a verifiable signal: unit tests, exact-match against a known answer, a JSON schema, a deterministic checker. In those cases Reflexion is strictly stronger because the critic's feedback actually guides the retry -- best-of-N samples blindly.

Cost¶

Up to max_iterations attempts + critic calls. Stops early on first pass.

Schema-validated structured output¶

from lemmas import reflexion, json_schema_critic

schema = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
}
r = reflexion(complete,
               query="Generate JSON for a person named Alice, 32 years old.",
               critic=json_schema_critic(schema),
               max_iterations=3)