best_of_n

best_of_n is the natural companion to self_consistency. Where self-consistency picks an answer by voting across samples, best-of-N scores each sample with a scorer and keeps the highest-scoring one:

from lemmas import best_of_n, llm_judge_scorer

# `complete` and `judge_complete` are your own completion callables, defined elsewhere.
scorer = llm_judge_scorer(
    judge_complete,
    rubric="Rate this poem 0-10 on imagery, meter, and surprise. Reply with only the number.",
)
r = best_of_n(
    complete,
    messages=[{"role": "user", "content": "Write a haiku about Anchorage."}],
    scorer=scorer,
    n=5,
)
print(r.answer, r.score)  # the winning candidate and its score
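To make the mechanics concrete, here is a minimal sketch of the sample, score, select loop that best-of-N describes. It is not the library's implementation: `best_of_n_sketch`, `Result`, and `fake_complete` are hypothetical names, and the sketch assumes a completion callable that takes a list of messages and returns a string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Message = Dict[str, str]


@dataclass
class Result:
    answer: str
    score: float


def best_of_n_sketch(
    complete: Callable[[List[Message]], str],  # assumed signature, not confirmed by the docs
    messages: List[Message],
    scorer: Callable[[str], float],
    n: int = 5,
) -> Result:
    """Sample n candidates, score each, and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [complete(messages) for _ in range(n)]  # N model calls
    scores = [scorer(c) for c in candidates]             # N scorer calls
    best = max(range(n), key=lambda i: scores[i])
    return Result(answer=candidates[best], score=scores[best])


# Stand-in "model" that returns canned drafts in order (illustration only).
_drafts = iter(["snow", "snow on the inlet", "moose tracks in fresh snow"])


def fake_complete(messages: List[Message]) -> str:
    return next(_drafts)


r = best_of_n_sketch(
    fake_complete,
    messages=[{"role": "user", "content": "Write a haiku about Anchorage."}],
    scorer=lambda text: float(len(text)),  # toy scorer: longer is better
    n=3,
)
print(r.answer, r.score)
```

The toy scorer only exists to make the selection deterministic; in practice the scorer is where all the judgment lives.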

When to use which

| | self_consistency | best_of_n |
|---|---|---|
| Task has a discrete answer | voting | overkill |
| Task is open-ended | no token to vote on | score each |
| You have a reward model | unused | plug in as scorer |
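The difference between the two selection rules can be illustrated without any model calls. The sketch below uses hypothetical helpers (`majority_vote`, `pick_best`) and made-up sample data; it only shows why voting needs a discrete answer while scoring does not.

```python
from collections import Counter


def majority_vote(answers):
    """Self-consistency-style selection: the most common answer wins."""
    return Counter(answers).most_common(1)[0][0]


def pick_best(candidates, scorer):
    """Best-of-N-style selection: the highest-scoring candidate wins."""
    return max(candidates, key=scorer)


# Discrete answers: samples repeat the same token, so voting works.
math_answers = ["42", "42", "41", "42", "40"]
voted = majority_vote(math_answers)

# Open-ended text: every sample is distinct, so no candidate gets more than one vote.
haikus = [
    "Snow on the inlet",
    "Moose at the bus stop, waiting",
    "Long dusk over Cook Inlet",
]
top_count = Counter(haikus).most_common(1)[0][1]

# A scorer still produces an ordering. Toy scorer (an assumption): number of distinct words.
best = pick_best(haikus, scorer=lambda text: float(len(set(text.lower().split()))))
print(voted, top_count, best)
```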

Built-in scorer factories

| Factory | What it does |
|---|---|
| llm_judge_scorer(complete, rubric) | LLM-as-judge; parses the first number from the verdict. |
| length_scorer(target_chars=500) | Prefers responses near a target length. |
| keyword_scorer(keywords, case_sensitive=False) | +1 per keyword present. |
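For intuition, the behaviors described above could be approximated as follows. These are hypothetical re-implementations, not the library's code: the regex, the substring matching, and the negative-distance length score are all assumptions chosen for illustration.

```python
import re
from typing import Callable, Iterable


def parse_first_number(verdict: str, default: float = 0.0) -> float:
    """Return the first integer or decimal found in a judge's reply."""
    match = re.search(r"\d+(?:\.\d+)?", verdict)
    return float(match.group()) if match else default


def keyword_scorer_sketch(
    keywords: Iterable[str], case_sensitive: bool = False
) -> Callable[[str], float]:
    """+1 per keyword present (substring match is an assumption)."""
    kws = list(keywords)
    if not case_sensitive:
        kws = [k.lower() for k in kws]

    def score(text: str) -> float:
        haystack = text if case_sensitive else text.lower()
        return float(sum(1 for k in kws if k in haystack))

    return score


def length_scorer_sketch(target_chars: int = 500) -> Callable[[str], float]:
    """Higher (closer to zero) the nearer the text is to the target length."""

    def score(text: str) -> float:
        return -float(abs(len(text) - target_chars))

    return score


print(parse_first_number("Score: 8.5 out of 10"))
print(keyword_scorer_sketch(["snow", "moose"])("Snow and MOOSE"))
print(length_scorer_sketch(10)("12345678"))
```

Note that a first-number parser would read a reply such as "On a 0-10 scale: 7" as 0, which is one reason the rubric in the example asks the judge to reply with only the number.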

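You can also pass any Callable[[str], float] as the scorer: a function that takes a candidate response and returns a score, where higher is better.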
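As an illustration of a hand-written scorer, the sketch below defines a plain function with the Callable[[str], float] shape and a small helper that combines several scorers by weighted sum. The names `three_line_scorer`, `mentions_snow`, and `weighted` are hypothetical and not part of the library.

```python
from typing import Callable, Tuple

Scorer = Callable[[str], float]


def three_line_scorer(text: str) -> float:
    """1.0 if the text has exactly three non-empty lines (haiku-shaped), else 0.0."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return 1.0 if len(lines) == 3 else 0.0


def mentions_snow(text: str) -> float:
    """1.0 if the text mentions snow, else 0.0."""
    return 1.0 if "snow" in text.lower() else 0.0


def weighted(*parts: Tuple[float, Scorer]) -> Scorer:
    """Combine several scorers into one by weighted sum."""

    def score(text: str) -> float:
        return float(sum(weight * scorer(text) for weight, scorer in parts))

    return score


custom = weighted((2.0, three_line_scorer), (1.0, mentions_snow))

haiku = "Snow on the inlet\nMoose tracks cross the frozen mud\nDusk arrives at noon"
print(custom(haiku), custom("A single line about snow"))
```

A function like `custom` would then be passed with the `scorer=` keyword, as in the example at the top of the page.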

Cost

Each run makes N model calls, plus N more if your scorer is LLM-based (one judge call per candidate). The async variant, abest_of_n, runs them concurrently.
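The call count and the benefit of concurrency can be sketched with asyncio alone. This is not how abest_of_n is implemented; `abest_of_n_sketch`, `fake_acomplete`, and `fake_ajudge` are hypothetical stand-ins, and whether the real function accepts an async scorer is an assumption here.

```python
import asyncio

calls = {"model": 0, "judge": 0}


async def fake_acomplete(messages):
    """Stand-in for an async model call; counts invocations."""
    calls["model"] += 1
    idx = calls["model"]       # capture before yielding to other tasks
    await asyncio.sleep(0.01)  # simulated network latency
    return f"draft {idx}"


async def fake_ajudge(text):
    """Stand-in for an LLM-based scorer; one extra call per candidate."""
    calls["judge"] += 1
    await asyncio.sleep(0.01)
    return float(len(text))


async def abest_of_n_sketch(acomplete, messages, ascorer, n=5):
    # All N generations run concurrently, then all N judgments do.
    candidates = await asyncio.gather(*(acomplete(messages) for _ in range(n)))
    scores = await asyncio.gather(*(ascorer(c) for c in candidates))
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]


answer, score = asyncio.run(
    abest_of_n_sketch(
        fake_acomplete,
        [{"role": "user", "content": "Write a haiku about Anchorage."}],
        fake_ajudge,
        n=5,
    )
)
print(answer, score, calls)
```

With n=5 and an LLM-based scorer, the counters end at 5 model calls and 5 judge calls, while wall-clock time is roughly two round trips rather than ten.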