eval_suite()
Evaluate a persona across multiple models (model comparison matrix).
Usage
eval_suite(
    persona,
    *,
    models,
    queries=None,
    dimensions=None,
    judge=None,
    default_guards=True,
    scorecard_path=None
)

Creates a variant for each model, runs the persona's test queries (or explicit queries) through each one, scores with a judge, and returns a combined EvalResults where each variant is named after its model string.
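For instance, passing explicit queries replaces the persona's built-in test_queries for a quick targeted run (a minimal sketch using only the parameters documented below):

import talk_box as tb

# Explicit queries take the place of the persona's test_queries
results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6"],
    queries=["Review this function for SQL injection risks."],
)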
Parameters
persona: str
    Persona name to evaluate (e.g., "code_reviewer").

models: list[str]
    List of provider:model strings (e.g., ["anthropic:claude-sonnet-4-6", "github:gpt-4o"]).

queries: list[str | EvalCase] | None = None
    Queries to evaluate. Falls back to the persona's test_queries.

dimensions: list[EvalDimension] | None = None
    Scoring dimensions. Defaults to relevance, safety, and instruction_adherence.

judge: str | ChatBot | None = None
    Judge model string or ChatBot instance.

default_guards: bool = True
    Whether to apply the persona's default guards (passed through to persona_pack()).

scorecard_path: str | Path | None = None
    If provided, writes the scorecard JSON to this path after evaluation.
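A sketch combining the optional keyword parameters (the values here are illustrative):

import talk_box as tb

results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    default_guards=False,  # skip the persona's default guards
    scorecard_path="scorecards/code_reviewer.json",  # scorecard written after the run
)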
Returns
EvalResults
    Combined results with one variant per model.
Raises
ValueError
    If models is empty.
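A minimal sketch of the failure mode (the exact error message is not documented here):

import talk_box as tb

try:
    tb.eval_suite("code_reviewer", models=[])
except ValueError:
    print("models must contain at least one provider:model string")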
Examples
Compare a persona across two providers:
import talk_box as tb

results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    judge="anthropic:claude-sonnet-4-6",
)
results.to_scorecard("scorecards/code_reviewer.json")
results.to_great_table()
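Score against custom dimensions (a sketch only: the EvalDimension constructor arguments shown are assumptions for illustration, not confirmed by this reference):

import talk_box as tb

# Assumed EvalDimension signature; adjust to the actual constructor.
dims = [
    tb.EvalDimension(name="relevance"),
    tb.EvalDimension(name="tone"),
]
results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6"],
    dimensions=dims,
)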