eval_suite()

Evaluate a persona across multiple models (model comparison matrix).

Usage

eval_suite(
    persona,
    *,
    models,
    queries=None,
    dimensions=None,
    judge=None,
    default_guards=True,
    scorecard_path=None
)

Creates a variant for each model, runs the persona’s test queries (or explicit queries) through each one, scores with a judge, and returns a combined EvalResults where each variant is named after its model string.

Parameters

persona: str

Persona name to evaluate (e.g., "code_reviewer").

models: list[str]

List of provider:model strings (e.g., ["anthropic:claude-sonnet-4-6", "github:gpt-4o"]).

queries: list[str | EvalCase] | None = None

Queries to evaluate. Falls back to the persona's test_queries when omitted.

dimensions: list[EvalDimension] | None = None

Scoring dimensions. Defaults to relevance, safety, instruction_adherence.

judge: str | "ChatBot | None" = None

Judge model string or ChatBot instance used to score responses.

default_guards: bool = True

Whether to apply persona default guards (passed through to persona_pack()).

scorecard_path: str | Path | None = None

If provided, writes the scorecard JSON to this path after evaluation.

Returns

EvalResults

Combined results with one variant per model.

Raises

ValueError

If models is empty.

Examples

Compare a persona across two providers:

import talk_box as tb

results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    judge="anthropic:claude-sonnet-4-6",
)
results.to_scorecard("scorecards/code_reviewer.json")
results.to_great_table()
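
Pass explicit queries and write the scorecard directly via scorecard_path (a minimal sketch; the query strings are illustrative placeholders, not part of the API):

import talk_box as tb

results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    queries=[
        # Hypothetical queries; substitute prompts relevant to your persona.
        "Review this function for readability issues.",
        "Flag any security problems in this diff.",
    ],
    judge="anthropic:claude-sonnet-4-6",
    scorecard_path="scorecards/code_reviewer.json",
)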