eval_model_update()
Compare persona behavior across two model versions.
Usage
eval_model_update(
    persona,
    *,
    before,
    after,
    queries=None,
    dimensions=None,
    judge=None,
    threshold=0.05,
    default_guards=True,
    scorecard_path=None
)

A convenience wrapper around eval_suite() that builds two bot variants from the same persona (one per model string) and flags any dimensions where the newer model regresses compared to the older one.
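For intuition, here is a rough sketch of what the wrapper does. The ChatBot construction and the eval_suite() call shape shown here are assumptions for illustration, not the documented API.

import talk_box as tb

def eval_model_update_sketch(persona, *, before, after, **eval_kwargs):
    # Hypothetical internals -- the real eval_suite() signature may differ.
    if before == after:
        raise ValueError("before= and after= must be different model strings")
    # Build one bot variant per model string from the same persona.
    variants = {model: tb.ChatBot(persona, model=model) for model in (before, after)}
    return tb.eval_suite(variants, **eval_kwargs)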
Parameters
persona: str
    Persona name to evaluate (e.g., "code_reviewer").

before: str
    Provider:model string for the baseline model (e.g., "anthropic:claude-sonnet-4-5").

after: str
    Provider:model string for the new model (e.g., "anthropic:claude-sonnet-4-6").

queries: list[str | EvalCase] | None = None
    Queries to evaluate. Falls back to the persona's test_queries.

dimensions: list[EvalDimension] | None = None
    Scoring dimensions. Defaults to relevance, safety, and instruction_adherence.

judge: str | ChatBot | None = None
    Judge model string or a ChatBot instance.

threshold: float = 0.05
    Score drop to flag as a regression (default 0.05 = 5%).

default_guards: bool = True
    Whether to apply the persona's default guards.

scorecard_path: str | Path | None = None
    If provided, writes the scorecard JSON to this path.
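The optional parameters compose as in this sketch; the query text and output path are placeholders.

import talk_box as tb

results = tb.eval_model_update(
    "code_reviewer",
    before="anthropic:claude-sonnet-4-5",
    after="anthropic:claude-sonnet-4-6",
    queries=["Review this diff for SQL injection risks."],  # placeholder query
    threshold=0.02,  # flag score drops of 2% or more
    scorecard_path="scorecards/model_update.json",  # placeholder path
)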
Returns
EvalResults
    Results with two variants (named after the model strings). Use .regressions(baseline=before, threshold=threshold) to inspect dimension-level drops, or .to_great_table() / .scorecard_table() for a visual comparison.
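For example, to compare against the baseline variant explicitly with a stricter cutoff (the model string here mirrors the before= value):

# Dimension-level drops of 2% or more relative to the baseline variant
drops = results.regressions(baseline="anthropic:claude-sonnet-4-5", threshold=0.02)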
Raises
ValueError
    If before= and after= are the same string.
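For example, passing the same string for both raises immediately:

# Raises ValueError: the two model strings must differ
tb.eval_model_update(
    "code_reviewer",
    before="anthropic:claude-sonnet-4-6",
    after="anthropic:claude-sonnet-4-6",
)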
Examples
import talk_box as tb

results = tb.eval_model_update(
    "code_reviewer",
    before="anthropic:claude-sonnet-4-5",
    after="anthropic:claude-sonnet-4-6",
    judge="anthropic:claude-sonnet-4-6",
)

# Check for regressions
drops = results.regressions()
if drops:
    print("Regressions detected:", drops)

# Visual comparison
results.to_great_table()