eval_model_update()

Compare persona behavior across two model versions.

Usage

eval_model_update(
    persona,
    *,
    before,
    after,
    queries=None,
    dimensions=None,
    judge=None,
    threshold=0.05,
    default_guards=True,
    scorecard_path=None
)

A convenience wrapper around eval_suite() that builds two bot variants from the same persona (one per model string) and flags any dimensions where the newer model regresses compared to the older one.
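
The regression check itself is simple arithmetic. A minimal sketch of the per-dimension rule, using the default dimensions and threshold (the scores are made up, and this loop is illustrative rather than the library's internal code):

# Illustrative only: mean score per dimension for each variant (0-1 scale assumed).
before_scores = {"relevance": 0.91, "safety": 0.97, "instruction_adherence": 0.88}
after_scores = {"relevance": 0.92, "safety": 0.96, "instruction_adherence": 0.81}

threshold = 0.05  # the default: flag drops of more than 5%
regressions = {
    dim: round(before_scores[dim] - after_scores[dim], 4)
    for dim in before_scores
    if before_scores[dim] - after_scores[dim] > threshold
}
print(regressions)  # {'instruction_adherence': 0.07}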

Parameters

persona: str

Persona name to evaluate (e.g., "code_reviewer").

before: str

Provider:model string for the baseline model (e.g., "anthropic:claude-sonnet-4-5").

after: str

Provider:model string for the new model (e.g., "anthropic:claude-sonnet-4-6").

queries: list[str | EvalCase] | None = None

Queries to evaluate. If None, falls back to the persona's test_queries; see the combined example after this parameter list.

dimensions: list[EvalDimension] | None = None

Scoring dimensions. Defaults to relevance, safety, instruction_adherence.

judge: str | ChatBot | None = None

Judge model string (provider:model) or a ChatBot instance used to score responses.

threshold: float = 0.05

Minimum score drop to flag as a regression (default 0.05, i.e. 5% on a 0–1 score scale).

default_guards: bool = True

Whether to apply the persona's default guards.

scorecard_path: str | Path | None = None

If provided, writes the scorecard JSON to this path.
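
Several of these parameters compose naturally. The call below passes plain query strings, a looser threshold, and a scorecard path (the query text and file path are illustrative placeholders):

import talk_box as tb

results = tb.eval_model_update(
    "code_reviewer",
    before="anthropic:claude-sonnet-4-5",
    after="anthropic:claude-sonnet-4-6",
    queries=[
        "Review this function for SQL injection risks.",
        "Is this error handling idiomatic Python?",
    ],
    threshold=0.10,  # only flag drops larger than 10%
    scorecard_path="scorecards/model_update.json",
)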

Returns

EvalResults
Results with two variants (named after the model strings). Use .regressions(baseline=before, threshold=threshold) to inspect dimension-level drops, or .to_great_table() / .scorecard_table() for a visual comparison.
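
For example, to check drops against the explicit baseline (the exact return shape of .regressions() is not documented here, so treat it as an opaque mapping of dimension-level drops):

drops = results.regressions(
    baseline="anthropic:claude-sonnet-4-5",
    threshold=0.05,
)
results.scorecard_table()  # visual comparison of the two variants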

Raises

ValueError
If before= and after= are the same string.
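
For example, passing the same model string for both sides fails fast:

# Raises ValueError: before= and after= must differ.
tb.eval_model_update(
    "code_reviewer",
    before="anthropic:claude-sonnet-4-5",
    after="anthropic:claude-sonnet-4-5",
)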

Examples

import talk_box as tb

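# Compare the persona's behavior across the two model versions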
results = tb.eval_model_update(
    "code_reviewer",
    before="anthropic:claude-sonnet-4-5",
    after="anthropic:claude-sonnet-4-6",
    judge="anthropic:claude-sonnet-4-6",
)

# Check for regressions
drops = results.regressions()
if drops:
    print("Regressions detected:", drops)

# Visual comparison
results.to_great_table()
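
In a CI job, the same check can gate a model rollout; the nonzero exit is a plain-Python convention, not part of talk_box:

import sys

# Fail the job if any dimension regressed beyond the threshold.
if results.regressions():
    sys.exit(1)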