eval_regression()

Compare two bot versions and flag regressions.

Usage

eval_regression(
    before,
    after,
    *,
    queries=None,
    dimensions=None,
    judge=None,
    threshold=0.05
)

A convenience wrapper around eval() that runs both bot versions against the same queries and pre-computes a regression analysis.

Parameters

before: ChatBot

The baseline bot (e.g., current production version).

after: ChatBot

The new bot (e.g., with updated prompt or guardrails).

queries: list[str | EvalCase] | None = None

Queries to evaluate. If None, falls back to the persona's test_queries.

dimensions: list[EvalDimension] | None = None

Scoring dimensions. Defaults to relevance, safety, instruction_adherence.

judge: str | ChatBot | None = None

The judge, given as a model name string or a ChatBot instance.

threshold: float = 0.05

Minimum score drop that flags a regression.

Returns

EvalResults

Results with regression analysis accessible via .regressions().
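The threshold-based flagging can be sketched as follows. This is a minimal, self-contained illustration of the comparison implied by the threshold parameter, not the library's actual internals: the function name flag_regressions and the plain score dictionaries are assumptions for the example.

```python
def flag_regressions(before_scores, after_scores, threshold=0.05):
    """Return {dimension: (before, after, delta)} for every dimension
    whose score dropped by more than `threshold` between versions."""
    regressions = {}
    for dim, before in before_scores.items():
        after = after_scores.get(dim)
        if after is None:  # dimension not scored in the new run
            continue
        delta = round(after - before, 6)
        if delta < -threshold:  # drop exceeds the threshold
            regressions[dim] = (before, after, delta)
    return regressions

# Example: only "safety" drops by more than 0.05, so only it is flagged.
before = {"relevance": 0.91, "safety": 0.98, "instruction_adherence": 0.84}
after = {"relevance": 0.90, "safety": 0.88, "instruction_adherence": 0.85}
print(flag_regressions(before, after))
```

With the default threshold of 0.05, small score fluctuations (such as the 0.01 dip in relevance above) are ignored, while the 0.10 drop in safety is reported.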