Evaluation

Prompt engineering is iterative. You change a constraint, swap a model, add a guardrail, and then need to know: did things get better or worse? Talk Box’s eval system answers that question by running your chatbot against test queries, scoring responses with a judge model, and surfacing regressions before they reach users.

This guide covers single-bot evaluation, variant comparison, regression detection, and reporting.

Quick Start

The simplest eval takes a bot and queries, scores responses on three default dimensions (relevance, safety, instruction adherence), and returns structured results:

import talk_box as tb

bot = tb.ChatBot().persona_pack("code_reviewer")

results = tb.eval(bot, queries=["Review this function for issues"])
print(results.summary())
{'total_queries': 1, 'total_results': 1, 'variants': ['code_reviewer'], 'dimensions': ['relevance', 'safety', 'instruction_adherence'], 'scores_by_variant': {'code_reviewer': {'relevance': 0.92, 'safety': 1.0, 'instruction_adherence': 0.88}}, 'overall_scores': {'code_reviewer': 0.9333333333333332}}

The summary shows scores per variant per dimension, making it easy to see where a bot excels and where it falls short.

How It Works

The eval pipeline has three stages:

  1. Run: Each query is sent to the bot via .chat(). The response and timing are captured.
  2. Judge: A separate judge model scores each response on the requested dimensions (0.0 to 1.0).
  3. Aggregate: Scores are collected into an EvalResults object with reporting and analysis methods.

The judge model receives the query, the bot’s response, and persona context (role, constraints, avoid topics) so it can assess instruction adherence meaningfully.
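For intuition, here is a rough sketch of the kind of prompt the judge might receive. The exact prompt Talk Box constructs is internal; the helper below is purely illustrative.

# Purely illustrative: approximates what the judge model is shown.
# Talk Box's real judge prompt is internal and may differ.
def build_judge_prompt(query: str, response: str, persona: dict, dimensions: list[str]) -> str:
    return (
        f"Persona role: {persona['role']}\n"
        f"Constraints: {'; '.join(persona['constraints'])}\n"
        f"Avoid topics: {', '.join(persona['avoid_topics'])}\n\n"
        f"User query: {query}\n"
        f"Bot response: {response}\n\n"
        f"Score the response from 0.0 to 1.0 on each of: {', '.join(dimensions)}."
    )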

Scoring Dimensions

Talk Box ships with six scoring dimensions. By default, evaluations use relevance, safety, and instruction adherence:

Dimension               What It Measures
relevance               Is the response on-topic and directly helpful?
safety                  Is it free from harmful or inappropriate content?
instruction_adherence   Does it follow the persona’s constraints?
tone                    Does it match the expected communication style?
completeness            Does it address all aspects of the query?
conciseness             Is it appropriately brief without padding?

You can select any subset for evaluation:

import talk_box as tb

results = tb.eval(
    tb.ChatBot().persona_pack("financial_advisor"),
    queries=["How should I start saving for retirement?"],
    dimensions=[
        tb.EvalDimension.RELEVANCE,
        tb.EvalDimension.TONE,
        tb.EvalDimension.COMPLETENESS,
        tb.EvalDimension.CONCISENESS,
    ],
)

Each dimension comes back with a score from 0.0 to 1.0 and a brief judge rationale:

  relevance: 0.95 | Directly answers the retirement savings question
  tone: 0.90 | Professional and approachable
  completeness: 0.80 | Covers basics but could mention asset allocation
  conciseness: 0.92 | Clear and focused without filler

Comparing Variants

The most powerful use of eval is comparing two or more bot configurations side by side. Change a constraint, add a guardrail, swap a model, then measure the impact:

import talk_box as tb

results = tb.eval(
    variants={
        "baseline": tb.ChatBot().persona_pack("code_reviewer"),
        "with_citations": (
            tb.ChatBot()
            .persona_pack("code_reviewer")
            .guardrail(tb.must_cite_sources())
        ),
    },
    queries=[
        "Is this code vulnerable to SQL injection?",
        "How would you refactor this function?",
    ],
    judge="anthropic:claude-sonnet-4-6",
)

# See per-variant scores
for variant, scores in results.scores_by_variant().items():
    print(f"{variant}: {scores}")
baseline: {'relevance': 0.865, 'safety': 1.0, 'instruction_adherence': 0.775}
with_citations: {'relevance': 0.885, 'safety': 1.0, 'instruction_adherence': 0.91}

The variant comparison shows exactly how adding the must_cite_sources() guardrail affects instruction adherence scores, while verifying that relevance and safety don’t regress.
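Because scores_by_variant() returns plain per-dimension dictionaries, computing those deltas takes only a few lines:

scores = results.scores_by_variant()
baseline, candidate = scores["baseline"], scores["with_citations"]

# Positive deltas mean the guardrail variant improved on that dimension.
for dim in baseline:
    print(f"{dim}: {candidate[dim] - baseline[dim]:+.3f}")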

Regression Detection

When updating a persona or adding guardrails, you want confidence that quality didn’t drop. The eval_regression() function compares a “before” and “after” version and flags any dimension whose score drops by more than a threshold:

import talk_box as tb

before = tb.ChatBot().persona_pack("customer_support_tier1")
after = (
    tb.ChatBot()
    .persona_pack("customer_support_tier1")
    .guardrail(tb.max_response_length(100))  # Might hurt completeness
)

results = tb.eval_regression(
    before=before,
    after=after,
    queries=["How do I reset my password?", "What's your return policy?"],
    threshold=0.05,
)

regressions = results.regressions(baseline="before", threshold=0.05)
if regressions:
    print("Regressions detected:")
    for variant, dims in regressions.items():
        for dim, delta in dims.items():
            print(f"  {dim}: {delta:+.2f}")
else:
    print("No regressions. Safe to deploy.")
Regressions detected:
  relevance: -0.17

This pattern is ideal for CI pipelines: run eval_regression() on every PR that touches persona definitions or guardrail configuration, and fail the build if quality drops.
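In a CI script, the same check reduces to an exit code so the build fails automatically. The snippet below reuses the results object from the example above.

import sys

# `results` is the eval_regression() output from the example above.
if results.regressions(baseline="before", threshold=0.05):
    print("Quality regression detected; failing the build.")
    sys.exit(1)
print("No regressions detected.")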

Model Comparison with eval_suite()

When you want to know which model runs a persona best, eval_suite() does the heavy lifting. Give it a persona name and a list of models; it creates a bot variant for each model, runs every query through all of them, and returns combined results:

import talk_box as tb

results = tb.eval_suite(
    "code_reviewer",
    models=[
        "anthropic:claude-sonnet-4-6",
        "github:gpt-4o",
    ],
    judge="anthropic:claude-sonnet-4-6",
)
results.to_great_table()
anthropic:claude-sonnet-4-6: relevance: 0.92 | safety: 1.00 | instruction_adherence: 0.90
github:gpt-4o: relevance: 0.75 | safety: 1.00 | instruction_adherence: 0.68

Each variant in the results is named after its model string, so it’s immediately clear which model performed best on each dimension.
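Because of that naming, picking the winner programmatically is a one-liner over the overall_scores entry in .summary(), which is keyed by the same model strings:

overall = results.summary()["overall_scores"]
best = max(overall, key=overall.get)
print(f"Best overall: {best} ({overall[best]:.2f})")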

Persona Default Guards

By default, eval_suite() applies the persona’s default guards (if any are defined). To evaluate the raw model without guardrails, pass default_guards=False:

results = tb.eval_suite(
    "financial_advisor",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    default_guards=False,
    judge="anthropic:claude-sonnet-4-6",
)

GitHub Copilot Models

Talk Box supports the github provider through chatlas’ ChatGithub, which gives you access to a wide range of models through your GitHub Copilot subscription. Use the github: prefix in model strings:

results = tb.eval_suite(
    "code_reviewer",
    models=[
        "github:gpt-4o",
        "github:o3-mini",
        "github:claude-sonnet-4-6",
    ],
    judge="anthropic:claude-sonnet-4-6",
)
Note

The github provider requires a GITHUB_TOKEN environment variable. If you have GitHub Copilot set up locally, this is typically your existing token.
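A small preflight check (ordinary Python, not part of Talk Box) makes the failure mode obvious before a long eval run starts:

import os

# Fail fast with a clear message if the token the github: provider needs is missing.
if not os.environ.get("GITHUB_TOKEN"):
    raise RuntimeError("GITHUB_TOKEN is not set; github: models cannot authenticate.")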

Scorecard Export

After running an evaluation, you can export the results as a JSON scorecard for tracking over time, committing to your repository, or publishing to a docs site:

results.to_scorecard("scorecards/code_reviewer.json")
{
  "generated_at": "2026-05-12T14:43:41.283369+00:00",
  "config": {
    "persona": "code_reviewer",
    "models": [
      "anthropic:claude-sonnet-4-6",
      "github:gpt-4o"
    ],
    "type": "suite"
  },
  "variants": {
    "anthropic:claude-sonnet-4-6": {
      "dimensions": {
        "relevance": 0.92,
        "safety": 1.0,
        "instruction_adherence": 0.9
      },
      "overall": 0.94,
      "num_queries": 1
    },
    "github:gpt-4o": {
      "dimensions": {
        "relevance": 0.75,
        "safety": 1.0,
        "instruction_adherence": 0.68
      },
      "overall": 0.81,
      "num_queries": 1
    }
  }
}

The eval_suite() function also accepts a scorecard_path parameter for one-shot evaluate-and-export:

results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    judge="anthropic:claude-sonnet-4-6",
    scorecard_path="scorecards/code_reviewer.json",
)

Using Persona Test Queries

Every persona ships with test_queries, which are representative questions that exercise the persona’s core capabilities. When you call eval() without explicit queries, it automatically uses these:

import talk_box as tb

# Uses code_reviewer's built-in test_queries
bot = tb.ChatBot().persona_pack("code_reviewer")
results = tb.eval(bot, judge="anthropic:claude-sonnet-4-6")
print(f"Evaluated {len(results)} queries")
print(f"Overall passed (>0.7): {results.passed()}")
Evaluated 4 queries
Overall passed (>0.7): True

This makes it easy to run a quality check on any persona with zero configuration.

Reporting

EvalResults supports multiple output formats for analysis and presentation.

Summary Statistics

The .summary() method returns a dictionary with aggregate scores:

import talk_box as tb

bot = tb.ChatBot().persona_pack("code_reviewer")  # any configured bot works here
results = tb.eval(bot, queries=["How do I sort a list?"])
summary = results.summary()
print(f"Variants: {summary['variants']}")
print(f"Dimensions: {summary['dimensions']}")
print(f"Overall scores: {summary['overall_scores']}")

Great Tables Report

For visual comparison, .to_great_table() produces a formatted table showing mean scores per variant per dimension:

import talk_box as tb

# bot_v1, bot_v2, and queries stand in for whatever configurations and test
# queries you want to compare; for example:
bot_v1 = tb.ChatBot().persona_pack("code_reviewer")
bot_v2 = tb.ChatBot().persona_pack("code_reviewer").guardrail(tb.must_cite_sources())
queries = ["Is this code vulnerable to SQL injection?"]

results = tb.eval(
    variants={"v1": bot_v1, "v2": bot_v2},
    queries=queries,
)
results.to_great_table()

DataFrame Export

For custom analysis, .to_dataframe() exports all results to a pandas DataFrame with one row per (variant, query, dimension) combination:

import talk_box as tb

df = results.to_dataframe()
# Group by variant, compute mean scores
print(df.groupby("variant")["score"].mean())
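Assuming the frame also carries a dimension column alongside variant and score (the row layout described above suggests it, but the exact column name is an assumption), a pivot reproduces the per-dimension matrix:

# "variant" and "score" appear above; "dimension" is assumed from the row layout.
pivot = df.pivot_table(index="variant", columns="dimension", values="score", aggfunc="mean")
print(pivot.round(2))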

The Pass/Fail Gate

The .passed() method checks whether all variants meet a minimum quality threshold. Use this in CI to gate deployments:

import talk_box as tb

bot = tb.ChatBot().persona_pack("customer_support_tier1")
results = tb.eval(bot, judge="anthropic:claude-sonnet-4-6")

assert results.passed(threshold=0.75), (
    f"Bot failed quality gate: {results.summary()['overall_scores']}"
)
print(f"Passed (threshold=0.75): {results.passed(threshold=0.75)}")
print(f"Passed (threshold=0.95): {results.passed(threshold=0.95)}")
Passed (threshold=0.75): True
Passed (threshold=0.95): False

A threshold of 0.75 is a reasonable default for most use cases. Raise it for high-stakes personas (financial, medical) and lower it when testing against smaller local models.
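One way to encode that advice in CI is a per-persona threshold map; the numbers below are illustrative starting points, not recommendations baked into Talk Box:

import talk_box as tb

# Illustrative thresholds; tune them to your own risk tolerance.
THRESHOLDS = {
    "financial_advisor": 0.85,       # high-stakes persona, held to a higher bar
    "customer_support_tier1": 0.75,  # general default
}

for persona, threshold in THRESHOLDS.items():
    bot = tb.ChatBot().persona_pack(persona)
    results = tb.eval(bot, judge="anthropic:claude-sonnet-4-6")
    assert results.passed(threshold=threshold), f"{persona} failed its quality gate"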

Public Scorecards

The scorecard_table() and sweep_table() functions turn scorecard and sweep JSON files into polished Great Tables with color-coded score cells, ready for embedding in docs, notebooks, or HTML reports.

Single-Persona Scorecard

After running eval_suite(), the .to_scorecard() output feeds directly into scorecard_table():

import talk_box as tb

# Run evaluation
results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    judge="anthropic:claude-sonnet-4-6",
    scorecard_path="scorecards/code_reviewer.json",
)

# Render from the saved file
tb.scorecard_table("scorecards/code_reviewer.json")

Or directly from the in-memory scorecard dict:

tb.scorecard_table(results.to_scorecard())
Scorecard: code_reviewer
2025-05-07 · Judge: anthropic:claude-sonnet-4-6

                              relevance  safety  instruction_adherence  overall  queries
anthropic:claude-sonnet-4-6       0.960   1.000                   0.920    0.960        3
github:gpt-4o                     0.950   1.000                   0.900    0.950        3

Sweep Summary

After a multi-persona sweep (via make eval-sweep or run_eval_sweep.py), render the combined results with sweep_table():

import talk_box as tb

tb.sweep_table("scorecards/_sweeps/2025-05-07T12-00-00.json")
Eval Sweep Results
2025-05-07 · 5/5 passed (threshold ≥ 0.7) · 120s · Judge: anthropic:claude-sonnet-4-6

                          anthropic:claude-sonnet-4-6  github:gpt-4o  status
code_reviewer                                    0.960          0.950    PASS
financial_advisor                                1.000          0.990    PASS
customer_support_tier1                           0.970          0.960    PASS
data_analyst                                     0.980          0.970    PASS
technical_writer                                 0.980          0.770    PASS

Both functions accept either a file path or a dict, so they work seamlessly in scripts, notebooks, and Quarto documents. The tables use a red → yellow → green color scale on score columns, making weak dimensions easy to spot at a glance.
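If the returned object is a standard Great Tables GT (the color-scale styling suggests it is, but treat this as an assumption), it can also be exported as raw HTML for embedding in a static page:

tbl = tb.scorecard_table("scorecards/code_reviewer.json")

# Assumes scorecard_table() returns a great_tables GT; as_raw_html() is the
# standard Great Tables export for embedding in static HTML.
with open("code_reviewer_scorecard.html", "w") as f:
    f.write(tbl.as_raw_html())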

Model Version Comparison

When a model provider releases a new version, use eval_model_update() to check whether your personas hold up. It runs the same persona against both model versions and flags any dimension that regresses beyond a threshold:

import talk_box as tb

results = tb.eval_model_update(
    "code_reviewer",
    before="anthropic:claude-sonnet-4-5",
    after="anthropic:claude-sonnet-4-6",
    judge="anthropic:claude-sonnet-4-6",
)

# Inspect regressions (dimensions that dropped > 5%)
drops = results.regressions()
if drops:
    print("Regressions found:", drops)
else:
    print("No regressions — safe to upgrade!")

# Visual comparison
results.to_great_table()
No regressions — safe to upgrade!

You can also write the comparison to a scorecard for tracking over time:

results = tb.eval_model_update(
    "financial_advisor",
    before="anthropic:claude-sonnet-4-5",
    after="anthropic:claude-sonnet-4-6",
    judge="anthropic:claude-sonnet-4-6",
    scorecard_path="scorecards/model_updates/financial_advisor_4-5_vs_4-6.json",
)
tb.scorecard_table(results.to_scorecard())