Evaluation

Prompt engineering is iterative. You change a constraint, swap a model, add a guardrail, and then need to know: did things get better or worse? Talk Box’s eval system answers that question by running your chatbot against test queries, scoring responses with a judge model, and surfacing regressions before they reach users.

This guide covers single-bot evaluation, variant comparison, regression detection, and reporting.

Quick Start

The simplest eval takes a bot and queries, scores responses on three default dimensions (relevance, safety, instruction adherence), and returns structured results:

```python
import talk_box as tb

bot = tb.ChatBot().persona_pack("code_reviewer")
results = tb.eval(bot, queries=["Review this function for issues"])
print(results.summary())
```

```
{'total_queries': 1, 'total_results': 1, 'variants': ['code_reviewer'], 'dimensions': ['relevance', 'safety', 'instruction_adherence'], 'scores_by_variant': {'code_reviewer': {'relevance': 0.92, 'safety': 1.0, 'instruction_adherence': 0.88}}, 'overall_scores': {'code_reviewer': 0.9333333333333332}}
```
The summary shows scores per variant per dimension, making it easy to see where a bot excels and where it falls short.
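The overall score in the Quick Start output appears to be the plain mean of the per-dimension scores, which you can verify by hand (this is an inference from the numbers shown, not a documented guarantee):

```python
# Reproducing the overall score from the Quick Start summary.
# Assumption: "overall" is the unweighted mean of the dimension scores.
scores = {"relevance": 0.92, "safety": 1.0, "instruction_adherence": 0.88}

overall = sum(scores.values()) / len(scores)
print(overall)  # ~0.9333, matching overall_scores in the summary
```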
How It Works
The eval pipeline has three stages:
- Run: Each query is sent to the bot via `.chat()`. The response and timing are captured.
- Judge: A separate judge model scores each response on the requested dimensions (0.0 to 1.0).
- Aggregate: Scores are collected into an `EvalResults` object with reporting and analysis methods.
The judge model receives the query, the bot’s response, and persona context (role, constraints, avoid topics) so it can assess instruction adherence meaningfully.
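The three stages can be sketched in plain Python. Everything below (the `run_eval` helper, the `EvalResult` dataclass, the stub bot and judge) is an illustrative stand-in, not Talk Box’s actual implementation:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the run -> judge -> aggregate pipeline.
@dataclass
class EvalResult:
    query: str
    response: str
    scores: dict = field(default_factory=dict)

def run_eval(bot, judge, queries, dimensions):
    results = []
    for query in queries:
        response = bot(query)                       # 1. Run: send the query to the bot
        scores = {dim: judge(query, response, dim)  # 2. Judge: score each dimension
                  for dim in dimensions}
        results.append(EvalResult(query, response, scores))
    return results                                  # 3. Aggregate: collect results

# Stub bot and judge for demonstration only
def bot(query):
    return f"Echo: {query}"

def judge(query, response, dim):
    # Toy judge: full marks if the response mentions the query
    return 1.0 if query in response else 0.0

results = run_eval(bot, judge, ["hello"], ["relevance", "safety"])
print(results[0].scores)  # {'relevance': 1.0, 'safety': 1.0}
```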
Scoring Dimensions
Talk Box ships with six scoring dimensions. By default, evaluations use relevance, safety, and instruction adherence:
| Dimension | What It Measures |
|---|---|
| `relevance` | Is the response on-topic and directly helpful? |
| `safety` | Is it free from harmful or inappropriate content? |
| `instruction_adherence` | Does it follow the persona’s constraints? |
| `tone` | Does it match the expected communication style? |
| `completeness` | Does it address all aspects of the query? |
| `conciseness` | Is it appropriately brief without padding? |
You can select any subset for evaluation:
```python
import talk_box as tb

results = tb.eval(
    tb.ChatBot().persona_pack("financial_advisor"),
    queries=["How should I start saving for retirement?"],
    dimensions=[
        tb.EvalDimension.RELEVANCE,
        tb.EvalDimension.TONE,
        tb.EvalDimension.COMPLETENESS,
        tb.EvalDimension.CONCISENESS,
    ],
)
```

```
relevance: 0.95 | Directly answers the retirement savings question
tone: 0.90 | Professional and approachable
completeness: 0.80 | Covers basics but could mention asset allocation
conciseness: 0.92 | Clear and focused without filler
```
Comparing Variants
The most powerful use of eval is comparing two or more bot configurations side by side. Change a constraint, add a guardrail, swap a model, then measure the impact:
```python
import talk_box as tb

results = tb.eval(
    variants={
        "baseline": tb.ChatBot().persona_pack("code_reviewer"),
        "with_citations": (
            tb.ChatBot()
            .persona_pack("code_reviewer")
            .guardrail(tb.must_cite_sources())
        ),
    },
    queries=[
        "Is this code vulnerable to SQL injection?",
        "How would you refactor this function?",
    ],
    judge="anthropic:claude-sonnet-4-6",
)

# See per-variant scores
for variant, scores in results.scores_by_variant().items():
    print(f"{variant}: {scores}")
```

```
baseline: {'relevance': 0.865, 'safety': 1.0, 'instruction_adherence': 0.775}
with_citations: {'relevance': 0.885, 'safety': 1.0, 'instruction_adherence': 0.91}
```
The variant comparison shows exactly how adding the must_cite_sources() guardrail affects instruction adherence scores, while verifying that relevance and safety don’t regress.
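A quick way to quantify the effect is to diff the two score dictionaries directly. This is plain Python over the printed scores, not a Talk Box helper:

```python
# Per-dimension deltas between two variants' mean scores.
baseline = {"relevance": 0.865, "safety": 1.0, "instruction_adherence": 0.775}
with_citations = {"relevance": 0.885, "safety": 1.0, "instruction_adherence": 0.91}

deltas = {dim: round(with_citations[dim] - baseline[dim], 3) for dim in baseline}
print(deltas)  # {'relevance': 0.02, 'safety': 0.0, 'instruction_adherence': 0.135}
```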
Regression Detection
When updating a persona or adding guardrails, you want confidence that quality didn’t drop. The eval_regression() function compares a “before” and “after” version and flags any dimension whose score drops by more than a threshold:
```python
import talk_box as tb

before = tb.ChatBot().persona_pack("customer_support_tier1")
after = (
    tb.ChatBot()
    .persona_pack("customer_support_tier1")
    .guardrail(tb.max_response_length(100))  # Might hurt completeness
)

results = tb.eval_regression(
    before=before,
    after=after,
    queries=["How do I reset my password?", "What's your return policy?"],
    threshold=0.05,
)

regressions = results.regressions(baseline="before", threshold=0.05)
if regressions:
    print("Regressions detected:")
    for variant, dims in regressions.items():
        for dim, delta in dims.items():
            print(f"  {dim}: {delta:+.2f}")
else:
    print("No regressions. Safe to deploy.")
```

```
Regressions detected:
  relevance: -0.17
```
This pattern is ideal for CI pipelines: run eval_regression() on every PR that touches persona definitions or guardrail configuration, and fail the build if quality drops.
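A minimal gate for such a pipeline might look like the sketch below. The `ci_gate` helper is hypothetical (not part of Talk Box); it operates on the nested `{variant: {dimension: delta}}` shape that `results.regressions()` returns in the example above:

```python
# Hypothetical CI gate: nonzero return code fails the build
# when any dimension regressed.
def ci_gate(regressions: dict) -> int:
    if not regressions:
        print("No regressions. Safe to deploy.")
        return 0
    for variant, dims in regressions.items():
        for dim, delta in dims.items():
            print(f"REGRESSION {variant}/{dim}: {delta:+.2f}")
    return 1

exit_code = ci_gate({"after": {"relevance": -0.17}})
print(exit_code)  # 1
```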
Model Comparison with eval_suite()
When you want to know which model runs a persona best, eval_suite() does the heavy lifting. Give it a persona name and a list of models; it creates a bot variant for each model, runs every query through all of them, and returns combined results:
```python
import talk_box as tb

results = tb.eval_suite(
    "code_reviewer",
    models=[
        "anthropic:claude-sonnet-4-6",
        "github:gpt-4o",
    ],
    judge="anthropic:claude-sonnet-4-6",
)
results.to_great_table()
```

```
anthropic:claude-sonnet-4-6: relevance: 0.92 | safety: 1.00 | instruction_adherence: 0.90
github:gpt-4o: relevance: 0.75 | safety: 1.00 | instruction_adherence: 0.68
```
Each variant in the results is named after its model string, so it’s immediately clear which model performed best on each dimension.
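To pick a winner programmatically, you can reduce the per-variant scores yourself. This is plain Python over the `{variant: {dimension: score}}` shape shown above, not a Talk Box helper:

```python
# Find the best-scoring model for each dimension.
scores = {
    "anthropic:claude-sonnet-4-6": {
        "relevance": 0.92, "safety": 1.00, "instruction_adherence": 0.90,
    },
    "github:gpt-4o": {
        "relevance": 0.75, "safety": 1.00, "instruction_adherence": 0.68,
    },
}

dimensions = ["relevance", "safety", "instruction_adherence"]
best = {dim: max(scores, key=lambda m: scores[m][dim]) for dim in dimensions}
print(best["relevance"])  # anthropic:claude-sonnet-4-6
```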
Persona default guards
By default, eval_suite() applies the persona’s default guards (if any are defined). To evaluate the raw model without guardrails, pass default_guards=False:
```python
results = tb.eval_suite(
    "financial_advisor",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    default_guards=False,
    judge="anthropic:claude-sonnet-4-6",
)
```

GitHub Copilot models
Talk Box supports the github provider through chatlas’ ChatGithub, which gives you access to a wide range of models through your GitHub Copilot subscription. Use the github: prefix in model strings:
```python
results = tb.eval_suite(
    "code_reviewer",
    models=[
        "github:gpt-4o",
        "github:o3-mini",
        "github:claude-sonnet-4-6",
    ],
    judge="anthropic:claude-sonnet-4-6",
)
```

The github provider requires a GITHUB_TOKEN environment variable. If you have GitHub Copilot set up locally, this is typically your existing token.
Scorecard Export
After running an evaluation, you can export the results as a JSON scorecard for tracking over time, committing to your repository, or publishing to a docs site:
```python
results.to_scorecard("scorecards/code_reviewer.json")
```

```json
{
  "generated_at": "2026-05-12T14:43:41.283369+00:00",
  "config": {
    "persona": "code_reviewer",
    "models": [
      "anthropic:claude-sonnet-4-6",
      "github:gpt-4o"
    ],
    "type": "suite"
  },
  "variants": {
    "anthropic:claude-sonnet-4-6": {
      "dimensions": {
        "relevance": 0.92,
        "safety": 1.0,
        "instruction_adherence": 0.9
      },
      "overall": 0.94,
      "num_queries": 1
    },
    "github:gpt-4o": {
      "dimensions": {
        "relevance": 0.75,
        "safety": 1.0,
        "instruction_adherence": 0.68
      },
      "overall": 0.81,
      "num_queries": 1
    }
  }
}
```
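Because the scorecard is plain JSON, downstream tooling can consume it without Talk Box installed. A sketch of reading one back with the standard library (the inline dict mirrors the structure above, trimmed to the fields used):

```python
import json

# Parse a scorecard and rank variants by overall score.
scorecard = json.loads("""
{
  "variants": {
    "anthropic:claude-sonnet-4-6": {"overall": 0.94},
    "github:gpt-4o": {"overall": 0.81}
  }
}
""")

ranked = sorted(
    scorecard["variants"].items(),
    key=lambda item: item[1]["overall"],
    reverse=True,
)
for name, data in ranked:
    print(f"{name}: {data['overall']:.2f}")
```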
The eval_suite() function also accepts a scorecard_path parameter for one-shot evaluate-and-export:
```python
results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    judge="anthropic:claude-sonnet-4-6",
    scorecard_path="scorecards/code_reviewer.json",
)
```

Using Persona Test Queries
Every persona ships with test_queries, which are representative questions that exercise the persona’s core capabilities. When you call eval() without explicit queries, it automatically uses these:
```python
import talk_box as tb

# Uses code_reviewer's built-in test_queries
bot = tb.ChatBot().persona_pack("code_reviewer")
results = tb.eval(bot, judge="anthropic:claude-sonnet-4-6")
print(f"Evaluated {len(results)} queries")
print(f"Overall passed (>0.7): {results.passed()}")
```

```
Evaluated 4 queries
Overall passed (>0.7): True
```
This makes it easy to run a quality check on any persona with zero configuration.
Reporting
EvalResults supports multiple output formats for analysis and presentation.
Summary Statistics
The .summary() method returns a dictionary with aggregate scores:
```python
import talk_box as tb

results = tb.eval(bot, queries=["How do I sort a list?"])
summary = results.summary()
print(f"Variants: {summary['variants']}")
print(f"Dimensions: {summary['dimensions']}")
print(f"Overall scores: {summary['overall_scores']}")
```

Great Tables Report
For visual comparison, .to_great_table() produces a formatted table showing mean scores per variant per dimension:
```python
import talk_box as tb

results = tb.eval(
    variants={"v1": bot_v1, "v2": bot_v2},
    queries=queries,
)
results.to_great_table()
```

DataFrame Export
For custom analysis, .to_dataframe() exports all results to a pandas DataFrame with one row per (variant, query, dimension) combination:
```python
import talk_box as tb

df = results.to_dataframe()

# Group by variant, compute mean scores
print(df.groupby("variant")["score"].mean())
```

The Pass/Fail Gate
The .passed() method checks whether all variants meet a minimum quality threshold. Use this in CI to gate deployments:
```python
import talk_box as tb

bot = tb.ChatBot().persona_pack("customer_support_tier1")
results = tb.eval(bot, judge="anthropic:claude-sonnet-4-6")
assert results.passed(threshold=0.75), (
    f"Bot failed quality gate: {results.summary()['overall_scores']}"
)
```

```
Passed (threshold=0.75): True
Passed (threshold=0.95): False
```
A threshold of 0.75 is a reasonable default for most use cases. Raise it for high-stakes personas (financial, medical) and lower it when testing against smaller local models.
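The gate’s semantics can be restated directly in a few lines. This sketch assumes (it is not confirmed by the source) that passing requires every variant’s overall score to meet the threshold:

```python
# Hypothetical restatement of the pass/fail check: every variant's
# overall score must be at or above the threshold.
def passed(overall_scores: dict, threshold: float = 0.7) -> bool:
    return all(score >= threshold for score in overall_scores.values())

overall_scores = {"customer_support_tier1": 0.9333333333333332}
print(passed(overall_scores, threshold=0.75))  # True
print(passed(overall_scores, threshold=0.95))  # False
```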
Public Scorecards
The scorecard_table() and sweep_table() functions turn scorecard and sweep JSON files into polished Great Tables with color-coded score cells, ready for embedding in docs, notebooks, or HTML reports.
Single-Persona Scorecard
After running eval_suite(), the .to_scorecard() output feeds directly into scorecard_table():
```python
import talk_box as tb

# Run evaluation
results = tb.eval_suite(
    "code_reviewer",
    models=["anthropic:claude-sonnet-4-6", "github:gpt-4o"],
    judge="anthropic:claude-sonnet-4-6",
    scorecard_path="scorecards/code_reviewer.json",
)

# Render from the saved file
tb.scorecard_table("scorecards/code_reviewer.json")
```

Or directly from the in-memory scorecard dict:

```python
tb.scorecard_table(results.to_scorecard())
```

Scorecard: code_reviewer
2025-05-07 · Judge: anthropic:claude-sonnet-4-6

| Model | relevance | safety | instruction_adherence | overall | queries |
|---|---|---|---|---|---|
| anthropic:claude-sonnet-4-6 | 0.960 | 1.000 | 0.920 | 0.960 | 3 |
| github:gpt-4o | 0.950 | 1.000 | 0.900 | 0.950 | 3 |
Sweep Summary
After a multi-persona sweep (via make eval-sweep or run_eval_sweep.py), render the combined results with sweep_table():
```python
import talk_box as tb

tb.sweep_table("scorecards/_sweeps/2025-05-07T12-00-00.json")
```

Eval Sweep Results
2025-05-07 · 5/5 passed (threshold ≥ 0.7) · 120s · Judge: anthropic:claude-sonnet-4-6

| Persona | anthropic:claude-sonnet-4-6 | github:gpt-4o | status |
|---|---|---|---|
| code_reviewer | 0.960 | 0.950 | PASS |
| financial_advisor | 1.000 | 0.990 | PASS |
| customer_support_tier1 | 0.970 | 0.960 | PASS |
| data_analyst | 0.980 | 0.970 | PASS |
| technical_writer | 0.980 | 0.770 | PASS |
Both functions accept either a file path or a dict, so they work seamlessly in scripts, notebooks, and Quarto documents. The tables use a red → yellow → green color scale on score columns, making it easy to spot weak spots at a glance.
Model Version Comparison
When a model provider releases a new version, use eval_model_update() to check whether your personas hold up. It runs the same persona against both model versions and flags any dimension that regresses beyond a threshold:
```python
import talk_box as tb

results = tb.eval_model_update(
    "code_reviewer",
    before="anthropic:claude-sonnet-4-5",
    after="anthropic:claude-sonnet-4-6",
    judge="anthropic:claude-sonnet-4-6",
)

# Inspect regressions (dimensions that dropped > 5%)
drops = results.regressions()
if drops:
    print("Regressions found:", drops)
else:
    print("No regressions — safe to upgrade!")

# Visual comparison
results.to_great_table()
```

```
No regressions — safe to upgrade!
```
You can also write the comparison to a scorecard for tracking over time:
```python
results = tb.eval_model_update(
    "financial_advisor",
    before="anthropic:claude-sonnet-4-5",
    after="anthropic:claude-sonnet-4-6",
    judge="anthropic:claude-sonnet-4-6",
    scorecard_path="scorecards/model_updates/financial_advisor_4-5_vs_4-6.json",
)
tb.scorecard_table(results.to_scorecard())
```