Evaluations
Evals are end-to-end simulations that measure agent quality. Scenarios, simulated users, fake I/O channels (Email / SMS / Contacts / Computer), and the simulated clock all live in the external bedrock-harness repo. Bedrock stores the resulting traces, checkpoint spans, and reviews. Use evals to:
- Catch regressions when you change prompts, tools, or models
- Benchmark a new model or reasoning_effort against your current setup
- Build a durable, reviewable history of agent quality over time
Concepts
| Object | Description |
|---|---|
| Scenario | A scripted simulation — users, channels, scheduled events, and checkpoints. Defined in the harness repo (Python) and versioned with git. |
| Eval agent | A purpose=eval Agent created in Bedrock for each run. Tagged with the scenario name and the harness git SHA. Not scheduled by the wake_agents cron. |
| Trace | A top-level tracing record produced by one run_agent invocation. The harness creates one trace per scenario run. |
| Checkpoint span | A span with span_type="checkpoint". Its metadata carries {passed, description, simulated_time, assertions[]} for each assertion boundary in the scenario. |
| TraceReview / TraceCriterionScore | A human (or automated) review over a trace: weighted rubric scores (1–5) and an auto-derived checkpoint_pass_rate from the checkpoint spans. |
Run Lifecycle
- Dispatch: You run harness eval <scenario> --local (or point at a remote Bedrock). The harness creates a purpose=eval Agent tagged with the scenario name and the harness commit SHA, then invokes run_agent, which spawns the harness subprocess.
- Simulate: Inside the subprocess, the harness plays scripted events on a simulated clock against fake I/O channels. Every LLM call, tool call, and checkpoint streams as a span into Bedrock over HTTP.
- Score: Each checkpoint boundary writes a span with span_type="checkpoint" and metadata.passed. The harness process exits nonzero if any checkpoint failed.
- Review: Humans (or automation) open the trace in the portal and attach a TraceReview with rubric scores + notes. The review’s checkpoint_pass_rate is derived from the checkpoint spans, so you don’t score checkpoints again manually.
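The Score step can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the span dictionaries are hand-written samples built from the span_type and metadata.passed fields described above, and exit_code_for is a hypothetical helper name.

```python
def exit_code_for(spans):
    """Return 0 if every checkpoint span passed, 1 otherwise.

    Assumes each span is a dict with "span_type" and a "metadata" dict
    carrying a boolean "passed" (a hypothetical in-memory shape).
    """
    checkpoints = [s for s in spans if s.get("span_type") == "checkpoint"]
    failed = [s for s in checkpoints if not s["metadata"]["passed"]]
    return 1 if failed else 0


# Hand-written sample spans, not real harness output.
# "other" stands in for any non-checkpoint span type.
sample_spans = [
    {"span_type": "other", "metadata": {}},
    {"span_type": "checkpoint", "metadata": {"passed": True, "description": "first check"}},
    {"span_type": "checkpoint", "metadata": {"passed": False, "description": "second check"}},
]

print(exit_code_for(sample_spans))  # prints 1
```

A single failed checkpoint is enough to make the result nonzero, which is what lets a CI job gate on the harness exit status.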
Running an Eval
Evals are driven by the harness CLI, not a Bedrock REST endpoint. For example, harness eval curl-wget-light --local will:
- Create a purpose=eval agent in Bedrock (with tags scenario:curl-wget-light + git-sha:<sha>)
- Spawn a harness subprocess, hit run_agent, and stream the trace
- Print the agent URL in the portal
- Exit nonzero if any checkpoint fails
To target a remote Bedrock instead, drop --local and export BEDROCK_API_URL + BEDROCK_API_KEY (an org-scoped key).
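If you launch the CLI from a script, the invocation might be assembled like this. It is a sketch under stated assumptions: only the command shape harness eval <scenario> --local and the two environment variable names come from this page; build_eval_invocation is a hypothetical helper, and it assumes a remote run simply omits --local.

```python
import os


def build_eval_invocation(scenario, local=True, api_url=None, api_key=None):
    """Return (argv, env) for one harness eval run.

    Hypothetical helper: it only assembles the command and environment,
    it does not start the harness.
    """
    argv = ["harness", "eval", scenario]
    env = dict(os.environ)  # copy so the caller's environment is untouched
    if local:
        argv.append("--local")
    else:
        if not api_url or not api_key:
            raise ValueError("remote runs need BEDROCK_API_URL and BEDROCK_API_KEY")
        env["BEDROCK_API_URL"] = api_url
        env["BEDROCK_API_KEY"] = api_key
    return argv, env


argv, env = build_eval_invocation("curl-wget-light")
print(argv)  # ['harness', 'eval', 'curl-wget-light', '--local']
```

Passing the pair to subprocess.run(argv, env=env) and checking returncode gives a pass/fail signal, since the harness exits nonzero when a checkpoint fails.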
Inspecting a Run
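List traces for an eval agent with GET /api/tracing/traces/list/?agent=<id>, then filter a trace down to its checkpoint spans with GET /api/tracing/spans/?trace=<id>&span_type=checkpoint. Each checkpoint span’s metadata carries passed, description, simulated_time, and assertions[].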
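A minimal sketch of building these inspection requests with the Python standard library follows. The paths are the ones listed in the API Summary; the helper names, the Bearer authorization scheme, the example host, and the assumption that responses are JSON are illustrative guesses, so adjust them to your deployment.

```python
import json
import urllib.parse
import urllib.request


def traces_url(base_url, agent_id):
    """URL that lists traces for one eval agent."""
    query = urllib.parse.urlencode({"agent": agent_id})
    return f"{base_url.rstrip('/')}/api/tracing/traces/list/?{query}"


def checkpoint_spans_url(base_url, trace_id):
    """URL that filters a trace's spans down to checkpoints."""
    query = urllib.parse.urlencode({"trace": trace_id, "span_type": "checkpoint"})
    return f"{base_url.rstrip('/')}/api/tracing/spans/?{query}"


def get_json(url, api_key):
    """GET a URL and decode the JSON body.

    Assumption: the API accepts a Bearer token; check your auth scheme.
    """
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder host and trace id, for illustration only.
print(checkpoint_spans_url("https://bedrock.example.com", "abc123"))
# https://bedrock.example.com/api/tracing/spans/?trace=abc123&span_type=checkpoint
```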
Reviews
A TraceReview attaches to any trace and captures rubric scores + notes. The checkpoint_pass_rate field is computed on read from the trace’s checkpoint spans, so reviewers don’t manually score checkpoints.
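One plausible reading of that derivation is the fraction of checkpoint spans that passed. The sketch below assumes that definition; the function name, the in-memory span shape, and the None result for a trace without checkpoints are assumptions, not documented behavior.

```python
def checkpoint_pass_rate(spans):
    """Fraction of checkpoint spans whose metadata.passed is true.

    Returns None when the trace has no checkpoint spans (an assumption;
    the real field may report this case differently).
    """
    results = [
        bool(s["metadata"]["passed"])
        for s in spans
        if s.get("span_type") == "checkpoint"
    ]
    if not results:
        return None
    return sum(results) / len(results)


# Four hand-written checkpoint spans: three passed, one failed.
spans = [
    {"span_type": "checkpoint", "metadata": {"passed": True}},
    {"span_type": "checkpoint", "metadata": {"passed": True}},
    {"span_type": "checkpoint", "metadata": {"passed": False}},
    {"span_type": "checkpoint", "metadata": {"passed": True}},
]

print(checkpoint_pass_rate(spans))  # 0.75
```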
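Create a review with POST /api/tracing/traces/{trace_id}/reviews/.
Get a review with GET /api/tracing/reviews/{review_id}/ (the response includes weighted_score and checkpoint_pass_rate).
Update a review with PATCH /api/tracing/reviews/{review_id}/, for example to mark it as submitted.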
Score sources
TraceCriterionScore.source distinguishes rubric-derived vs ad-hoc criteria:
| Source | Meaning |
|---|---|
| base_rubric | Part of the scenario’s rubric; feeds TraceReview.weighted_score. |
| reviewer_added | Ad-hoc criterion added during review; informational only, not rolled into weighted_score. |
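To make the distinction concrete, here is a sketch of how a weighted score could be rolled up from criterion scores. The weighted-mean formula and the field names source, score, and weight are assumptions made for illustration; only the rule that reviewer_added criteria are excluded comes from the table above.

```python
def weighted_score(criterion_scores):
    """Weighted mean of base_rubric scores; reviewer_added scores are ignored.

    Assumes each item is a dict with "source", "score" (1-5), and "weight".
    Returns None when no base_rubric criterion carries weight.
    """
    rubric = [c for c in criterion_scores if c["source"] == "base_rubric"]
    total_weight = sum(c["weight"] for c in rubric)
    if total_weight == 0:
        return None
    return sum(c["score"] * c["weight"] for c in rubric) / total_weight


# Hand-written sample scores, not real review data.
scores = [
    {"source": "base_rubric", "score": 4, "weight": 3},
    {"source": "base_rubric", "score": 2, "weight": 1},
    {"source": "reviewer_added", "score": 1, "weight": 5},  # excluded from the roll-up
]

print(weighted_score(scores))  # 3.5
```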
Reproducibility
Each eval run is anchored by:
- The eval Agent, which carries git-sha:<harness_sha> as a tag and was created at a known wall-clock time
- A HarnessRun row stamped with the harness subprocess’s resolved git SHA + origin + trace linkage
- The trace’s spans, which include every LLM call, tool call, and checkpoint with simulated timestamps
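When comparing runs, it can help to pull the scenario name and harness SHA back out of an eval agent's tags. The sketch below assumes tags arrive as a list of "key:value" strings; parse_agent_tags is a hypothetical helper and the SHA shown is a placeholder.

```python
def parse_agent_tags(tags):
    """Split "key:value" tags into a dict, e.g. scenario and git-sha."""
    parsed = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep:  # skip tags that have no "key:value" form
            parsed[key] = value
    return parsed


# Placeholder tags; "0123abc" is not a real commit.
tags = ["scenario:curl-wget-light", "git-sha:0123abc"]
info = parse_agent_tags(tags)

print(info["scenario"], info["git-sha"])  # curl-wget-light 0123abc
```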
Scenarios Live in the Harness
Scenarios are Python files under harness/src/harness/evals/ in the bedrock-harness repo. There is no REST endpoint for listing or creating scenarios; contribute code to add one, and the harness CLI will pick it up.
API Summary
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/tracing/traces/list/?agent=<id> | List traces for an eval agent |
| GET | /api/tracing/traces/{id}/ | Get a trace with spans |
| GET | /api/tracing/spans/?trace=<id>&span_type=checkpoint | Filter to checkpoint spans |
| GET / POST | /api/tracing/traces/{trace_id}/reviews/ | List / create reviews on a trace |
| GET / PATCH | /api/tracing/reviews/{review_id}/ | Get / update a review |