Evaluations

Evals are end-to-end simulations that measure agent quality. Scenarios, simulated users, fake I/O channels (Email / SMS / Contacts / Computer), and the simulated clock all live in the external bedrock-harness repo. Bedrock stores the resulting traces, checkpoint spans, and reviews. Use evals to:

Catch regressions when you change prompts, tools, or models
Benchmark a new model or reasoning_effort against your current setup
Build a durable, reviewable history of agent quality over time

Concepts

Object	Description
Scenario	A scripted simulation — users, channels, scheduled events, and checkpoints. Defined in the harness repo (Python) and versioned with git.
Eval agent	A `purpose=eval` Agent created in Bedrock for each run. Tagged with the scenario name and the harness git SHA. Not scheduled by the `wake_agents` cron.
Trace	A top-level tracing record produced by one `run_agent` invocation. The harness creates one trace per scenario run.
Checkpoint span	A span with `span_type="checkpoint"`. Its `metadata` carries `{passed, description, simulated_time, assertions[]}` for each assertion boundary in the scenario.
TraceReview / TraceCriterionScore	A human (or automated) review over a trace: weighted rubric scores (1–5) and an auto-derived `checkpoint_pass_rate` from the checkpoint spans.

Run Lifecycle

Dispatch: You run harness eval <scenario> --local (or point at a remote Bedrock). The harness creates a purpose=eval Agent tagged with the scenario name and the harness commit SHA, then invokes run_agent, which spawns the harness subprocess.
Simulate: Inside the subprocess, the harness plays scripted events on a simulated clock against fake I/O channels. Every LLM call, tool call, and checkpoint streams as a span into Bedrock over HTTP.
Score: Each checkpoint boundary writes a span with span_type="checkpoint" and metadata.passed. The harness process exits nonzero if any checkpoint failed.
Review: Humans (or automation) open the trace in the portal and attach a TraceReview with rubric scores and notes. The review’s checkpoint_pass_rate is derived from the checkpoint spans, so you don’t score checkpoints again manually.

Running an Eval

Evals are driven by the harness CLI, not a Bedrock REST endpoint:

# from the bedrock-harness repo
harness eval curl-wget-light --local

This will:

Create a purpose=eval agent in Bedrock (tagged scenario:curl-wget-light and git-sha:<sha>)
Spawn a harness subprocess, call run_agent, and stream the trace
Print the agent’s portal URL
Exit nonzero if any checkpoint fails

To target a remote Bedrock, drop --local and export BEDROCK_API_URL and BEDROCK_API_KEY (an org-scoped key).
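For CI or batch runs it can help to drive the CLI from a script. The sketch below is one possible wrapper, not part of the harness itself: the helper names `build_eval_command` and `run_eval` are hypothetical, and it assumes the `harness` executable is on your PATH. It relies only on the documented command shape and on the documented behavior that the process exits nonzero when a checkpoint fails.

```python
import os
import subprocess
from typing import Dict, List, Optional


def build_eval_command(scenario: str, local: bool = True) -> List[str]:
    """Build the argv for `harness eval <scenario> [--local]`."""
    cmd = ["harness", "eval", scenario]
    if local:
        cmd.append("--local")
    return cmd


def run_eval(scenario: str, local: bool = True,
             extra_env: Optional[Dict[str, str]] = None) -> bool:
    """Run one scenario and return True when the harness exits with code 0.

    For remote runs, pass BEDROCK_API_URL and BEDROCK_API_KEY via extra_env
    (or export them beforehand) and set local=False.
    """
    env = {**os.environ, **(extra_env or {})}
    # check=False: a nonzero exit is an expected outcome (a failed checkpoint),
    # so it is reported as a return value rather than raised as an exception.
    result = subprocess.run(build_eval_command(scenario, local), env=env, check=False)
    return result.returncode == 0
```

A caller might gate a deploy with `if not run_eval("curl-wget-light"): sys.exit(1)`. Note that a nonzero exit can also mean the harness itself failed to start, so check the printed output before treating it as a checkpoint failure.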

Inspecting a Run

List traces for an eval agent:

curl "https://api.bedrock.orinlabs.org/api/tracing/traces/list/?agent=AGENT_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"

Get the full trace with spans:

curl https://api.bedrock.orinlabs.org/api/tracing/traces/TRACE_ID/ \
  -H "Authorization: Bearer YOUR_API_KEY"

Filter to checkpoint spans to see assertion boundaries:

curl "https://api.bedrock.orinlabs.org/api/tracing/spans/?trace=TRACE_ID&span_type=checkpoint" \
  -H "Authorization: Bearer YOUR_API_KEY"
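The same checkpoint query can be issued from Python using only the standard library. This is a minimal sketch: the function names are hypothetical, and because the response envelope (bare list or paginated object) is not described here, the parsed JSON is returned untouched.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.bedrock.orinlabs.org"


def checkpoint_spans_url(trace_id: str, base_url: str = BASE_URL) -> str:
    """Build the spans URL filtered to one trace's checkpoint spans."""
    query = urllib.parse.urlencode({"trace": trace_id, "span_type": "checkpoint"})
    return f"{base_url}/api/tracing/spans/?{query}"


def fetch_checkpoint_spans(trace_id: str, api_key: str, base_url: str = BASE_URL):
    """GET the checkpoint spans for a trace and return the decoded JSON body."""
    request = urllib.request.Request(
        checkpoint_spans_url(trace_id, base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        # The response shape is an unknown here, so no fields are assumed.
        return json.loads(response.read().decode("utf-8"))
```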

Each checkpoint span’s metadata looks like:

{
  "passed": true,
  "description": "Agent replied to first inbound message within 2 minutes",
  "simulated_time": "2026-01-02T10:02:00Z",
  "assertions": [
    {"passed": true, "description": "Reply sent"},
    {"passed": true, "description": "Reply arrived within 2 minutes"}
  ]
}
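Given a list of these metadata objects, a short script can summarize a run locally. The sketch below is illustrative only: `summarize_checkpoints` is a hypothetical helper, and it assumes the pass rate is the fraction of checkpoints whose `passed` is true, which may differ from how Bedrock computes `checkpoint_pass_rate`.

```python
from typing import Any, Dict, List


def summarize_checkpoints(metadata_list: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize checkpoint metadata dicts shaped like the example above."""
    total = len(metadata_list)
    passed = sum(1 for meta in metadata_list if meta.get("passed") is True)
    # Collect (checkpoint description, assertion description) for each failed assertion.
    failed_assertions = [
        (meta.get("description", ""), assertion.get("description", ""))
        for meta in metadata_list
        for assertion in meta.get("assertions", [])
        if not assertion.get("passed")
    ]
    return {
        "total": total,
        "passed": passed,
        # Assumed definition: passed checkpoints / all checkpoints; None when there are none.
        "pass_rate": passed / total if total else None,
        "failed_assertions": failed_assertions,
    }
```

Listing failed assertions alongside their checkpoint description makes it easier to see which boundary broke without opening every span.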

Reviews

A TraceReview attaches to any trace and captures rubric scores and notes. The checkpoint_pass_rate field is computed on read from the trace’s checkpoint spans, so reviewers don’t score checkpoints manually. Create a review:

curl -X POST https://api.bedrock.orinlabs.org/api/tracing/traces/TRACE_ID/reviews/ \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "reviewer_name": "Alice",
    "status": "submitted",
    "overall_notes": "Agent responded promptly but missed follow-up timing.",
    "criterion_scores": [
      {
        "criterion_name": "responsiveness",
        "source": "base_rubric",
        "weight": 0.5,
        "score": 5,
        "justification": "Replied within 2 min"
      },
      {
        "criterion_name": "accuracy",
        "source": "base_rubric",
        "weight": 0.5,
        "score": 3,
        "justification": "Wrong timezone on reminder"
      }
    ]
  }'
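When reviews are produced by automation, it may be convenient to build the request body in code. The following sketch mirrors the field names in the request above; the helper names are hypothetical, and the 1 to 5 range check is a local guard based on the rubric scale described under Concepts, not a statement about server-side validation.

```python
import json
from typing import Dict, List


def criterion_score(name: str, score: int, weight: float, justification: str,
                    source: str = "base_rubric") -> Dict[str, object]:
    """Build one criterion_scores entry, rejecting scores outside 1..5."""
    if not 1 <= score <= 5:
        raise ValueError(f"score for {name!r} must be between 1 and 5, got {score}")
    return {
        "criterion_name": name,
        "source": source,
        "weight": weight,
        "score": score,
        "justification": justification,
    }


def build_review_payload(reviewer_name: str, overall_notes: str,
                         scores: List[Dict[str, object]],
                         status: str = "submitted") -> str:
    """Serialize a review body suitable for POSTing to the reviews endpoint."""
    return json.dumps({
        "reviewer_name": reviewer_name,
        "status": status,
        "overall_notes": overall_notes,
        "criterion_scores": scores,
    })
```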

Read back the review (includes computed weighted_score and checkpoint_pass_rate):

curl https://api.bedrock.orinlabs.org/api/tracing/reviews/REVIEW_ID/ \
  -H "Authorization: Bearer YOUR_API_KEY"

Update scores or flip status to submitted:

curl -X PATCH https://api.bedrock.orinlabs.org/api/tracing/reviews/REVIEW_ID/ \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "submitted",
    "criterion_scores": [
      {"criterion_name": "responsiveness", "score": 4, "justification": "..."}
    ]
  }'

Score sources

TraceCriterionScore.source distinguishes rubric-derived criteria from ad-hoc ones:

Source	Meaning
`base_rubric`	Part of the scenario’s rubric — feeds `TraceReview.weighted_score`.
`reviewer_added`	Ad-hoc criterion added during review — informational only, not rolled into `weighted_score`.
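To make the distinction concrete, here is a sketch of how a weighted score could be derived from criterion scores. The exact formula Bedrock uses is not stated here; this example assumes a weight-normalized mean over `base_rubric` entries only, and `weighted_score` is a hypothetical local helper rather than the server implementation.

```python
from typing import Dict, List, Optional


def weighted_score(criterion_scores: List[Dict]) -> Optional[float]:
    """Weight-normalized mean of base_rubric scores (assumed formula)."""
    # reviewer_added criteria are informational, so they are filtered out.
    rubric = [c for c in criterion_scores if c.get("source") == "base_rubric"]
    total_weight = sum(c["weight"] for c in rubric)
    if total_weight == 0:
        return None  # no rubric criteria, so nothing to average
    return sum(c["weight"] * c["score"] for c in rubric) / total_weight


scores = [
    {"criterion_name": "responsiveness", "source": "base_rubric", "weight": 0.5, "score": 5},
    {"criterion_name": "accuracy", "source": "base_rubric", "weight": 0.5, "score": 3},
    {"criterion_name": "tone", "source": "reviewer_added", "weight": 1.0, "score": 1},
]
print(weighted_score(scores))  # 4.0 under the assumed formula: (0.5*5 + 0.5*3) / 1.0
```

Under this assumption the ad-hoc "tone" score leaves the result unchanged, which matches the "informational only" description in the table.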

Reproducibility

Each eval run is anchored by:

The eval Agent, which carries git-sha:<harness_sha> as a tag and was created at a known wall-clock time
A HarnessRun row stamped with the harness subprocess’s resolved git SHA, its origin, and a link to the trace
The trace’s spans, which include every LLM call, tool call, and checkpoint with simulated timestamps

Delete the eval agent to garbage-collect its traces; scenarios themselves live in the harness repo and are versioned there.
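When comparing runs, it is handy to pull the scenario name and harness SHA back out of an eval agent's tags. The sketch below assumes tags are available as a list of `key:value` strings such as `scenario:curl-wget-light` and `git-sha:<sha>`; how the agent API returns tags is not covered here, and `parse_eval_tags` is a hypothetical helper.

```python
from typing import Dict, List


def parse_eval_tags(tags: List[str]) -> Dict[str, str]:
    """Split `key:value` tags into a dict, skipping tags without a colon."""
    parsed: Dict[str, str] = {}
    for tag in tags:
        # partition splits on the first colon only, so values may contain colons.
        key, sep, value = tag.partition(":")
        if sep:
            parsed[key] = value
    return parsed
```

Two runs can then be treated as comparable when both the `scenario` and `git-sha` values match.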

Scenarios Live in the Harness

Scenarios are Python files under harness/src/harness/evals/ in the bedrock-harness repo. There is no REST endpoint for listing or creating scenarios — contribute code to add one, and the harness CLI will pick it up.

API Summary

Method	Endpoint	Description
`GET`	`/api/tracing/traces/list/?agent=<id>`	List traces for an eval agent
`GET`	`/api/tracing/traces/{id}/`	Get a trace with spans
`GET`	`/api/tracing/spans/?trace=<id>&span_type=checkpoint`	Filter to checkpoint spans
`GET` / `POST`	`/api/tracing/traces/{trace_id}/reviews/`	List / create reviews on a trace
`GET` / `PATCH`	`/api/tracing/reviews/{review_id}/`	Get / update a review
