
Evaluations

Evals are end-to-end simulations that measure agent quality. Scenarios, simulated users, fake I/O channels (Email / SMS / Contacts / Computer), and the simulated clock all live in the external bedrock-harness repo. Bedrock stores the resulting traces, checkpoint spans, and reviews. Use evals to:
  • Catch regressions when you change prompts, tools, or models
  • Benchmark a new model or reasoning_effort against your current setup
  • Build a durable, reviewable history of agent quality over time

Concepts

Object | Description
Scenario | A scripted simulation: users, channels, scheduled events, and checkpoints. Defined in the harness repo (Python) and versioned with git.
Eval agent | A purpose=eval Agent created in Bedrock for each run. Tagged with the scenario name and the harness git SHA. Not scheduled by the wake_agents cron.
Trace | A top-level tracing record produced by one run_agent invocation. The harness creates one trace per scenario run.
Checkpoint span | A span with span_type="checkpoint". Its metadata carries {passed, description, simulated_time, assertions[]} for each assertion boundary in the scenario.
TraceReview / TraceCriterionScore | A human (or automated) review over a trace: weighted rubric scores (1–5) and an auto-derived checkpoint_pass_rate from the checkpoint spans.

Run Lifecycle

  1. Dispatch — You run harness eval <scenario> --local (or point at a remote Bedrock). The harness creates a purpose=eval Agent tagged with the scenario name and the harness commit SHA, then invokes run_agent which spawns the harness subprocess.
  2. Simulate — Inside the subprocess, the harness plays scripted events on a simulated clock against fake I/O channels. Every LLM call, tool call, and checkpoint streams as a span into Bedrock over HTTP.
  3. Score — Each checkpoint boundary writes a span with span_type="checkpoint" and metadata.passed. The harness process exits nonzero if any checkpoint failed.
  4. Review — Humans (or automation) open the trace in the portal and attach a TraceReview with rubric scores + notes. The review’s checkpoint_pass_rate is derived from the checkpoint spans — you don’t score checkpoints again manually.

Running an Eval

Evals are driven by the harness CLI, not a Bedrock REST endpoint:
# from the bedrock-harness repo
harness eval curl-wget-light --local
This will:
  • Create a purpose=eval agent in Bedrock (with tags scenario:curl-wget-light + git-sha:<sha>)
  • Spawn a harness subprocess, hit run_agent, and stream the trace
  • Print the agent URL in the portal
  • Exit nonzero if any checkpoint fails
Point at a remote Bedrock by dropping --local and exporting BEDROCK_API_URL + BEDROCK_API_KEY (an org-scoped key).
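Because the harness exits nonzero when any checkpoint fails, a CI job can gate on its exit code. The following is a minimal Python sketch, not part of the harness: `build_eval_command` and `run_eval` are hypothetical helper names, and the only things assumed about the CLI are the `harness eval <scenario> [--local]` invocation and the two environment variables described above.

```python
import os
import subprocess
from typing import List, Mapping, Optional, Sequence


def build_eval_command(scenario: str, local: bool = True) -> List[str]:
    """Build the invocation shown above: harness eval <scenario> [--local]."""
    cmd = ["harness", "eval", scenario]
    if local:
        cmd.append("--local")
    return cmd


def run_eval(cmd: Sequence[str], extra_env: Optional[Mapping[str, str]] = None) -> bool:
    """Run the command and return True only when it exits with status 0.

    extra_env is merged over the current environment, e.g. to supply
    BEDROCK_API_URL and BEDROCK_API_KEY for a remote run.
    """
    env = dict(os.environ)
    if extra_env:
        env.update(extra_env)
    completed = subprocess.run(list(cmd), env=env, check=False)
    return completed.returncode == 0
```

For a remote run, something like `run_eval(build_eval_command("curl-wget-light", local=False), {"BEDROCK_API_URL": ..., "BEDROCK_API_KEY": ...})` would return False if any checkpoint failed, assuming the harness is installed and on the PATH.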

Inspecting a Run

List traces for an eval agent:
curl "https://api.bedrock.orinlabs.org/api/tracing/traces/list/?agent=AGENT_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"
Get the full trace with spans:
curl https://api.bedrock.orinlabs.org/api/tracing/traces/TRACE_ID/ \
  -H "Authorization: Bearer YOUR_API_KEY"
Filter to checkpoint spans to see assertion boundaries:
curl "https://api.bedrock.orinlabs.org/api/tracing/spans/?trace=TRACE_ID&span_type=checkpoint" \
  -H "Authorization: Bearer YOUR_API_KEY"
Each checkpoint span’s metadata looks like:
{
  "passed": true,
  "description": "Agent replied to first inbound message within 2 minutes",
  "simulated_time": "2026-01-02T10:02:00Z",
  "assertions": [
    {"passed": true, "description": "Reply sent"},
    {"passed": true, "description": "Reply arrived within 2 minutes"}
  ]
}
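Given metadata in this shape, a run can be summarized in a few lines. This is a sketch under assumptions: `summarize_checkpoints` is a hypothetical helper, its input is a list of metadata dicts already extracted from the span response (the response envelope is not documented here), and the pass rate is taken to be passed checkpoints divided by total checkpoints, which may differ from how Bedrock derives checkpoint_pass_rate.

```python
from typing import Any, Dict, List, Tuple


def summarize_checkpoints(checkpoints: List[Dict[str, Any]]) -> Tuple[float, List[str]]:
    """Return (pass_rate, failure_messages) for checkpoint metadata dicts.

    Assumption: pass_rate = passed checkpoints / total checkpoints.
    An empty list yields 0.0, which is a choice made for this sketch.
    """
    if not checkpoints:
        return 0.0, []

    passed_count = 0
    failures: List[str] = []
    for meta in checkpoints:
        if meta.get("passed"):
            passed_count += 1
            continue
        # Collect the descriptions of the individual assertions that failed.
        failed = [
            a.get("description", "<no description>")
            for a in meta.get("assertions", [])
            if not a.get("passed")
        ]
        detail = "; ".join(failed) if failed else "no failing assertion recorded"
        failures.append(f"{meta.get('description', '<no description>')}: {detail}")

    return passed_count / len(checkpoints), failures
```

Printing the failure messages next to the rate gives a quick textual diff between two runs of the same scenario without opening the portal.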

Reviews

A TraceReview attaches to any trace and captures rubric scores + notes. The checkpoint_pass_rate field is computed on read from the trace’s checkpoint spans, so reviewers don’t score checkpoints manually.

Create a review:
curl -X POST https://api.bedrock.orinlabs.org/api/tracing/traces/TRACE_ID/reviews/ \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "reviewer_name": "Alice",
    "status": "submitted",
    "overall_notes": "Agent responded promptly but missed follow-up timing.",
    "criterion_scores": [
      {
        "criterion_name": "responsiveness",
        "source": "base_rubric",
        "weight": 0.5,
        "score": 5,
        "justification": "Replied within 2 min"
      },
      {
        "criterion_name": "accuracy",
        "source": "base_rubric",
        "weight": 0.5,
        "score": 3,
        "justification": "Wrong timezone on reminder"
      }
    ]
  }'
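When reviews are created by automation, it can help to validate the body before sending it. The sketch below builds the same JSON body as the request above; `make_criterion_score` and `make_review_payload` are hypothetical helpers, and the client-side checks (score within 1–5, source one of base_rubric or reviewer_added) mirror what this page describes rather than any confirmed server-side validation.

```python
import json
from typing import Any, Dict, List

# The two source values described on this page.
ALLOWED_SOURCES = ("base_rubric", "reviewer_added")


def make_criterion_score(
    name: str,
    score: int,
    weight: float,
    justification: str,
    source: str = "base_rubric",
) -> Dict[str, Any]:
    """Build one criterion_scores entry, rejecting values outside the 1-5 rubric scale."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source: {source}")
    return {
        "criterion_name": name,
        "source": source,
        "weight": weight,
        "score": score,
        "justification": justification,
    }


def make_review_payload(
    reviewer_name: str,
    overall_notes: str,
    criterion_scores: List[Dict[str, Any]],
    status: str = "submitted",
) -> str:
    """Serialize a review body with the fields used in the curl example above."""
    return json.dumps(
        {
            "reviewer_name": reviewer_name,
            "status": status,
            "overall_notes": overall_notes,
            "criterion_scores": criterion_scores,
        }
    )
```

The returned string can be passed as the request body of the POST shown above.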
Read back the review (includes computed weighted_score and checkpoint_pass_rate):
curl https://api.bedrock.orinlabs.org/api/tracing/reviews/REVIEW_ID/ \
  -H "Authorization: Bearer YOUR_API_KEY"
Update scores or flip status to submitted:
curl -X PATCH https://api.bedrock.orinlabs.org/api/tracing/reviews/REVIEW_ID/ \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "submitted",
    "criterion_scores": [
      {"criterion_name": "responsiveness", "score": 4, "justification": "..."}
    ]
  }'

Score sources

TraceCriterionScore.source distinguishes rubric-derived vs ad-hoc criteria:
Source | Meaning
base_rubric | Part of the scenario’s rubric; feeds TraceReview.weighted_score.
reviewer_added | Ad-hoc criterion added during review; informational only, not rolled into weighted_score.
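To make the distinction concrete, here is one plausible way to compute a weighted score that ignores reviewer_added criteria. The exact formula Bedrock uses is not stated on this page; the sketch assumes a weight-normalized average over base_rubric entries, and `weighted_score` is a hypothetical function name.

```python
from typing import Any, Dict, List, Optional


def weighted_score(criterion_scores: List[Dict[str, Any]]) -> Optional[float]:
    """Weight-normalized average over base_rubric criteria only.

    Assumption: score = sum(weight * score) / sum(weight).
    Returns None when there are no base_rubric criteria or the weights sum to zero.
    """
    base = [c for c in criterion_scores if c.get("source") == "base_rubric"]
    total_weight = sum(c["weight"] for c in base)
    if total_weight == 0:
        return None
    return sum(c["weight"] * c["score"] for c in base) / total_weight
```

With the two criteria from the review created earlier (responsiveness 5 and accuracy 3, each weighted 0.5), this gives 4.0, and adding a reviewer_added criterion leaves that value unchanged.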

Reproducibility

Each eval run is anchored by:
  • The eval Agent, which carries git-sha:<harness_sha> as a tag and was created at a known wall-clock time
  • A HarnessRun row stamped with the harness subprocess’s resolved git SHA + origin + trace linkage
  • The trace’s spans, which include every LLM call, tool call, and checkpoint with simulated timestamps
Delete the eval agent to garbage-collect its traces; scenarios themselves live in the harness repo and are versioned there.
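The scenario and git-sha tags are what let you group runs by harness version. A small sketch of reading them follows; it assumes tags arrive as a list of "key:value" strings, as in scenario:curl-wget-light and git-sha:<sha> above, which is an assumption about the agent representation rather than a documented response shape. `parse_tags` is a hypothetical helper.

```python
from typing import Dict, List


def parse_tags(tags: List[str]) -> Dict[str, str]:
    """Split "key:value" tags into a dict, skipping tags that have no colon.

    Only the first colon separates key from value, so values may contain colons.
    """
    parsed: Dict[str, str] = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep:
            parsed[key] = value
    return parsed
```

For example, `parse_tags(agent_tags).get("git-sha")` would give the harness commit to check out when reproducing a run.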

Scenarios Live in the Harness

Scenarios are Python files under harness/src/harness/evals/ in the bedrock-harness repo. There is no REST endpoint for listing or creating scenarios — contribute code to add one, and the harness CLI will pick it up.

API Summary

Method | Endpoint | Description
GET | /api/tracing/traces/list/?agent=<id> | List traces for an eval agent
GET | /api/tracing/traces/{id}/ | Get a trace with spans
GET | /api/tracing/spans/?trace=<id>&span_type=checkpoint | Filter to checkpoint spans
GET / POST | /api/tracing/traces/{trace_id}/reviews/ | List / create reviews on a trace
GET / PATCH | /api/tracing/reviews/{review_id}/ | Get / update a review
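The same read endpoints can be called from Python with only the standard library. The sketch below builds the requests without sending them; the function names are hypothetical, the base URL and Bearer header are taken from the curl examples on this page, and the JSON shape of the responses is not assumed.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Base URL as used in the curl examples on this page.
BASE_URL = "https://api.bedrock.orinlabs.org"


def _authorized_get(url: str, api_key: str) -> Request:
    """Create a GET request carrying the Bearer token header."""
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})


def list_traces_request(agent_id: str, api_key: str, base_url: str = BASE_URL) -> Request:
    """GET /api/tracing/traces/list/?agent=<id>"""
    query = urlencode({"agent": agent_id})
    return _authorized_get(f"{base_url}/api/tracing/traces/list/?{query}", api_key)


def checkpoint_spans_request(trace_id: str, api_key: str, base_url: str = BASE_URL) -> Request:
    """GET /api/tracing/spans/?trace=<id>&span_type=checkpoint"""
    query = urlencode({"trace": trace_id, "span_type": "checkpoint"})
    return _authorized_get(f"{base_url}/api/tracing/spans/?{query}", api_key)
```

Passing one of these requests to `urllib.request.urlopen` and decoding the body with `json.load` would perform the call; inspect the returned JSON before relying on any particular field layout.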