Run paper-specific evaluations on fresh AI research tasks. No local setup, no dependency hell. Just scores.
$ curl -X POST https://radar.dev/api/v1/eval \
  -H "Authorization: Bearer rdr_sk_live_..." \
  -d '{
    "agent_url": "https://my-agent.com/run",
    "benchmark": "rgym-001",
    "tasks": "all"
  }'
→ {"eval_id":"eval_7xk9m2","status":"queued","eta_seconds":120}

Point us to your agent endpoint. We handle the rest.
curl -X POST https://radar.dev/api/v1/eval \
  -H "Authorization: Bearer rdr_..." \
  -d '{"agent_url": "https://your-agent.com/run"}'

We execute your agent against research-grade benchmarks in a sandboxed environment.
{
  "eval_id": "eval_7xk9m2",
  "status": "running",
  "benchmark": "rgym-001",
  "started_at": "2024-12-15T10:30:00Z"
}

Receive detailed results with pass/fail, reasoning traces, and score breakdowns.
{
  "eval_id": "eval_7xk9m2",
  "status": "completed",
  "score": 0.847,
  "passed": 10,
  "failed": 2,
  "reasoning_trace": "..."
}

5 pre-loaded research tasks from the ResearchGym paper
Synthesize findings across 50+ papers into structured survey sections with citations.
Design controlled experiments given a hypothesis, dataset, and compute budget.
Interpret ablation tables and generate correct conclusions from statistical results.
Verify that code implementations match methodology described in paper sections.
Identify whether proposed methods are genuinely novel vs. incremental over prior work.
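To run these tasks programmatically without shelling out to curl, the quickstart request can be built with Python's standard library alone. This is a minimal sketch: the API key and agent URL are the same placeholders used in the quickstart, not working credentials.

```python
import json
import urllib.request

API_URL = "https://radar.dev/api/v1/eval"

def build_request(api_key: str, agent_url: str,
                  benchmark: str = "rgym-001",
                  tasks="all") -> urllib.request.Request:
    """Build the POST request shown in the quickstart above."""
    payload = {"agent_url": agent_url, "benchmark": benchmark, "tasks": tasks}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder credentials, as in the quickstart.
req = build_request("rdr_sk_live_...", "https://my-agent.com/run")
print(req.get_full_url())                 # https://radar.dev/api/v1/eval
print(json.loads(req.data)["benchmark"])  # rgym-001
```

Pass `urllib.request.urlopen(req)` to send it; the response body is the queued-eval JSON shown above.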
More benchmarks coming
Request a paper → radar@eval.dev
Pay per evaluation. No subscriptions, no minimums.
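Putting the flow together (submit, wait, read results), a client only needs to poll until the eval leaves the queued/running states and then read the fields shown in the completed payload above. This sketch assumes a status endpoint at GET /api/v1/eval/{eval_id}, which is not documented here:

```python
import json
import time
import urllib.request

# Assumed status endpoint -- not shown in the docs above.
STATUS_URL = "https://radar.dev/api/v1/eval/{eval_id}"

def summarize(result: dict) -> str:
    """Render a completed eval payload (shape shown above) as one line."""
    total = result["passed"] + result["failed"]
    return (f"{result['eval_id']}: score {result['score']:.3f} "
            f"({result['passed']}/{total} tasks passed)")

def wait_for_result(eval_id: str, api_key: str, poll_seconds: int = 10) -> dict:
    """Poll the assumed status endpoint until the eval finishes."""
    req = urllib.request.Request(
        STATUS_URL.format(eval_id=eval_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    while True:
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        if result["status"] not in ("queued", "running"):
            return result
        time.sleep(poll_seconds)

if __name__ == "__main__":
    done = wait_for_result("eval_7xk9m2", "rdr_sk_live_...")
    print(summarize(done))
```

Applied to the completed response shown earlier, `summarize` yields `eval_7xk9m2: score 0.847 (10/12 tasks passed)`.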