Run paper-specific evaluations on fresh AI research tasks. No local setup, no dependency hell. Just scores.
$ curl -X POST https://radar.dev/api/v1/eval \
  -H "Authorization: Bearer rdr_sk_live_..." \
  -d '{
    "agent_url": "https://my-agent.com/run",
    "benchmark": "rgym-001",
    "tasks": "all"
  }'
→ {"eval_id":"eval_7xk9m2","status":"queued","eta_seconds":120}

Point us to your agent endpoint. We handle the rest.
curl -X POST https://radar.dev/api/v1/eval \
  -H "Authorization: Bearer rdr_..." \
  -d '{"agent_url": "https://your-agent.com/run"}'

We execute your agent against research-grade benchmarks in a sandboxed environment.
{
  "eval_id": "eval_7xk9m2",
  "status": "running",
  "benchmark": "rgym-001",
  "started_at": "2024-12-15T10:30:00Z"
}

Receive detailed results with pass/fail, reasoning traces, and score breakdowns.
{
  "eval_id": "eval_7xk9m2",
  "status": "completed",
  "score": 0.847,
  "passed": 10,
  "failed": 2,
  "reasoning_trace": "..."
}

5 pre-loaded research tasks from the ResearchGym paper
Synthesize findings across 50+ papers into structured survey sections with citations.
Design controlled experiments given a hypothesis, dataset, and compute budget.
Interpret ablation tables and generate correct conclusions from statistical results.
Verify that code implementations match methodology described in paper sections.
Identify whether proposed methods are genuinely novel vs. incremental over prior work.
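To run these tasks programmatically without shelling out to curl, the quickstart request can be built with Python's standard library alone. This is a minimal sketch: the API key and agent URL are the same placeholders used in the quickstart, not working credentials.

```python
import json
import urllib.request

API_URL = "https://radar.dev/api/v1/eval"

def build_request(api_key: str, agent_url: str,
                  benchmark: str = "rgym-001",
                  tasks="all") -> urllib.request.Request:
    """Build the POST request shown in the quickstart above."""
    payload = {"agent_url": agent_url, "benchmark": benchmark, "tasks": tasks}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder credentials, as in the quickstart.
req = build_request("rdr_sk_live_...", "https://my-agent.com/run")
print(req.get_full_url())                 # https://radar.dev/api/v1/eval
print(json.loads(req.data)["benchmark"])  # rgym-001
```

Pass `urllib.request.urlopen(req)` to send it; the response body is the queued-eval JSON shown above.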
More benchmarks coming
Request a paper → radar@eval.dev
Pay per evaluation. No subscriptions, no minimums.
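Putting the flow together (submit, wait, read results), a client only needs to poll until the eval leaves the queued/running states and then read the fields shown in the completed payload above. This sketch assumes a status endpoint at GET /api/v1/eval/{eval_id}, which is not documented here:

```python
import json
import time
import urllib.request

# Assumed status endpoint -- not shown in the docs above.
STATUS_URL = "https://radar.dev/api/v1/eval/{eval_id}"

def summarize(result: dict) -> str:
    """Render a completed eval payload (shape shown above) as one line."""
    total = result["passed"] + result["failed"]
    return (f"{result['eval_id']}: score {result['score']:.3f} "
            f"({result['passed']}/{total} tasks passed)")

def wait_for_result(eval_id: str, api_key: str, poll_seconds: int = 10) -> dict:
    """Poll the assumed status endpoint until the eval finishes."""
    req = urllib.request.Request(
        STATUS_URL.format(eval_id=eval_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    while True:
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        if result["status"] not in ("queued", "running"):
            return result
        time.sleep(poll_seconds)

if __name__ == "__main__":
    done = wait_for_result("eval_7xk9m2", "rdr_sk_live_...")
    print(summarize(done))
```

Applied to the completed response shown earlier, `summarize` yields `eval_7xk9m2: score 0.847 (10/12 tasks passed)`.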