No server. No signup. Multi-objective scoring from YAML specs. Deterministic code judges + customizable LLM judges, version-controlled in Git.
No cloud dependency. All data stays on your machine. Zero overhead to get started.
- Correctness, latency, cost, and safety measured in a single evaluation run.
- Deterministic code validators and customizable LLM judges, composable and extensible.
- Agent targets for direct LLM providers plus Claude Code, Codex, Pi, Copilot, and OpenCode.
- Structured criteria with weights and auto-generation, including Google ADK-style object rubrics.
- Side-by-side comparison of evaluation runs with statistical deltas and regression detection.
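As a rough illustration of how weighted multi-objective scoring and deterministic code judges fit together, here is a minimal sketch. The names `weighted_score` and `addition_judge`, the weight values, and the regression threshold are all assumptions for illustration, not AgentV's actual API.

```python
# Illustrative sketch of multi-objective scoring -- NOT AgentV's actual API.
# Each judge returns a score in [0, 1]; weights express relative importance.

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-objective scores into one number using normalized weights."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total

def addition_judge(output: str) -> float:
    """Deterministic code judge: a plain substring check, fully reproducible."""
    return 1.0 if "42" in output else 0.0

scores = {
    "correctness": addition_judge("15 + 27 = 42"),  # deterministic code judge
    "latency": 0.8,   # e.g. normalized against a latency budget
    "cost": 0.9,      # e.g. normalized against a cost budget
    "safety": 1.0,    # e.g. produced by an LLM judge
}
weights = {"correctness": 0.5, "latency": 0.2, "cost": 0.2, "safety": 0.1}

run_score = weighted_score(scores, weights)

# Regression detection: compare against a stored baseline run.
baseline = 0.97
regressed = (baseline - run_score) > 0.02  # threshold is an assumption

print(round(run_score, 2), regressed)  # → 0.94 True
```

Deterministic judges like `addition_judge` always return the same score for the same output, which is what makes run-to-run deltas meaningful.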
1. `npm install -g agentv`
2. `agentv init`
3. Copy `.env.example` to `.env` and add your API keys.
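A `.env` for a typical setup might look like the following; the exact variable names depend on which providers you target and are assumptions here, not AgentV's documented keys.

```shell
# Hypothetical .env contents -- variable names are assumptions.
# Only add keys for the providers you actually use.
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```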
Example spec (`evals/example.yaml`):

```yaml
description: Math evaluation
execution:
  target: default
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
```

Run it:

```shell
agentv eval ./evals/example.yaml
```

| Layer | Tool | When | What it does |
|---|---|---|---|
| Evaluate | AgentV | Pre-production | Score agents, detect regressions, gate CI/CD |
| Govern | Agent Control | Runtime | Enforce policies on agent actions |
| Observe | Langfuse | Runtime | Trace execution, monitor production |
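To use AgentV as a CI/CD gate as described in the table above, a pipeline step can run the eval and fail the job on a nonzero exit. This GitHub Actions fragment is a sketch: it assumes `agentv eval` exits nonzero when an evaluation fails, and the secret name is illustrative.

```yaml
# Hypothetical CI gate -- assumes `agentv eval` exits nonzero on failure.
name: agent-evals
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm install -g agentv
      - run: agentv eval ./evals/example.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # name is an assumption
```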