
krishdef7/gemini-cli-eval-toolkit


Gemini CLI Eval Toolkit

GSoC 2026 Proof-of-Concept — Project #23331: Behavioral Evals, Quality, and the OSS Community

A three-stage pipeline for behavioral eval coverage inventory, gap analysis, and automated eval generation for Gemini CLI.

59 passing tests · 4,243 lines of TypeScript · Matches the real evalTest(policy, { name, prompt, assert }) API from evals/test-helper.ts


Why This Exists

Gemini CLI's eval directory tests model-level behaviors (tool selection, output format), but has sparse coverage for the hook lifecycle — the code-path-determined behaviors where BeforeAgent, AfterAgent, BeforeModel, and BeforeTool must fire correctly regardless of what the model generates.

I know this because I've been inside these code paths. My merged PRs fixed regressions in:

| PR | What broke | Hook path |
|---|---|---|
| #21383 | BeforeAgent/AfterAgent fired inconsistently during recursive sendMessageStream | Hook sequencing |
| #20419 | Transcript not flushed before BeforeTool dispatch on pure tool-call responses | Transcript lifecycle |
| #21541 | YOLO mode override — BeforeModel hook skipped | Model override |
| #21239 | Partial llm_request in BeforeModel caused undefined access | Config propagation |

None of these regressions had evals that would have caught them. This toolkit generates those evals — and builds the infrastructure for the community to generate more.


Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                     EVAL TOOLKIT PIPELINE                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌───────────────┐   ┌───────────────┐   ┌────────────────┐  │
│  │   INVENTORY   │──▶│  GAP ANALYSIS │──▶│    GENERATE    │  │
│  │  AST Scanner  │   │  Coverage Map │   │ evalTemplates  │  │
│  │ @babel/parser │   │ Known Surface │   │ test-helper.js │  │
│  └───────────────┘   └───────────────┘   └────────────────┘  │
│         ▲                                        ▲           │
│         │                                        │           │
│  ┌──────┴───────┐                ┌───────────────┴────────┐  │
│  │ evals/*.ts   │                │ Chat Logs (JSON)       │  │
│  │ (existing)   │                │ ChatRecordingService   │  │
│  └──────────────┘                │ + functionCallWithId   │  │
│                                  └────────────────────────┘  │
│  Policy Classification                                       │
│  ┌─────────────────────┐  ┌─────────────────────┐            │
│  │    ALWAYS_PASSES    │  │   USUALLY_PASSES    │            │
│  │ Deterministic       │  │ Model-dependent     │            │
│  │ Failures = bugs     │  │ Failures = quality  │            │
│  │ Run in CI           │  │ Run nightly         │            │
│  └─────────────────────┘  └─────────────────────┘            │
└──────────────────────────────────────────────────────────────┘
```

Stage 1: Inventory — Scans eval files using @babel/parser + @babel/traverse for AST-based extraction. Extracts test names, categories, assertion types, coverage targets, and policy classification.

Stage 2: Gap Analysis — Diffs extracted metadata against a known feature surface registry. Produces severity-ranked gaps with related PR references and auditable confidence factors.

Stage 3: Generation — Two paths: (a) Chat log → eval: ingests ChatRecordingService JSON (including functionCallWithId format), extracts behavioral patterns, generates evals. (b) Gap → eval: scaffolds evals targeting specific coverage gaps. All generated evals use the real evalTest(policy, { name, prompt, assert }) API from evals/test-helper.ts.
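For illustration, here is the shape of a generated eval. Only the `evalTest(policy, { name, prompt, assert })` signature comes from `evals/test-helper.ts`; the local stub, the assertion body, and the `hookEvents` rig shape below are hypothetical, included only so the sketch runs standalone:

```typescript
// Minimal local stub of the evalTest shape from evals/test-helper.ts,
// so this sketch is self-contained; in the repo you import the real helper.
type EvalCase = {
  name: string;
  prompt: string;
  assert: (rig: unknown) => boolean | Promise<boolean>;
};
function evalTest(policy: 'ALWAYS_PASSES' | 'USUALLY_PASSES', c: EvalCase) {
  return { policy, ...c };
}

// Hypothetical generated eval targeting the BeforeModel hook path.
const generated = evalTest('ALWAYS_PASSES', {
  name: 'BeforeModel fires exactly once per model call',
  prompt: 'List the files in the current directory',
  assert: (rig) => {
    // EvalTestRig is unknown upstream, so generated assertions cast
    // narrowly to exactly the fields they inspect.
    const events = (rig as { hookEvents: string[] }).hookEvents;
    return events.filter((e) => e === 'BeforeModel').length === 1;
  },
});
```

Because the upstream rig type is `unknown`, the narrow cast keeps the generated assertion honest about what it actually depends on.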


Quick Start

```sh
npm install
npm test           # 59 passing tests
npm run demo       # Full pipeline demonstration

# CLI commands
npx tsx src/cli.ts inventory ./path/to/evals/
npx tsx src/cli.ts gaps ./path/to/evals/ --severity critical
npx tsx src/cli.ts generate ./fixtures/sample-chat-log.json --dry-run
```

What's Included

Behavioral Evals (31 tests across 4 files)

| File | Tests | Regression | Coverage |
|---|---|---|---|
| beforeAgentAfterAgent.eval.ts | 7 | #21383 | Single-fire, injection persistence, ordering, subagent isolation |
| beforeModel.eval.ts | 10 | #21541, #21239 | Model override, partial request, deep merge, YOLO mode |
| beforeTool.eval.ts | 6 | #20419 | Transcript flushing, denial semantics, multi-tool independence |
| sessionLifecycle.eval.ts | 8 | | Unique session IDs, SessionEnd idempotency, single-fire |

Each eval uses standalone simulators that model the production hook dispatch from packages/core/src/hooks/hookRunner.ts. See MONOREPO_INTEGRATION.md for the exact import replacements.

Pipeline Tests (28 tests)

Covers chat log parsing (including functionCallWithId format, negative paths, empty input), behavioral pattern extraction (tool sequences, debug loops, clarification, error recovery), eval generation (template output, policy classification, confidence factors), and coverage mapping (gap detection, severity ranking, recommendations).
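The four behavioral pattern types those tests exercise can be modeled as a small discriminated union. A sketch, with hypothetical type and function names (the toolkit's real extractor is more involved):

```typescript
// Hypothetical representation of the patterns extracted from chat logs.
type BehavioralPattern =
  | { kind: 'tool-sequence'; tools: string[] }
  | { kind: 'debug-loop'; failingTool: string; attempts: number }
  | { kind: 'clarification'; question: string }
  | { kind: 'error-recovery'; error: string; recoveryTool: string };

// Toy detector: treat the same tool invoked twice in a row as a debug loop.
function detectDebugLoop(toolCalls: string[]): BehavioralPattern | null {
  for (let i = 1, run = 1; i < toolCalls.length; i++) {
    run = toolCalls[i] === toolCalls[i - 1] ? run + 1 : 1;
    if (run >= 2) {
      return { kind: 'debug-loop', failingTool: toolCalls[i], attempts: run };
    }
  }
  return null;
}
```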

Generated Evals (8 files from demo run)

All use the real evalTest('ALWAYS_PASSES', { name, prompt, assert }) API. All hook evals correctly derive phase from hook name (BeforeAgent → 'before', AfterAgent → 'after').
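The phase derivation is mechanical. A minimal sketch, assuming a hypothetical helper name (the generator's internal naming may differ):

```typescript
// Derive the lifecycle phase from a hook name,
// e.g. BeforeAgent -> 'before', AfterAgent -> 'after'.
function phaseFromHookName(hook: string): 'before' | 'after' {
  if (hook.startsWith('Before')) return 'before';
  if (hook.startsWith('After')) return 'after';
  throw new Error(`Unrecognized hook name: ${hook}`);
}
```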

Eval-Author Skill

.gemini/skills/eval-author/SKILL.md

A Gemini CLI skill with YAML frontmatter, discoverable via /skills list and activatable via /skills reload. Provides guided eval authoring workflow with policy classification framework, common patterns, anti-patterns, and integration with /fix-behavioral-eval. This is the actual GSoC deliverable described in the project scope — no other candidate has built one.

CI Workflow

.github/workflows/eval-coverage.yml

PR-triggered eval gap detection for changes to packages/core/src/hooks/**, packages/core/src/tools/**, and packages/core/src/prompt/**. Runs the gap analyzer and comments on PRs with critical coverage gaps.

Documentation

| File | Purpose |
|---|---|
| CONTRIBUTING.md | End-to-end eval authoring workflow with /fix-behavioral-eval integration |
| MONOREPO_INTEGRATION.md | Real production file paths for hook runner, scheduler, transcript service |

The ALWAYS_PASSES / USUALLY_PASSES Distinction

Every generated eval includes a confidenceFactors array that makes the classification auditable:

```ts
{
  policy: 'ALWAYS_PASSES',
  stabilityScore: 0.9,
  confidenceFactors: [
    'Deterministic code path (+0.4)',
    'Hook behavior is independent of model output (+0.3)',
    'Known regression #21383 — behavior verified by fix (+0.2)',
    'Single observation source (-0.1)',
  ]
}
```

ALWAYS_PASSES (stability ≥ 0.8): The behavior is fully determined by code paths. A failure is a bug. Run in CI on every PR. Examples: hook injection ordering, config propagation, transcript flushing.

USUALLY_PASSES (stability < 0.8): The behavior depends on model generation. A failure suggests a quality regression. Run nightly. Examples: tool selection, clarification-seeking, debug loop ordering.
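Given those thresholds, the classification itself is a small function. A sketch, assuming hypothetical helper names; the 0.8 cutoff is the one stated above, and the factor-weight parsing is an illustrative way to accumulate the parenthesized deltas from the confidenceFactors labels:

```typescript
type Policy = 'ALWAYS_PASSES' | 'USUALLY_PASSES';

// Classify by accumulated stability score; 0.8 is the documented threshold.
function classify(stabilityScore: number): Policy {
  return stabilityScore >= 0.8 ? 'ALWAYS_PASSES' : 'USUALLY_PASSES';
}

// Hypothetical accumulator: sum the signed weights embedded in factor
// labels such as 'Deterministic code path (+0.4)'.
function scoreFromFactors(factors: string[], base = 0): number {
  return factors.reduce((score, factor) => {
    const match = factor.match(/\(([+-]\d*\.?\d+)\)/);
    return match ? score + parseFloat(match[1]) : score;
  }, base);
}
```

Keeping the weight inside the human-readable label is what makes the classification auditable: the score can always be re-derived from the factors a reviewer sees.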


Project Structure

```
.
├── .gemini/skills/eval-author/    ← Gemini CLI skill (discoverable)
│   └── SKILL.md
├── .github/workflows/             ← CI integration
│   └── eval-coverage.yml
├── src/
│   ├── types/index.ts             ← Type system mirroring gemini-cli internals
│   ├── inventory/
│   │   ├── evalScanner.ts         ← AST-based eval file scanner
│   │   └── coverageMapper.ts      ← Coverage diffing + gap identification
│   ├── generator/
│   │   ├── chatLogParser.ts       ← ChatRecordingService JSON parser
│   │   ├── evalTemplates.ts       ← 5 Vitest templates (real API shape)
│   │   └── evalGenerator.ts       ← Orchestration with confidence factors
│   ├── evals/hooks/
│   │   ├── beforeAgentAfterAgent.eval.ts
│   │   ├── beforeModel.eval.ts
│   │   ├── beforeTool.eval.ts
│   │   └── sessionLifecycle.eval.ts
│   ├── harness/
│   │   ├── evalTest.ts            ← Stub matching real test-helper.ts API
│   │   └── types.ts               ← EvalTestRig = unknown (honest, not guessed)
│   ├── cli.ts                     ← CLI with --dry-run support
│   ├── demo.ts                    ← End-to-end pipeline demo
│   └── index.ts                   ← Public API
├── tests/toolkit.test.ts          ← 28 pipeline + robustness tests
├── fixtures/sample-chat-log.json  ← 25-message log triggering all 4 patterns
├── output/generated-evals/        ← 8 generated eval files
├── CONTRIBUTING.md                ← Eval authoring workflow
├── MONOREPO_INTEGRATION.md        ← Real production file paths
└── README.md
```

GSoC Timeline Alignment

This prototype demonstrates deliverables across the first 8 weeks of the proposed 175-hour plan:

| Week | Deliverable | Prototype Status |
|---|---|---|
| 1–2 | Onboard to quality area, write hook lifecycle evals | ✅ 31 hook evals |
| 3–4 | Build eval coverage inventory + gap analyzer | ✅ AST scanner + coverage mapper |
| 5–7 | Chat log → eval pipeline, stabilize across model versions | ✅ Pipeline with all 4 pattern types |
| 8 | Contributor-facing skill for eval authoring | .gemini/skills/eval-author/SKILL.md |
| 9–10 | CI integration, /fix-behavioral-eval extension | ✅ Workflow + docs |
| 11–12 | Documentation, community dogfooding | ✅ CONTRIBUTING.md + MONOREPO_INTEGRATION.md |

Related Contributions


License

Apache-2.0
