
krishdef7/gemini-cli-eval-toolkit


Gemini CLI Eval Toolkit

GSoC 2026 Proof-of-Concept — Project #23331: Behavioral Evals, Quality, and the OSS Community

A three-stage pipeline for behavioral eval coverage inventory, gap analysis, and automated eval generation for Gemini CLI.

59 passing tests · 4,243 lines of TypeScript · Matches the real evalTest(policy, { name, prompt, assert }) API from evals/test-helper.ts


Why This Exists

Gemini CLI's eval directory tests model-level behaviors (tool selection, output format), but has sparse coverage for the hook lifecycle — the code-path-determined behaviors where BeforeAgent, AfterAgent, BeforeModel, and BeforeTool must fire correctly regardless of what the model generates.

I know this because I've been inside these code paths. My merged PRs fixed regressions in:

| PR | What broke | Hook path |
|---|---|---|
| #21383 | BeforeAgent/AfterAgent fired inconsistently during recursive sendMessageStream | Hook sequencing |
| #20419 | Transcript not flushed before BeforeTool dispatch on pure tool-call responses | Transcript lifecycle |
| #21541 | YOLO mode override — BeforeModel hook skipped | Model override |
| #21239 | Partial llm_request in BeforeModel caused undefined access | Config propagation |

None of these regressions had evals that would have caught them. This toolkit generates those evals — and builds the infrastructure for the community to generate more.


Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                     EVAL TOOLKIT PIPELINE                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌───────────────┐   ┌───────────────┐   ┌────────────────┐  │
│  │   INVENTORY   │──▶│  GAP ANALYSIS │──▶│    GENERATE    │  │
│  │  AST Scanner  │   │  Coverage Map │   │ evalTemplates  │  │
│  │ @babel/parser │   │ Known Surface │   │ test-helper.js │  │
│  └───────────────┘   └───────────────┘   └────────────────┘  │
│         ▲                                        ▲           │
│         │                                        │           │
│  ┌──────┴───────┐                ┌───────────────┴────────┐  │
│  │ evals/*.ts   │                │ Chat Logs (JSON)       │  │
│  │ (existing)   │                │ ChatRecordingService   │  │
│  └──────────────┘                │ + functionCallWithId   │  │
│                                  └────────────────────────┘  │
│  Policy Classification                                       │
│  ┌─────────────────────┐  ┌─────────────────────┐            │
│  │    ALWAYS_PASSES    │  │   USUALLY_PASSES    │            │
│  │ Deterministic       │  │ Model-dependent     │            │
│  │ Failures = bugs     │  │ Failures = quality  │            │
│  │ Run in CI           │  │ Run nightly         │            │
│  └─────────────────────┘  └─────────────────────┘            │
└──────────────────────────────────────────────────────────────┘
```

Stage 1: Inventory — Scans eval files using @babel/parser + @babel/traverse for AST-based extraction. Extracts test names, categories, assertion types, coverage targets, and policy classification.

Stage 2: Gap Analysis — Diffs extracted metadata against a known feature surface registry. Produces severity-ranked gaps with related PR references and auditable confidence factors.

Stage 3: Generation — Two paths: (a) Chat log → eval: ingests ChatRecordingService JSON (including functionCallWithId format), extracts behavioral patterns, generates evals. (b) Gap → eval: scaffolds evals targeting specific coverage gaps. All generated evals use the real evalTest(policy, { name, prompt, assert }) API from evals/test-helper.ts.
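For illustration, here is the shape of a generated eval. Only the `evalTest(policy, { name, prompt, assert })` signature comes from `evals/test-helper.ts`; the local stub, the assertion body, and the `hookEvents` rig shape below are hypothetical, included only so the sketch runs standalone:

```typescript
// Minimal local stub of the evalTest shape from evals/test-helper.ts,
// so this sketch is self-contained; in the repo you import the real helper.
type EvalCase = {
  name: string;
  prompt: string;
  assert: (rig: unknown) => boolean | Promise<boolean>;
};
function evalTest(policy: 'ALWAYS_PASSES' | 'USUALLY_PASSES', c: EvalCase) {
  return { policy, ...c };
}

// Hypothetical generated eval targeting the BeforeModel hook path.
const generated = evalTest('ALWAYS_PASSES', {
  name: 'BeforeModel fires exactly once per model call',
  prompt: 'List the files in the current directory',
  assert: (rig) => {
    // EvalTestRig is unknown upstream, so generated assertions cast
    // narrowly to exactly the fields they inspect.
    const events = (rig as { hookEvents: string[] }).hookEvents;
    return events.filter((e) => e === 'BeforeModel').length === 1;
  },
});
```

Because the upstream rig type is `unknown`, the narrow cast keeps the generated assertion honest about what it actually depends on.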


Quick Start

```sh
npm install
npm test           # 59 passing tests
npm run demo       # Full pipeline demonstration

# CLI commands
npx tsx src/cli.ts inventory ./path/to/evals/
npx tsx src/cli.ts gaps ./path/to/evals/ --severity critical
npx tsx src/cli.ts generate ./fixtures/sample-chat-log.json --dry-run
```

What's Included

Behavioral Evals (31 tests across 4 files)

| File | Tests | Regression | Coverage |
|---|---|---|---|
| beforeAgentAfterAgent.eval.ts | 7 | #21383 | Single-fire, injection persistence, ordering, subagent isolation |
| beforeModel.eval.ts | 10 | #21541, #21239 | Model override, partial request, deep merge, YOLO mode |
| beforeTool.eval.ts | 6 | #20419 | Transcript flushing, denial semantics, multi-tool independence |
| sessionLifecycle.eval.ts | 8 | | Unique session IDs, SessionEnd idempotency, single-fire |

Each eval uses standalone simulators that model the production hook dispatch from packages/core/src/hooks/hookRunner.ts. See MONOREPO_INTEGRATION.md for the exact import replacements.

Pipeline Tests (28 tests)

Covers chat log parsing (including functionCallWithId format, negative paths, empty input), behavioral pattern extraction (tool sequences, debug loops, clarification, error recovery), eval generation (template output, policy classification, confidence factors), and coverage mapping (gap detection, severity ranking, recommendations).
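The four behavioral pattern types those tests exercise can be modeled as a small discriminated union. A sketch, with hypothetical type and function names (the toolkit's real extractor is more involved):

```typescript
// Hypothetical representation of the patterns extracted from chat logs.
type BehavioralPattern =
  | { kind: 'tool-sequence'; tools: string[] }
  | { kind: 'debug-loop'; failingTool: string; attempts: number }
  | { kind: 'clarification'; question: string }
  | { kind: 'error-recovery'; error: string; recoveryTool: string };

// Toy detector: treat the same tool invoked twice in a row as a debug loop.
function detectDebugLoop(toolCalls: string[]): BehavioralPattern | null {
  for (let i = 1, run = 1; i < toolCalls.length; i++) {
    run = toolCalls[i] === toolCalls[i - 1] ? run + 1 : 1;
    if (run >= 2) {
      return { kind: 'debug-loop', failingTool: toolCalls[i], attempts: run };
    }
  }
  return null;
}
```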

Generated Evals (8 files from demo run)

All use the real evalTest('ALWAYS_PASSES', { name, prompt, assert }) API. All hook evals correctly derive phase from hook name (BeforeAgent → 'before', AfterAgent → 'after').
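The phase derivation is mechanical. A minimal sketch, assuming a hypothetical helper name (the generator's internal naming may differ):

```typescript
// Derive the lifecycle phase from a hook name,
// e.g. BeforeAgent -> 'before', AfterAgent -> 'after'.
function phaseFromHookName(hook: string): 'before' | 'after' {
  if (hook.startsWith('Before')) return 'before';
  if (hook.startsWith('After')) return 'after';
  throw new Error(`Unrecognized hook name: ${hook}`);
}
```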

Eval-Author Skill

.gemini/skills/eval-author/SKILL.md

A Gemini CLI skill with YAML frontmatter, discoverable via /skills list and activatable via /skills reload. Provides guided eval authoring workflow with policy classification framework, common patterns, anti-patterns, and integration with /fix-behavioral-eval. This is the actual GSoC deliverable described in the project scope — no other candidate has built one.

CI Workflow

.github/workflows/eval-coverage.yml

PR-triggered eval gap detection for changes to packages/core/src/hooks/**, packages/core/src/tools/**, and packages/core/src/prompt/**. Runs the gap analyzer and comments on PRs with critical coverage gaps.

Documentation

| File | Purpose |
|---|---|
| CONTRIBUTING.md | End-to-end eval authoring workflow with /fix-behavioral-eval integration |
| MONOREPO_INTEGRATION.md | Real production file paths for hook runner, scheduler, transcript service |

The ALWAYS_PASSES / USUALLY_PASSES Distinction

Every generated eval includes a confidenceFactors array that makes the classification auditable:

```ts
{
  policy: 'ALWAYS_PASSES',
  stabilityScore: 0.9,
  confidenceFactors: [
    'Deterministic code path (+0.4)',
    'Hook behavior is independent of model output (+0.3)',
    'Known regression #21383 — behavior verified by fix (+0.2)',
    'Single observation source (-0.1)',
  ]
}
```

ALWAYS_PASSES (stability ≥ 0.8): The behavior is fully determined by code paths. A failure is a bug. Run in CI on every PR. Examples: hook injection ordering, config propagation, transcript flushing.

USUALLY_PASSES (stability < 0.8): The behavior depends on model generation. A failure suggests a quality regression. Run nightly. Examples: tool selection, clarification-seeking, debug loop ordering.
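Given those thresholds, the classification itself is a small function. A sketch, assuming hypothetical helper names; the 0.8 cutoff is the one stated above, and the factor-weight parsing is an illustrative way to accumulate the parenthesized deltas from the confidenceFactors labels:

```typescript
type Policy = 'ALWAYS_PASSES' | 'USUALLY_PASSES';

// Classify by accumulated stability score; 0.8 is the documented threshold.
function classify(stabilityScore: number): Policy {
  return stabilityScore >= 0.8 ? 'ALWAYS_PASSES' : 'USUALLY_PASSES';
}

// Hypothetical accumulator: sum the signed weights embedded in factor
// labels such as 'Deterministic code path (+0.4)'.
function scoreFromFactors(factors: string[], base = 0): number {
  return factors.reduce((score, factor) => {
    const match = factor.match(/\(([+-]\d*\.?\d+)\)/);
    return match ? score + parseFloat(match[1]) : score;
  }, base);
}
```

Keeping the weight inside the human-readable label is what makes the classification auditable: the score can always be re-derived from the factors a reviewer sees.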


Project Structure

```
.
├── .gemini/skills/eval-author/    ← Gemini CLI skill (discoverable)
│   └── SKILL.md
├── .github/workflows/             ← CI integration
│   └── eval-coverage.yml
├── src/
│   ├── types/index.ts             ← Type system mirroring gemini-cli internals
│   ├── inventory/
│   │   ├── evalScanner.ts         ← AST-based eval file scanner
│   │   └── coverageMapper.ts      ← Coverage diffing + gap identification
│   ├── generator/
│   │   ├── chatLogParser.ts       ← ChatRecordingService JSON parser
│   │   ├── evalTemplates.ts       ← 5 Vitest templates (real API shape)
│   │   └── evalGenerator.ts       ← Orchestration with confidence factors
│   ├── evals/hooks/
│   │   ├── beforeAgentAfterAgent.eval.ts
│   │   ├── beforeModel.eval.ts
│   │   ├── beforeTool.eval.ts
│   │   └── sessionLifecycle.eval.ts
│   ├── harness/
│   │   ├── evalTest.ts            ← Stub matching real test-helper.ts API
│   │   └── types.ts               ← EvalTestRig = unknown (honest, not guessed)
│   ├── cli.ts                     ← CLI with --dry-run support
│   ├── demo.ts                    ← End-to-end pipeline demo
│   └── index.ts                   ← Public API
├── tests/toolkit.test.ts          ← 28 pipeline + robustness tests
├── fixtures/sample-chat-log.json  ← 25-message log triggering all 4 patterns
├── output/generated-evals/        ← 8 generated eval files
├── CONTRIBUTING.md                ← Eval authoring workflow
├── MONOREPO_INTEGRATION.md        ← Real production file paths
└── README.md
```

GSoC Timeline Alignment

This prototype demonstrates deliverables across the first 8 weeks of the proposed 175-hour plan:

| Week | Deliverable | Prototype Status |
|---|---|---|
| 1–2 | Onboard to quality area, write hook lifecycle evals | ✅ 31 hook evals |
| 3–4 | Build eval coverage inventory + gap analyzer | ✅ AST scanner + coverage mapper |
| 5–7 | Chat log → eval pipeline, stabilize across model versions | ✅ Pipeline with all 4 pattern types |
| 8 | Contributor-facing skill for eval authoring | .gemini/skills/eval-author/SKILL.md |
| 9–10 | CI integration, /fix-behavioral-eval extension | ✅ Workflow + docs |
| 11–12 | Documentation, community dogfooding | ✅ CONTRIBUTING.md + MONOREPO_INTEGRATION.md |

Related Contributions


License

Apache-2.0
