GSoC 2026 Proof-of-Concept — Project #23331: Behavioral Evals, Quality, and the OSS Community
A three-stage pipeline for behavioral eval coverage inventory, gap analysis, and automated eval generation for Gemini CLI.
59 passing tests · 4,243 lines of TypeScript · Matches the real evalTest(policy, { name, prompt, assert }) API from evals/test-helper.ts
Gemini CLI's eval directory tests model-level behaviors (tool selection, output format), but has sparse coverage for the hook lifecycle — the code-path-determined behaviors where BeforeAgent, AfterAgent, BeforeModel, and BeforeTool must fire correctly regardless of what the model generates.
I know this because I've been inside these code paths. My PRs (two merged, two still open) address regressions in:
| PR | What broke | Hook path |
|---|---|---|
| #21383 | BeforeAgent/AfterAgent fired inconsistently during recursive sendMessageStream | Hook sequencing |
| #20419 | Transcript not flushed before BeforeTool dispatch on pure tool-call responses | Transcript lifecycle |
| #21541 | YOLO mode override: BeforeModel hook skipped | Model override |
| #21239 | Partial llm_request in BeforeModel caused undefined access | Config propagation |
None of these regressions had evals that would have caught them. This toolkit generates those evals — and builds the infrastructure for the community to generate more.
┌──────────────────────────────────────────────────────────────┐
│                    EVAL TOOLKIT PIPELINE                     │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐   ┌───────────────┐   ┌────────────────┐   │
│  │  INVENTORY   │──▶│ GAP ANALYSIS  │──▶│    GENERATE    │   │
│  │ AST Scanner  │   │ Coverage Map  │   │ evalTemplates  │   │
│  │ @babel/parser│   │ Known Surface │   │ test-helper.js │   │
│  └──────────────┘   └───────────────┘   └────────────────┘   │
│         ▲                                       ▲            │
│         │                                       │            │
│  ┌──────┴───────┐               ┌───────────────┴────────┐   │
│  │  evals/*.ts  │               │ Chat Logs (JSON)       │   │
│  │  (existing)  │               │ ChatRecordingService   │   │
│  └──────────────┘               │ + functionCallWithId   │   │
│                                 └────────────────────────┘   │
│                    Policy Classification                     │
│      ┌────────────────────┐     ┌─────────────────────┐      │
│      │ ALWAYS_PASSES      │     │ USUALLY_PASSES      │      │
│      │ Deterministic      │     │ Model-dependent     │      │
│      │ Failures = bugs    │     │ Failures = quality  │      │
│      │ Run in CI          │     │ Run nightly         │      │
│      └────────────────────┘     └─────────────────────┘      │
└──────────────────────────────────────────────────────────────┘
Stage 1: Inventory — Scans eval files using @babel/parser + @babel/traverse for AST-based extraction. Extracts test names, categories, assertion types, coverage targets, and policy classification.
Stage 2: Gap Analysis — Diffs extracted metadata against a known feature surface registry. Produces severity-ranked gaps with related PR references and auditable confidence factors.
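The diffing step in Stage 2 can be sketched roughly as follows. This is a minimal illustration, not the code in coverageMapper.ts: the `FeatureSurface` and `CoverageGap` shapes, the feature ids, and every severity level other than `critical` (which appears in the CLI flag) are assumptions made for the example.

```typescript
// Hypothetical shapes: the real registry and scanner output may differ.
type Severity = "critical" | "high" | "medium" | "low";

interface FeatureSurface {
  id: string; // illustrative id, e.g. "hooks.beforeAgent.singleFire"
  severity: Severity; // how costly a regression in this behavior would be
  relatedPRs: number[]; // PRs that touched this behavior
}

interface CoverageGap {
  featureId: string;
  severity: Severity;
  relatedPRs: number[];
}

// Lower rank sorts first, so critical gaps lead the report.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

// A gap is any registered feature that no scanned eval claims to cover.
function findGaps(
  registry: FeatureSurface[],
  coveredTargets: Set<string>,
): CoverageGap[] {
  return registry
    .filter((feature) => !coveredTargets.has(feature.id))
    .map((feature) => ({
      featureId: feature.id,
      severity: feature.severity,
      relatedPRs: feature.relatedPRs,
    }))
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
}

// Illustrative registry; the severities here are made up for the example.
const registry: FeatureSurface[] = [
  { id: "tools.selection", severity: "medium", relatedPRs: [] },
  { id: "hooks.beforeAgent.singleFire", severity: "critical", relatedPRs: [21383] },
  { id: "hooks.beforeTool.transcriptFlush", severity: "high", relatedPRs: [20419] },
];

const gaps = findGaps(registry, new Set(["tools.selection"]));
console.log(gaps.map((gap) => `${gap.severity}: ${gap.featureId}`));
```

Keeping the registry as plain data means a contributor can add a feature surface entry without touching the diffing logic.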
Stage 3: Generation — Two paths: (a) Chat log → eval: ingests ChatRecordingService JSON (including functionCallWithId format), extracts behavioral patterns, generates evals. (b) Gap → eval: scaffolds evals targeting specific coverage gaps. All generated evals use the real evalTest(policy, { name, prompt, assert }) API from evals/test-helper.ts.
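As a rough sketch of what the gap-to-eval path emits, the snippet below renders an eval source string in the evalTest(policy, { name, prompt, assert }) shape. The `EvalSpec` type, the `renderEval` helper, and the `async (rig) => { ... }` callback signature are assumptions for illustration; the actual templates live in evalTemplates.ts and the real callback signature is defined by evals/test-helper.ts.

```typescript
type Policy = "ALWAYS_PASSES" | "USUALLY_PASSES";

// Hypothetical input for the renderer; not the toolkit's real type.
interface EvalSpec {
  policy: Policy;
  name: string;
  prompt: string;
  assertBody: string; // raw source lines for the assert callback body
}

// Renders one eval as source text. JSON.stringify handles quoting and
// escaping so prompts with quotes or newlines stay valid string literals.
function renderEval(spec: EvalSpec): string {
  const bodyLines = spec.assertBody
    .split("\n")
    .map((line) => `    ${line}`);
  return [
    `evalTest(${JSON.stringify(spec.policy)}, {`,
    `  name: ${JSON.stringify(spec.name)},`,
    `  prompt: ${JSON.stringify(spec.prompt)},`,
    // Assumed callback shape: an async function receiving a test rig.
    `  assert: async (rig) => {`,
    ...bodyLines,
    `  },`,
    `});`,
  ].join("\n");
}

const source = renderEval({
  policy: "ALWAYS_PASSES",
  name: "BeforeModel fires in YOLO mode",
  prompt: "List the files in this directory",
  assertBody: "// TODO: assert that the BeforeModel hook ran",
});
console.log(source);
```

Generating a TODO body rather than a guessed assertion keeps the scaffold honest: a contributor fills in the check against the real rig.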
npm install
npm test # 59 passing tests
npm run demo # Full pipeline demonstration
# CLI commands
npx tsx src/cli.ts inventory ./path/to/evals/
npx tsx src/cli.ts gaps ./path/to/evals/ --severity critical
npx tsx src/cli.ts generate ./fixtures/sample-chat-log.json --dry-run

| File | Tests | Regression | Coverage |
|---|---|---|---|
| beforeAgentAfterAgent.eval.ts | 7 | #21383 | Single-fire, injection persistence, ordering, subagent isolation |
| beforeModel.eval.ts | 10 | #21541, #21239 | Model override, partial request, deep merge, YOLO mode |
| beforeTool.eval.ts | 6 | #20419 | Transcript flushing, denial semantics, multi-tool independence |
| sessionLifecycle.eval.ts | 8 | none | Unique session IDs, SessionEnd idempotency, single-fire |
Each eval uses standalone simulators that model the production hook dispatch from packages/core/src/hooks/hookRunner.ts. See MONOREPO_INTEGRATION.md for the exact import replacements.
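To give a feel for what a standalone simulator looks like, here is a minimal sketch of the single-fire check. It is not the code from the eval files and does not reproduce hookRunner.ts: the `HookSimulator` class and its recursion model are hypothetical, and it assumes that agent-level hooks should fire once per outermost turn while model and tool hooks fire on every round.

```typescript
type HookName = "BeforeAgent" | "AfterAgent" | "BeforeModel" | "BeforeTool";

// Hypothetical simulator: records hook firings in order so an eval can
// assert on counts and sequencing without calling a model.
class HookSimulator {
  readonly fired: HookName[] = [];
  private depth = 0;

  private fire(hook: HookName): void {
    this.fired.push(hook);
  }

  // Models a turn in which each tool round re-enters sendMessageStream.
  sendMessageStream(toolRounds: number): void {
    const isOutermost = this.depth === 0;
    // Assumption: agent-level hooks wrap only the outermost call.
    if (isOutermost) this.fire("BeforeAgent");
    this.depth++;
    try {
      this.fire("BeforeModel");
      if (toolRounds > 0) {
        this.fire("BeforeTool");
        this.sendMessageStream(toolRounds - 1);
      }
    } finally {
      this.depth--;
    }
    if (isOutermost) this.fire("AfterAgent");
  }

  count(hook: HookName): number {
    return this.fired.filter((name) => name === hook).length;
  }
}

const sim = new HookSimulator();
sim.sendMessageStream(2); // one turn with two tool rounds
console.log(sim.fired.join(" -> "));
```

With two tool rounds the recorded sequence has three BeforeModel entries and two BeforeTool entries, bracketed by exactly one BeforeAgent and one AfterAgent.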
The 28 tests in tests/toolkit.test.ts cover chat log parsing (including the functionCallWithId format, negative paths, and empty input), behavioral pattern extraction (tool sequences, debug loops, clarification, error recovery), eval generation (template output, policy classification, confidence factors), and coverage mapping (gap detection, severity ranking, recommendations).
All use the real evalTest('ALWAYS_PASSES', { name, prompt, assert }) API. All hook evals correctly derive phase from hook name (BeforeAgent → 'before', AfterAgent → 'after').
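The phase derivation mentioned above amounts to a prefix check. A minimal sketch, assuming a hypothetical `derivePhase` helper (the toolkit's own function name and its handling of hooks such as SessionEnd may differ):

```typescript
type HookPhase = "before" | "after";

// Derives the phase from the hook name's prefix. Names without a
// Before/After prefix are rejected rather than silently defaulted.
function derivePhase(hookName: string): HookPhase {
  if (hookName.startsWith("Before")) return "before";
  if (hookName.startsWith("After")) return "after";
  throw new Error(`Cannot derive phase from hook name: ${hookName}`);
}

console.log(derivePhase("BeforeAgent")); // "before"
console.log(derivePhase("AfterAgent")); // "after"
```

Throwing on unknown prefixes is a design choice for the sketch: a wrong phase in a generated eval is harder to spot than a generation failure.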
.gemini/skills/eval-author/SKILL.md
A Gemini CLI skill with YAML frontmatter, discoverable via /skills list and activatable via /skills reload. Provides guided eval authoring workflow with policy classification framework, common patterns, anti-patterns, and integration with /fix-behavioral-eval. This is the actual GSoC deliverable described in the project scope — no other candidate has built one.
.github/workflows/eval-coverage.yml
PR-triggered eval gap detection for changes to packages/core/src/hooks/**, packages/core/src/tools/**, and packages/core/src/prompt/**. Runs the gap analyzer and comments on PRs with critical coverage gaps.
| File | Purpose |
|---|---|
| CONTRIBUTING.md | End-to-end eval authoring workflow with /fix-behavioral-eval integration |
| MONOREPO_INTEGRATION.md | Real production file paths for hook runner, scheduler, transcript service |
Every generated eval includes a confidenceFactors array that makes the classification auditable:
{
policy: 'ALWAYS_PASSES',
stabilityScore: 0.9,
confidenceFactors: [
'Deterministic code path (+0.4)',
'Hook behavior is independent of model output (+0.3)',
'Known regression #21383 — behavior verified by fix (+0.2)',
'Single observation source (-0.1)',
]
}

ALWAYS_PASSES (stability ≥ 0.8): The behavior is fully determined by code paths. A failure is a bug. Run in CI on every PR. Examples: hook injection ordering, config propagation, transcript flushing.
USUALLY_PASSES (stability < 0.8): The behavior depends on model generation. A failure suggests quality regression. Run nightly. Examples: tool selection, clarification-seeking, debug loop ordering.
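The threshold rule can be written down in a few lines. This is a sketch of the rule as stated above, not the toolkit's implementation; the `classifyPolicy` and `scheduleFor` names and the `Schedule` type are hypothetical, and how stabilityScore itself is computed from the confidence factors is left out.

```typescript
type Policy = "ALWAYS_PASSES" | "USUALLY_PASSES";
type Schedule = "ci" | "nightly"; // hypothetical labels for where an eval runs

const STABILITY_THRESHOLD = 0.8;

// Scores at or above the threshold are treated as deterministic behavior.
function classifyPolicy(stabilityScore: number): Policy {
  if (Number.isNaN(stabilityScore)) {
    throw new RangeError("stabilityScore must be a number");
  }
  return stabilityScore >= STABILITY_THRESHOLD
    ? "ALWAYS_PASSES"
    : "USUALLY_PASSES";
}

// ALWAYS_PASSES evals gate every PR; USUALLY_PASSES evals run nightly.
function scheduleFor(policy: Policy): Schedule {
  return policy === "ALWAYS_PASSES" ? "ci" : "nightly";
}

console.log(classifyPolicy(0.9), scheduleFor(classifyPolicy(0.9)));
console.log(classifyPolicy(0.5), scheduleFor(classifyPolicy(0.5)));
```

The boundary is inclusive, so a score of exactly 0.8 classifies as ALWAYS_PASSES.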
.
├── .gemini/skills/eval-author/  ← Gemini CLI skill (discoverable)
│   └── SKILL.md
├── .github/workflows/  ← CI integration
│   └── eval-coverage.yml
├── src/
│   ├── types/index.ts  ← Type system mirroring gemini-cli internals
│   ├── inventory/
│   │   ├── evalScanner.ts  ← AST-based eval file scanner
│   │   └── coverageMapper.ts  ← Coverage diffing + gap identification
│   ├── generator/
│   │   ├── chatLogParser.ts  ← ChatRecordingService JSON parser
│   │   ├── evalTemplates.ts  ← 5 Vitest templates (real API shape)
│   │   └── evalGenerator.ts  ← Orchestration with confidence factors
│   ├── evals/hooks/
│   │   ├── beforeAgentAfterAgent.eval.ts
│   │   ├── beforeModel.eval.ts
│   │   ├── beforeTool.eval.ts
│   │   └── sessionLifecycle.eval.ts
│   ├── harness/
│   │   ├── evalTest.ts  ← Stub matching real test-helper.ts API
│   │   └── types.ts  ← EvalTestRig = unknown (honest, not guessed)
│   ├── cli.ts  ← CLI with --dry-run support
│   ├── demo.ts  ← End-to-end pipeline demo
│   └── index.ts  ← Public API
├── tests/toolkit.test.ts  ← 28 pipeline + robustness tests
├── fixtures/sample-chat-log.json  ← 25-message log triggering all 4 patterns
├── output/generated-evals/  ← 8 generated eval files
├── CONTRIBUTING.md  ← Eval authoring workflow
├── MONOREPO_INTEGRATION.md  ← Real production file paths
└── README.md
This prototype demonstrates deliverables for each phase of the proposed 12-week, 175-hour plan:
| Week | Deliverable | Prototype Status |
|---|---|---|
| 1–2 | Onboard to quality area, write hook lifecycle evals | ✅ 31 hook evals |
| 3–4 | Build eval coverage inventory + gap analyzer | ✅ AST scanner + coverage mapper |
| 5–7 | Chat log → eval pipeline, stabilize across model versions | ✅ Pipeline with all 4 pattern types |
| 8 | Contributor-facing skill for eval authoring | ✅ .gemini/skills/eval-author/SKILL.md |
| 9–10 | CI integration, /fix-behavioral-eval extension | ✅ Workflow + docs |
| 11–12 | Documentation, community dogfooding | ✅ CONTRIBUTING.md + MONOREPO_INTEGRATION.md |
- Merged: fix(hooks): fix BeforeAgent/AfterAgent inconsistencies (#21383)
- Merged: fix(core): flush transcript for pure tool-call responses (#20419)
- Merged: fix(trainer): handle falsy values in get_args_from_peft_config
- Open: fix(cli): YOLO mode override (#21541)
- Open: fix(core): partial llm_request in BeforeModel (#21239)
Apache-2.0