evalbuff: Codebuff SDK integration, direct LLM API, and quality improvements by jahooma · Pull Request #486 · CodebuffAI/codebuff

jahooma · 2026-03-30T18:20:18Z

Summary

Wire evalbuff to use the Codebuff SDK (CodebuffClient + CodebuffRunner) instead of spawning CLI processes, and store traces as JSON Lines of PrintModeEvent steps
Add base2-free-evals agent (free tier + noAskUser: true) for evalbuff runs
Replace Claude CLI spawning for prompt generation and failure analysis with direct Anthropic API calls via Vercel AI SDK (@ai-sdk/anthropic + ai), ~2-5x faster
Use Sonnet 4.6 (claude-sonnet-4-6) as default model for LLM calls, no maxOutputTokens or temperature overrides
Filter trivial commits (version bumps, merge commits) before expensive LLM calls
Support local git clone via hardlinks for near-instant repo setup
Use average scores (not median) for reviewer aggregation to handle model scoring biases
Default parallelism=5 for statistically meaningful score comparisons
Remove overfit pattern docs — next step is cross-validation approach for doc generation
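
The commit filter mentioned in the summary could look roughly like the sketch below. It is not the code from this PR: the `CommitInfo` shape, the function names, and the subject patterns are all assumptions chosen for illustration, and a real filter would likely need patterns tuned to the repository.

```typescript
// Hypothetical commit shape; evalbuff's real type may differ.
interface CommitInfo {
  sha: string
  subject: string // first line of the commit message
  parentCount: number // 2 or more means a merge commit
}

// Assumed heuristics for "trivial" subjects (not taken from the PR).
const TRIVIAL_SUBJECT_PATTERNS: RegExp[] = [
  /^merge\b/i, // e.g. "Merge remote-tracking branch 'origin/main' into ..."
  /^(bump|release)\b/i, // e.g. "Bump version to 1.2.3"
  /^chore\(release\)/i, // conventional-commit style release commits
  /^v?\d+\.\d+\.\d+$/, // a bare version string such as "1.0.4"
]

function isTrivialCommit(commit: CommitInfo): boolean {
  // Merge commits are skipped regardless of their subject line.
  if (commit.parentCount >= 2) return true
  const subject = commit.subject.trim()
  return TRIVIAL_SUBJECT_PATTERNS.some((pattern) => pattern.test(subject))
}

// Keeps only commits worth sending to the (expensive) LLM steps.
function filterTrivialCommits(commits: CommitInfo[]): CommitInfo[] {
  return commits.filter((commit) => !isTrivialCommit(commit))
}
```

Running the filter before any LLM call keeps the cost of skipping a commit to a few regex tests.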

Test plan

tsc --noEmit passes for evalbuff and root
Run evalbuff learn mode end-to-end and verify SDK-based agent execution
Verify prompt generation uses direct API (no CLI process spawning)

🤖 Generated with Claude Code

Replace CLI spawning with Codebuff SDK for agent execution and Vercel AI SDK for LLM calls (5x faster prompt generation). Add base2-free-evals agent with noAskUser. Use local git clones with hardlinks for near-instant repo setup. Filter trivial commits, use average reviewer scores, inline traces into doc writer prompts, and add adaptive improvement thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
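
As a rough illustration of the hardlink-based clone mentioned in this commit, the sketch below uses `git clone --local`, which hardlinks the files under `.git/objects` when possible instead of copying them. The function names, the `--no-checkout` step, and the detached checkout are assumptions made for this example, not the PR's actual implementation.

```typescript
import { execFileSync } from 'node:child_process'

// Builds the argv for a local clone. With --local, git hardlinks object
// files when source and destination allow it, so no object data is copied.
function buildLocalCloneArgs(sourceRepo: string, destDir: string): string[] {
  return ['clone', '--local', '--no-checkout', sourceRepo, destDir]
}

// Hypothetical helper: clone a local repo and check out one commit in the copy.
function cloneLocalAtCommit(sourceRepo: string, destDir: string, sha: string): void {
  execFileSync('git', buildLocalCloneArgs(sourceRepo, destDir), { stdio: 'inherit' })
  // Detached checkout so the working tree matches the commit under evaluation.
  execFileSync('git', ['-C', destDir, 'checkout', '--detach', sha], { stdio: 'inherit' })
}
```

Passing arguments as an array to `execFileSync` avoids shell quoting issues with paths that contain spaces.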

Delete all docs/patterns/** files generated by evalbuff — they overfit to specific commits rather than teaching generalizable principles. Simplify compareScores now that parallelism is always 5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
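
To make the mean-based aggregation concrete, here is a minimal sketch of comparing two sets of parallel runs by their average score. `compareMeanScores`, its result shape, and the fixed `minDelta` threshold are hypothetical; the real `compareScores` and the adaptive thresholds mentioned in an earlier commit may work differently.

```typescript
function mean(values: number[]): number {
  if (values.length === 0) throw new Error('mean of empty list')
  return values.reduce((sum, value) => sum + value, 0) / values.length
}

function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of empty list')
  const sorted = [...values].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

interface Comparison {
  baselineMean: number
  candidateMean: number
  delta: number
  improved: boolean
}

// Hypothetical comparison: the candidate counts as an improvement only if its
// mean score beats the baseline mean by at least minDelta (an assumed constant).
function compareMeanScores(baseline: number[], candidate: number[], minDelta = 0.5): Comparison {
  const baselineMean = mean(baseline)
  const candidateMean = mean(candidate)
  const delta = candidateMean - baselineMean
  return { baselineMean, candidateMean, delta, improved: delta >= minDelta }
}
```

Unlike the median, the mean moves whenever any single score moves, so every reviewer and every one of the five parallel runs contributes to the comparison.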


jahooma and others added 8 commits March 29, 2026 16:18

evalbuff: add patterns/task-completion-validation.md (fde408c)

a5b5a2b

evalbuff: add patterns/template-literal-escaping.md (6d8bf39)

1d598f0

evalbuff: add patterns/task-scope-adherence.md (6d8bf39)

624c237

evalbuff: add patterns/task-scope-adherence.md (6d8bf39)

b62f461

evalbuff: add patterns/task-type-identification.md (fde408c)

c8da981

evalbuff: add patterns/implementation-validation.md (fde408c)

0596fdc

evalbuff: add patterns/existing-implementation-validation.md (fde408c)

694ae0b

jahooma requested review from brandonkachen and charleslien as code owners March 30, 2026 18:20

jahooma and others added 5 commits March 30, 2026 11:52

evalbuff: use sonnet 4.6, remove maxOutputTokens and temperature

6a7de4c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix model ID and clean up parallelism comments

af14ae3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into jahooma/evalbuff-quality

263937a

Fix type errors in send-message tests after merge from main

2c7b1b6

Fixes RunState type mismatches: sessionState: null → undefined, fake session state objects cast as any, StreamStatus narrowing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jahooma merged commit 1ca6b47 into main Mar 30, 2026
34 checks passed

jahooma deleted the jahooma/evalbuff-quality branch March 30, 2026 20:37

jahooma merged 13 commits into main from jahooma/evalbuff-quality
