
evalbuff: Codebuff SDK integration, direct LLM API, and quality improvements #486

Merged
jahooma merged 13 commits into main from jahooma/evalbuff-quality
Mar 30, 2026

jahooma (Contributor) commented Mar 30, 2026

Summary

  • Wire evalbuff to use the Codebuff SDK (CodebuffClient + CodebuffRunner) instead of spawning CLI processes, with trace storage as JSON lines of PrintModeEvent steps
  • Add base2-free-evals agent (free tier + noAskUser: true) for evalbuff runs
  • Replace Claude CLI spawning for prompt generation and failure analysis with direct Anthropic API calls via Vercel AI SDK (@ai-sdk/anthropic + ai), ~2-5x faster
  • Use Sonnet 4.6 (claude-sonnet-4-6) as default model for LLM calls, no maxOutputTokens or temperature overrides
  • Filter trivial commits (version bumps, merge commits) before expensive LLM calls
  • Support local git clone via hardlinks for near-instant repo setup
  • Use average scores (not median) for reviewer aggregation to handle model scoring biases
  • Default parallelism=5 for statistically meaningful score comparisons
  • Remove overfit pattern docs — next step is cross-validation approach for doc generation
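
As a rough illustration of the average-versus-median point above, the sketch below compares the two statistics on made-up reviewer scores. The helper names and the sample numbers are hypothetical and are not taken from the evalbuff source; the sketch only shows that a median over a few reviewers ignores everything except the middle score, while a mean reflects every reviewer.

```typescript
// Hypothetical helpers for illustration; not the evalbuff implementation.
function mean(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores to aggregate");
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

function median(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores to aggregate");
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd length: middle element. Even length: average of the two middle elements.
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Made-up scores from three reviewer models for two agent runs.
const runA = [4, 7, 9];
const runB = [6, 7, 8];

// The medians are identical, so a median-based comparison sees no difference.
const medianA = median(runA);
const medianB = median(runB);

// The means differ, because the lowest and highest reviewers also count.
const meanA = mean(runA);
const meanB = mean(runB);
```

Under these assumed numbers the two runs tie on median but not on mean, which is one plausible reason to prefer averaging when individual reviewer models score with a consistent bias.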

Test plan

  • tsc --noEmit passes for evalbuff and root
  • Run evalbuff learn mode end-to-end and verify SDK-based agent execution
  • Verify prompt generation uses direct API (no CLI process spawning)

🤖 Generated with Claude Code

jahooma and others added 8 commits March 29, 2026 16:18
Replace CLI spawning with Codebuff SDK for agent execution and Vercel AI SDK
for LLM calls (5x faster prompt generation). Add base2-free-evals agent with
noAskUser. Use local git clones with hardlinks for near-instant repo setup.
Filter trivial commits, use average reviewer scores, inline traces into doc
writer prompts, and add adaptive improvement thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
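
The trivial-commit filter mentioned in this commit could look roughly like the sketch below. The `CommitInfo` shape, the function name, and the regular expressions are all assumptions made for illustration; the actual heuristics in evalbuff are not shown in this PR description.

```typescript
// Hypothetical commit shape; the real evalbuff type may differ.
interface CommitInfo {
  subject: string;     // first line of the commit message
  parentCount: number; // number of parent commits (more than 1 means a merge)
}

// Assumed patterns for version-bump subjects; illustrative only.
const VERSION_BUMP_PATTERNS: RegExp[] = [
  /^v?\d+\.\d+\.\d+$/,                 // subject is only a version, e.g. "v1.2.3"
  /\bbump\b.*\bversion\b/i,            // e.g. "Bump version to 1.2.3"
  /^(chore\(release\)|release):?\s/i,  // e.g. "release: 1.2.3"
];

function isTrivialCommit(commit: CommitInfo): boolean {
  // Merge commits have more than one parent.
  if (commit.parentCount > 1) return true;
  const subject = commit.subject.trim();
  return VERSION_BUMP_PATTERNS.some((pattern) => pattern.test(subject));
}

// Keep only commits worth sending to the (expensive) LLM steps.
function filterTrivialCommits(commits: CommitInfo[]): CommitInfo[] {
  return commits.filter((commit) => !isTrivialCommit(commit));
}
```

Running a cheap check like this before prompt generation means the LLM is only called for commits that are likely to contain a meaningful change.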
jahooma and others added 5 commits March 30, 2026 11:52
Delete all docs/patterns/** files generated by evalbuff — they overfit
to specific commits rather than teaching generalizable principles.
Simplify compareScores now that parallelism is always 5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes RunState type mismatches: sessionState: null → undefined,
fake session state objects cast as any, StreamStatus narrowing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
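
The `sessionState: null → undefined` change reflects a general TypeScript rule: with `strictNullChecks` enabled, an optional property accepts `undefined` but not `null`. The types below are hypothetical stand-ins for illustration and are not the SDK's actual `RunState` definition.

```typescript
// Hypothetical stand-ins; the real SDK types may differ.
interface SessionState {
  messageCount: number;
}

interface RunStateSketch {
  // Optional property: its type is SessionState | undefined, never null.
  sessionState?: SessionState;
}

// Under strictNullChecks the following line would be rejected, because
// 'null' is not assignable to 'SessionState | undefined':
// const bad: RunStateSketch = { sessionState: null };

function emptyRunState(): RunStateSketch {
  // Using undefined satisfies the type without casting to any.
  return { sessionState: undefined };
}

function messageCount(state: RunStateSketch): number {
  // Optional chaining plus ?? covers the missing-session case.
  return state.sessionState?.messageCount ?? 0;
}
```

Typing the empty case as `undefined` lets callers handle it with ordinary optional chaining instead of relying on objects cast as `any`.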
jahooma merged commit 1ca6b47 into main Mar 30, 2026
34 checks passed
jahooma deleted the jahooma/evalbuff-quality branch March 30, 2026 20:37