A visual workbench for A/B testing AI skills. Upload two skill files, run them through a batch of test prompts, and let an AI judge score the results.
I built this to improve Anthropic's frontend design skill and prove the improvements with data. That project was hardcoded to frontend evaluation, so I extracted the core evaluation engine and made it extensible—now you can 1v1 any two skills, in any domain, with any model.
For the full story on how this started and why data-driven skill development matters, see Teaching Claude to Design Better.
This project is a spiritual successor to my work on Bloom GUI, a visual interface for Anthropic's behavioral evaluation framework. Building that taught me a lot about what makes evaluations useful—clear metrics, comparative analysis, and good visualization. I brought that learning (and the design language I developed there) into Skill Evaluator.
GUI by Justin Wetch
Configure — Upload two skill files (A and B) and set up your evaluation. Click the ✨ Generate button to have AI analyze your skills and create appropriate evaluation criteria and test prompts automatically. Choose your output type (text, visual, or both) based on what your skills produce.
Evaluate — Select your model and run all prompts through both skills simultaneously. Each prompt generates two outputs (one from Skill A, one from Skill B) which you can compare side-by-side. The status badges show progress in real-time.
Results — After generation, run the judge to score both outputs against your criteria. The summary view shows overall wins/losses, while the detailed breakdown reveals per-criterion scores and patterns.
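The exact shape of the judge's output is internal to the app, but purely as an illustration, a per-criterion verdict for one prompt might look something like this (hypothetical field names, not the app's actual schema):

```json
{
  "prompt": "Write a query to find the top 10 customers by revenue",
  "scores": {
    "correctness":  { "a": 4, "b": 5 },
    "readability":  { "a": 3, "b": 4 },
    "performance":  { "a": 2, "b": 5 }
  },
  "winner": "B",
  "rationale": "B uses an indexed join and avoids a correlated subquery."
}
```

The Summary tab aggregates winners across prompts; the Detailed Breakdown surfaces the per-criterion numbers.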
Settings — Configure your Anthropic API key and set default models for generation and judging. Models can be overridden per-evaluation.
- Node.js 18+
- Anthropic API key (get one here)
```bash
git clone https://github.com/justinwetch/SkillEval.git
cd SkillEval/app
npm install
npm run dev
```

Open http://localhost:5173 in your browser.
- Go to Settings and add your Anthropic API key
- Go to Configure and upload two skill files
- Click Generate All from Skills to auto-generate criteria and prompts
- Go to Evaluate and click Run All Evals
- Once generation completes, click Judge All to score the outputs
- Review the results in Summary and Detailed Breakdown tabs
Choose which model to evaluate your skills with. Different models may respond differently to the same skill instructions, so pick the one most relevant to your use case.
| Model | Characteristics |
|---|---|
| Sonnet 4.6 | Balanced capability and speed (default) |
| Opus 4.6 | Most capable, highest quality outputs (default judge) |
| Haiku 4.5 | Fast, cost-effective, good for iteration |
| Sonnet 4.5 | Legacy, still available |
| Opus 4.5 | Legacy, still available |
For judging, Opus 4.6 is recommended since it's the most capable model and provides the most nuanced scoring. Sonnet 4.6 is the default for generation. The 4.5 models remain available for comparison or cost-sensitive workflows.
For documentation on how to write effective skill files, see Claude Code Skills.
The test-skills/ folder contains example skill files you can use as reference:
- `sql-skill-a.md` — Basic SQL query generation skill
- `sql-skill-b.md` — Advanced SQL skill with optimization focus
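If you're writing your own, a skill file is just markdown instructions. A minimal, hypothetical example (not one of the bundled files, and the frontmatter fields are an assumption based on the Claude Skills convention):

```markdown
---
name: sql-optimizer
description: Generate efficient SQL with indexing and query plans in mind
---

When writing SQL:
- Prefer explicit JOINs over implicit comma joins.
- Avoid SELECT *; list only the columns you need.
- Note any index that would speed up the query.
```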
For visual evaluations (HTML/CSS skills that produce rendered output), you'll need the screenshot server:
```bash
# From the project root
node screenshot-server.js
```

The server runs on port 3001 and uses Puppeteer to capture screenshots of rendered HTML. If you're evaluating text-only skills (code, SQL, writing, etc.), you don't need this.
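For context on what the screenshot server is doing, here's a minimal sketch of the Puppeteer capture technique (hypothetical, not the server's actual code; assumes `npm install puppeteer` has been run). The `require` is deferred into the function so the file loads even without Puppeteer installed:

```javascript
// Sketch: render an HTML string in headless Chromium and save a PNG.
async function captureHtml(html, outPath) {
  const puppeteer = require('puppeteer'); // lazy require (see note above)
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so styles and fonts have loaded.
    await page.setContent(html, { waitUntil: 'networkidle0' });
    await page.screenshot({ path: outPath, fullPage: true });
  } finally {
    await browser.close();
  }
}
```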
SkillEval/
├── app/ # React application
│ ├── src/
│ │ ├── components/ # UI components
│ │ ├── contexts/ # React context (config, run state, settings)
│ │ ├── utils/ # Core logic (API, eval, judging)
│ │ └── views/ # Page components
│ └── package.json
├── test-skills/ # Example skill files
├── screenshots/ # GUI screenshots
├── screenshot-server.js # Optional visual evaluation server
├── ARCHITECTURE.md # Detailed architecture docs
└── DESIGN_SYSTEM.md # UI design system specs
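Under the hood, the code in `utils/` talks to the Anthropic Messages API. As a rough, hypothetical sketch of what a single eval call could look like (not the project's actual code; treating the skill file as the system prompt is an assumption), using the documented REST endpoint and Node 18+'s built-in `fetch`:

```javascript
// Send one test prompt through one skill and return the model's text reply.
// Needs a valid Anthropic API key to actually run.
async function runSkill(apiKey, model, skillText, prompt) {
  const res = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'x-api-key': apiKey,
      'anthropic-version': '2023-06-01',
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model,                 // e.g. the model picked on the Evaluate tab
      max_tokens: 4096,
      system: skillText,     // assumption: skill file supplied as system prompt
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`Anthropic API error ${res.status}`);
  const data = await res.json();
  return data.content[0].text; // first text block of the response
}
```

Running the same prompt through Skill A and Skill B just means calling this twice with different `skillText`, which is what the side-by-side comparison in the Evaluate view is built on.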
Contributions welcome! Open an issue for bugs, feature requests, or to share interesting skill files you've tested.
MIT
Thanks to Anthropic for the Skills framework and Claude models.
Built by Justin Wetch