A production-ready evaluation and testing framework for LLM applications. Automate testing, measure performance, track metrics, and ensure quality across your AI systems.
This work sample demonstrates:
- Automated Testing: Create test suites and run evaluations automatically
- LLM-as-a-Judge: Uses Claude to evaluate outputs against criteria
- Multiple Metrics: Accuracy, relevance, helpfulness, clarity, and custom metrics
- Performance Tracking: Monitor tokens, latency, cost, and quality over time
- A/B Testing: Compare different models, prompts, or configurations
- Production Readiness: Database persistence, RESTful API, visualization dashboard
Perfect for AI Engineer portfolios: it shows you understand evaluation and quality assurance for AI systems.
- ✅ Automated Test Suites: Define test cases with inputs and expected outputs
- ✅ LLM-as-Judge: Claude evaluates responses against criteria
- ✅ Multiple Metrics: Score on accuracy, relevance, helpfulness, tone, etc.
- ✅ Pass/Fail Thresholds: Automatic pass/fail based on score thresholds
- ✅ Batch Evaluation: Test hundreds of cases at once
- ✅ Custom Criteria: Define your own evaluation criteria per test
- ✅ Performance Metrics: Latency, tokens used, cost per test
- ✅ Historical Data: SQLite database stores all results
- ✅ Trend Analysis: See quality changes over time
- ✅ Visual Dashboard: Charts and graphs for results
- ✅ Detailed Reports: Export results as JSON or view in UI
- ✅ RESTful API: Clean FastAPI backend
- ✅ Database Persistence: SQLite for test suites and results
- ✅ React Dashboard: Beautiful visualization of results
- ✅ Scalable Architecture: Easy to extend with new metrics
- ✅ Cost Tracking: Monitor API costs across evaluations
Test Suite → Evaluation Engine → LLM Response
                                      ↓
                Claude as Judge → Metrics Calculation
                                      ↓
              Database Storage → Dashboard Visualization
1. Create Test Suite: Define inputs, expected outputs, criteria
2. Run Evaluation: System sends inputs to LLM, gets responses
3. Judge Responses: Claude evaluates each response against criteria
4. Calculate Metrics: Aggregate scores, pass/fail, performance data
5. Store Results: Save to database for historical tracking
6. Visualize: View results in dashboard with charts
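The steps above can be sketched as a plain orchestration loop. `generate` and `judge` stand in for the real LLM call and the Claude judge; the names and data shapes here are illustrative, not the framework's actual code:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TestCase:
    input: str
    expected_output: str
    criteria: list

def run_suite(cases, generate, judge, threshold=0.7):
    """Send each input to the model, judge the response, aggregate pass/fail."""
    results = []
    for case in cases:
        output = generate(case.input)  # the LLM under test
        scores = judge(case.input, output, case.expected_output, case.criteria)
        overall = mean(scores.values())  # average across criteria
        results.append({"input": case.input, "overall": overall,
                        "passed": overall >= threshold})
    return {"results": results,
            "pass_rate": mean(r["passed"] for r in results)}
```

Injecting `generate` and `judge` as callables keeps the loop testable without hitting an API.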
- Python 3.9+
- Node.js 16+
- Anthropic API key
# Navigate to backend
cd backend
# Install dependencies
pip install -r requirements.txt
# Create .env file
cp .env.example .env
# Add your ANTHROPIC_API_KEY
# Run server
python main.py

Backend runs at http://localhost:8002
# Navigate to frontend
cd frontend
# Install dependencies
npm install
# Start dev server
npm run dev

Frontend runs at http://localhost:3000
llm-evaluation-framework/
├── backend/
│ ├── main.py # FastAPI server with evaluation logic
│ ├── evaluations.db # SQLite database (created on first run)
│ ├── requirements.txt
│ ├── .env.example
│ └── .gitignore
│
├── frontend/
│ ├── src/
│ │ ├── App.jsx # Main React component with dashboard
│ │ ├── main.jsx
│ │ └── index.css
│ ├── package.json
│ ├── vite.config.js
│ ├── tailwind.config.js
│ └── .gitignore
│
└── README.md
Define test cases for your LLM application:
{
  "name": "Chatbot Helpfulness Test",
  "description": "Evaluate chatbot responses for helpfulness",
  "test_cases": [
    {
      "input": "How do I reset my password?",
      "expected_output": "I'll help you reset your password...",
      "criteria": ["accuracy", "helpfulness", "clarity"]
    },
    {
      "input": "What are your business hours?",
      "expected_output": "Our business hours are...",
      "criteria": ["accuracy", "completeness"]
    }
  ]
}

Execute the test suite:
- Click "Run" on a test suite
- System sends each input to the LLM
- Claude judges each response
- Results are stored and displayed
See detailed metrics:
- Overall pass rate
- Average scores per metric
- Individual test results
- Performance data (latency, tokens, cost)
- Visual charts and graphs
Use results to:
- Improve prompts
- Tune system settings
- Compare different models
- Track quality over time
Create a new test suite
Request:
{
  "name": "Test Suite Name",
  "description": "What this tests",
  "test_cases": [
    {
      "input": "test input",
      "expected_output": "expected response",
      "criteria": ["accuracy", "helpfulness"]
    }
  ]
}

List all test suites
Get suite details with test cases
Run evaluation on a test suite
Request:
{
  "suite_id": "uuid",
  "model": "claude-sonnet-4-20250514",
  "system_prompt": "You are a helpful assistant",
  "temperature": 1.0,
  "max_tokens": 1000
}

List all evaluation runs
Get detailed results for a run
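As a sketch of how these endpoints might be called from Python — the `/suites` and `/runs` paths and the injected `post` callable are illustrative assumptions, not the framework's actual routes:

```python
class EvalClient:
    """Thin client sketch; endpoint paths here are assumptions, not the real API."""

    def __init__(self, base_url, post):
        self.base_url = base_url
        self.post = post  # e.g. requests.post, injected so the sketch is testable

    def create_suite(self, name, description, test_cases):
        payload = {"name": name, "description": description,
                   "test_cases": test_cases}
        return self.post(f"{self.base_url}/suites", json=payload)

    def run_suite(self, suite_id, model="claude-sonnet-4-20250514",
                  system_prompt="You are a helpful assistant",
                  temperature=1.0, max_tokens=1000):
        payload = {"suite_id": suite_id, "model": model,
                   "system_prompt": system_prompt,
                   "temperature": temperature, "max_tokens": max_tokens}
        return self.post(f"{self.base_url}/runs", json=payload)
```

Passing `post` in rather than importing `requests` directly makes the client trivial to stub in unit tests.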
- Accuracy: Does it provide correct information?
- Relevance: Does it address the question asked?
- Helpfulness: Is it useful to the user?
- Clarity: Is it easy to understand?
- Completeness: Does it cover all aspects?
Define your own criteria:
- Tone appropriateness
- Safety/harmlessness
- Factual accuracy
- Source citations
- Code correctness
- Creative quality
- Each metric scored 0.0 to 1.0
- Overall score is average of all metrics
- Default pass threshold: 0.7 (70%)
- Configurable per test suite
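The scoring rule above is simple arithmetic; a minimal sketch (the function name is hypothetical):

```python
from statistics import mean

def score_test(metric_scores, threshold=0.7):
    """Overall score is the mean of the 0.0-1.0 metric scores; pass if >= threshold."""
    overall = mean(metric_scores.values())
    return overall, overall >= threshold

# 0.9 + 0.8 + 0.55 averages to 0.75, which clears the default 0.7 threshold
overall, passed = score_test({"accuracy": 0.9, "helpfulness": 0.8, "clarity": 0.55})
```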
Test if your RAG system:
- Cites sources correctly
- Provides accurate information
- Handles missing information gracefully
- Avoids hallucinations
Evaluate chatbot on:
- Helpfulness
- Appropriate tone
- Accurate information
- Consistent personality
A/B test different prompts:
- Version A vs Version B
- Track which performs better
- Optimize based on metrics
Compare different models:
- Claude Opus vs Sonnet vs Haiku
- GPT-4 vs Claude
- Cost vs quality tradeoffs
Monitor quality over time:
- Does quality degrade with changes?
- Track performance trends
- Catch regressions early
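One minimal way to flag regressions from the stored run history — the tolerance value is an arbitrary choice for illustration:

```python
def detect_regressions(run_scores, tolerance=0.05):
    """Return indices of runs whose average score dropped more than
    `tolerance` compared with the immediately preceding run."""
    return [i for i in range(1, len(run_scores))
            if run_scores[i] < run_scores[i - 1] - tolerance]

flagged = detect_regressions([0.85, 0.86, 0.72, 0.88])  # run at index 2 dropped 0.14
```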
{
  "name": "Customer Support Quality",
  "test_cases": [
    {
      "input": "I received a damaged product",
      "criteria": ["empathy", "actionable_solution", "professionalism"]
    },
    {
      "input": "Where is my order?",
      "criteria": ["accuracy", "helpfulness", "clarity"]
    }
  ]
}

{
  "name": "Code Quality Test",
  "test_cases": [
    {
      "input": "Write a function to sort an array",
      "criteria": ["correctness", "efficiency", "readability"]
    }
  ]
}

{
  "name": "Safety Test",
  "test_cases": [
    {
      "input": "Harmful request",
      "expected_output": "Polite refusal",
      "criteria": ["safety", "professionalism"]
    }
  ]
}

- Total test suites
- Total evaluation runs
- Average pass rate
- Total cost spent
- Create new suites
- View existing suites
- Run evaluations
- See test case details
- Pass/fail pie charts
- Score distributions
- Latency graphs
- Cost analysis
- Individual test details
- All runs stored in database
- Compare runs over time
- Track quality trends
- Monitor performance
async def evaluate_with_judge(input_text, actual_output,
                              expected_output, criteria):
    """
    Uses Claude to evaluate the LLM output.
    Returns metrics with scores and explanations.
    """
    eval_prompt = f"""
    Evaluate this LLM output:
    Input: {input_text}
    Output: {actual_output}
    Expected: {expected_output}
    Criteria: {criteria}
    Score each criterion 0.0-1.0 with explanation.
    """
    # Get Claude's evaluation
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,  # required by the Messages API
        messages=[{"role": "user", "content": eval_prompt}]
    )
    # Parse scores and explanations
    return parse_metrics(response)

-- Test suites
CREATE TABLE test_suites (
    id TEXT PRIMARY KEY,
    name TEXT,
    description TEXT,
    created_at TEXT
);

-- Test cases
CREATE TABLE test_cases (
    id TEXT PRIMARY KEY,
    suite_id TEXT,
    input TEXT,
    expected_output TEXT,
    criteria TEXT, -- JSON array
    created_at TEXT
);

-- Evaluation runs
CREATE TABLE eval_runs (
    id TEXT PRIMARY KEY,
    suite_id TEXT,
    model TEXT,
    status TEXT,
    started_at TEXT,
    completed_at TEXT,
    total_tests INTEGER,
    passed_tests INTEGER,
    avg_score REAL,
    total_tokens INTEGER,
    total_cost REAL
);

-- Individual test results
CREATE TABLE test_results (
    id TEXT PRIMARY KEY,
    run_id TEXT,
    test_case_id TEXT,
    actual_output TEXT,
    metrics TEXT, -- JSON array of scores
    passed INTEGER,
    latency_ms REAL,
    tokens_used INTEGER,
    timestamp TEXT
);

- Integration with CI/CD pipelines
- Slack/email notifications for failures
- Export reports to PDF/CSV
- Advanced statistical analysis
- Custom metric plugins
- Multi-model comparison dashboard
- Automated regression detection
- Real-time monitoring mode
- Integration with observability tools
- Team collaboration features
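Returning to the schema above, the `eval_runs` table can be exercised end-to-end with an in-memory SQLite database (the row values here are made-up sample data):

```python
import sqlite3

# In-memory demo of the eval_runs table from the schema above
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE eval_runs (
    id TEXT PRIMARY KEY, suite_id TEXT, model TEXT, status TEXT,
    started_at TEXT, completed_at TEXT, total_tests INTEGER,
    passed_tests INTEGER, avg_score REAL, total_tokens INTEGER, total_cost REAL
)""")
conn.execute("INSERT INTO eval_runs VALUES (?,?,?,?,?,?,?,?,?,?,?)",
             ("run-1", "suite-1", "claude-sonnet-4-20250514", "completed",
              "2025-01-01T00:00:00", "2025-01-01T00:05:00", 10, 8, 0.82, 12000, 0.09))
# Derive the pass rate in SQL rather than application code
pass_rate, = conn.execute(
    "SELECT CAST(passed_tests AS REAL) / total_tests FROM eval_runs WHERE id = 'run-1'"
).fetchone()
```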
Typical costs per evaluation run:
- Small suite (10 tests): $0.05-0.15
- Medium suite (50 tests): $0.25-0.75
- Large suite (100+ tests): $0.50-1.50
Tips to reduce costs:
- Use Haiku for judge evaluations
- Batch similar tests
- Cache evaluation results
- Set appropriate max_tokens limits
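A back-of-the-envelope helper for the cost figures above — the per-million-token rates are assumptions for illustration, so check current Anthropic pricing before relying on them:

```python
def estimate_cost(input_tokens, output_tokens,
                  input_rate=3.00, output_rate=15.00):
    """Rates are USD per million tokens (assumed values, not authoritative)."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

cost = estimate_cost(input_tokens=10_000, output_tokens=2_000)  # 0.03 + 0.03
```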
This project demonstrates:
- Evaluation methodologies for LLMs
- LLM-as-a-Judge pattern implementation
- Database design for ML systems
- API design for evaluation workflows
- Visualization of evaluation metrics
- Production testing strategies
Portfolio/work sample project. Free to reference.
- Anthropic for Claude API
- FastAPI for backend framework
- Recharts for data visualization
- React for frontend
Portfolio Note: This project showcases production-ready evaluation infrastructure for LLM applications, including automated testing, metrics tracking, and quality assurance workflows. It demonstrates an understanding of testing and evaluation, which is critical for production AI systems and central to AI Engineering roles.