A curated list of tools, methods & platforms for evaluating AI reliability in real applications
Comprehensive AI Model Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
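For context on the aggregation technique named above, here is a minimal sketch of a generalized power mean used as a temperature-controlled verdict aggregator. The parameter name `p`, the score range, and the judge-verdict format are illustrative assumptions, not the framework's actual API.

```python
# A minimal sketch (not the project's code) of temperature-controlled verdict
# aggregation via the generalized power mean; the "temperature" parameter p is
# an illustrative assumption about how strictness is controlled.
import math

def power_mean(scores: list[float], p: float, eps: float = 1e-9) -> float:
    """Generalized power mean M_p = ((1/n) * sum(x_i ** p)) ** (1/p).

    Large negative p approaches min(scores) (strict aggregation),
    p = 1 is the arithmetic mean, and large positive p approaches
    max(scores) (lenient aggregation). p near 0 falls back to the
    geometric mean, the limit of M_p as p -> 0.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if abs(p) < eps:  # geometric-mean limit at p = 0
        return math.exp(sum(math.log(max(s, eps)) for s in scores) / len(scores))
    return (sum(max(s, eps) ** p for s in scores) / len(scores)) ** (1.0 / p)

# Three judge verdicts scored in [0, 1]:
verdicts = [0.9, 0.7, 0.4]
print(power_mean(verdicts, p=-5.0))  # strict: pulled toward the weakest verdict
print(power_mean(verdicts, p=1.0))   # plain arithmetic mean
print(power_mean(verdicts, p=5.0))   # lenient: pulled toward the strongest verdict
```

Sweeping p from negative to positive lets a single knob trade off "every judge must agree" against "any judge passing is enough."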
A comprehensive, implementation-focused guide to evaluating Large Language Models, RAG systems, and Agentic AI in production environments.
[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?
Deterministic runtime for agent evaluation
Test and evaluate Large Language Models against prompt injections, jailbreaks, and adversarial attacks with a web-based interactive lab.
prompt-evaluator is an open-source toolkit for evaluating, testing, and comparing LLM prompts. It provides a GUI-driven workflow for running prompt tests, tracking token usage, visualizing results, and ensuring reliability across models such as OpenAI's GPT, Anthropic's Claude, and Google's Gemini.
Evaluate AI systems effectively with our comprehensive guide to methods, tools, and frameworks for assessing Large Language Models and agents.
VEX-HALT: Benchmark suite for AI verification systems. 443+ tests for calibration, robustness, honesty, and proof integrity.
VerifyAI is a simple UI application to test GenAI outputs
Multi-dimensional evaluation of AI responses using semantic alignment, conversational flow, and engagement metrics.
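As a rough illustration of what multi-dimensional response scoring can look like, the sketch below combines per-dimension scores into a weighted composite. The dimensions, weights, and bag-of-words cosine are assumptions standing in for the project's real metrics (which would typically use sentence embeddings and learned heuristics).

```python
# Toy multi-dimensional response scoring; dimensions and weights are illustrative.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Crude semantic-alignment proxy: cosine similarity of word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def score_response(reference: str, response: str) -> dict[str, float]:
    """Combine several per-dimension scores into one weighted composite."""
    dims = {
        "semantic_alignment": cosine(reference, response),
        "conversational_flow": 1.0 if response.strip().endswith((".", "?", "!")) else 0.5,
        "engagement": min(len(response.split()) / 30.0, 1.0),  # capped length proxy
    }
    weights = {"semantic_alignment": 0.6, "conversational_flow": 0.2, "engagement": 0.2}
    dims["composite"] = sum(weights[k] * dims[k] for k in weights)
    return dims

print(score_response("Paris is the capital of France.",
                     "The capital of France is Paris."))
```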
Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.
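The LLM-as-a-judge pattern Pondera describes can be sketched as below. The judge prompt, 1-5 rubric, and score parsing are illustrative assumptions, not Pondera's actual YAML schema or runner interface; the key idea is that any `call_llm(prompt) -> text` callable can be plugged in as the runner.

```python
# A hypothetical sketch of LLM-as-a-judge with a pluggable runner.
from typing import Callable

JUDGE_PROMPT = """Rate the answer below from 1 (poor) to 5 (excellent) for
correctness and helpfulness. Reply with a single integer.

Question: {question}
Answer: {answer}"""

def judge(question: str, answer: str, call_llm: Callable[[str], str]) -> int:
    """Score an answer using any pluggable call_llm(prompt) -> text runner."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    digits = [c for c in reply if c.isdigit()]
    if not digits:
        raise ValueError(f"judge returned no score: {reply!r}")
    return max(1, min(5, int(digits[0])))  # clamp to the 1-5 rubric

# Usage with a stub runner; swap in a real provider client in practice.
score = judge("What is 2 + 2?", "4", call_llm=lambda prompt: "5")
print(score)
```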
Public Driftmap harness: public-safe CSV suites + rubrics + run logs for drift detection, refusal integrity, injection resistance, and uncertainty tracking.
Sandbox platform for testing and evaluating autonomous agents
Structural Reliability Evaluation Report and Supporting Artefacts
Web app & CLI for benchmarking LLMs via OpenRouter. Test multiple models simultaneously with custom benchmarks, live progress tracking, and detailed results analysis.
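For orientation, a rough sketch of benchmarking several models through OpenRouter's OpenAI-compatible chat completions endpoint follows. The model slugs, prompt, and timing logic are illustrative assumptions, not the project's code.

```python
# Sketch: query multiple models via OpenRouter and compare answers and latency.
import os
import time
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def ask(model: str, prompt: str) -> tuple[str, float]:
    """Return the model's reply and wall-clock latency in seconds."""
    start = time.time()
    resp = requests.post(API_URL, headers=HEADERS, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=120)
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    return text, time.time() - start

# Run one benchmark question against several models (slugs are examples only).
for model in ["openai/gpt-4o-mini", "anthropic/claude-3.5-haiku"]:
    answer, latency = ask(model, "Name the capital of France in one word.")
    print(f"{model}: {answer.strip()!r} ({latency:.1f}s)")
```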
Official public release of MirrorLoop Core (v1.3, April 2025)
Clinical-trial application for benchmark evaluation of AI responses to mental health prompts in multi-turn conversations. Guides users in understanding AI interaction patterns and working through personal mental health issues with therapeutic AI assistance.
Run efficient evaluations for prompt and LLM regression testing with this lightweight, secret-free evaluation harness.
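A minimal sketch of the secret-free regression idea: golden cases live in the repo, and the model call is injected so the harness itself never holds API keys. The case format and pass criterion below are assumptions, not the project's actual schema.

```python
# Sketch of a prompt regression harness with injected model calls (no secrets).
from typing import Callable

GOLDEN_CASES = [
    {"prompt": "Translate 'bonjour' to English.", "must_contain": "hello"},
    {"prompt": "What is 12 * 12?", "must_contain": "144"},
]

def run_regression(call_llm: Callable[[str], str]) -> list[dict]:
    """Run every golden case and report pass/fail per prompt."""
    results = []
    for case in GOLDEN_CASES:
        output = call_llm(case["prompt"])
        passed = case["must_contain"].lower() in output.lower()
        results.append({"prompt": case["prompt"], "passed": passed})
    return results

# Usage with a stub model; wire in a real client outside the harness.
report = run_regression(lambda p: "hello / 144")
print(all(r["passed"] for r in report), report)
```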