Build software better, together

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Mar 27, 2026
Python

PacificAI / langtest

Star

Deliver safe & effective language models

nlp artificial-intelligence benchmarks benchmark-framework model-assessment ai-safety mlops responsible-ai ml-safety trustworthy-ai ethics-in-ai ml-testing large-language-models llm ai-testing llm-test llm-evaluation-toolkit llm-as-evaluator llm-testing

Updated Mar 26, 2026
Python

Addepto / contextcheck

Star

MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via YAML and integrable into CI pipelines for automated testing.

open-source ci testing-tools chatbot-framework testing-framework chatbot-testing rag ai-chat large-language-models llm ai-testing llm-evaluation llm-evaluation-framework prompt-test llm-testing ai-testing-tool generative-ai-testing rag-testing summarization-testing

Updated Dec 11, 2024
Python

srvsngh99 / genai-testing-journey

Star

52-week journey from QA/SDET to GenAI Testing - learning in public with weekly mini-projects, code, and honest documentation of struggles and wins.

python test-automation qa-engineering learning-in-public prompt-engineering ai-testing genai llm-testing 52-week-challenge

Updated Mar 17, 2026
Python

kdunee / intentguard

Sponsor

Star

A Python library for verifying code properties using natural language assertions.

testing natural-language test-automation pytest unittest code-quality language-models code-verification llm ai-testing

Updated Mar 1, 2025
Python

hemangjoshi37a / claude-code-frontend-dev

Star

🚀 First multimodal AI-powered visual testing plugin for Claude Code. AI that can SEE your UI! 10x faster frontend development with closed-loop testing, browser automation, and Claude 4.5 Sonnet vision.

Updated Jan 25, 2026
Python

onerun-ai / onerun

Star

Open-source framework for stress-testing LLMs and conversational AI. Identify hallucinations, policy violations, and edge cases with scalable, realistic simulations. Join the discord: https://discord.gg/ssd4S37WNW

security ai simulation chatbot ai-agents ai-testing llm-testing chatbot-simulation

Updated Sep 15, 2025
Python

alepot55 / agentrial

Star

Statistical evaluation framework for AI agents

python testing ci-cd pytest confidence-intervals quality-assurance non-deterministic ai-agents mlops statistical-testing llm ai-testing llm-evaluation agent-evaluation

Updated Feb 6, 2026
Python

monkscode / Natural-Language-to-Robot-Framework

Star

Turn plain English into Robot Framework files with AI. No dependencies, no hassle — just validated, ready-to-run tests

python docker open-source natural-language-processing selenium test-automation quality-assurance robotframework automation-framework software-testing fastapi large-language-models generative-ai ai-testing agentic-framework llm-applications nlp-to-code

Updated Mar 29, 2026
Python

adhit-r / fairmind

Star

Ethical AI Governance Platform | Bias Detection | Compliance | Fairness Testing for ML, LLM & Multimodal AI | Open Source

Updated Mar 29, 2026
Python

jhd3197 / Prompture

Sponsor

Star

Prompture is an API-first library for requesting structured JSON output from LLMs (or any structure), validating it against a schema, and running comparative tests between models.

openai toon json-validation structured-output pydantic llm prompt-engineering ai-testing prompt-testing prompture

Updated Mar 13, 2026
Python

Harshit-J004 / toolguard

Star

The "Cloudflare for AI Agents". 6-layer security interceptor, real-time observability dashboard, and automated reliability testing for MCP and AI tool chains. Prevent hallucinations, prompt injection, and destructive tool calls.

Updated Mar 29, 2026
Python

ChiufungLee / RAG_TestCases_Generator

Star

AI 测试用例生成系统。基于 DeepSeek + 百炼部署的 RAG 知识库，包含需求分析、测试用例生成、智能运维助手、产品指南等内容

testcase testcase-generator rag ai-tools ai-testing ai-testing-tool ai-test-generator ai-test-case-generator

Updated Mar 28, 2026
Python

shivalimittal123 / astraforge.io

Star

Updated Dec 25, 2025
Python

nfodor / mcp-chromium-arm64

Star

🚀 ARM64 Browser Automation for Claude Code - SaaS testing on 80 Raspberry Pi budget. The first solution that works where Playwright/Puppeteer fail on ARM64. Autonomous testing without human debugging.

nodejs raspberry-pi mcp arm64 browser-automation startup-tools ai-testing claude-code saas-testing budget-ai

Updated Feb 25, 2026
Python

yukincom / llm-SugarScape

Star

Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in a grid world.This project aims to replicate a paper published in 2025 (arXiv:2508.12920).

python simulation alignment agent-based-modeling grok sugarscape aisafety llm ai-testing llm-eval llm-evaluation llm-testing grok-api xai-api

Updated Nov 28, 2025
Python

isagawa-qa / platform-selenium

Star

AI Execution Management for Test Automation — 5-layer Selenium architecture with self-building, self-improving enforcement via the Isagawa Kernel

python selenium test-automation qa-automation ai-testing claude-code ai-execution-management

Updated Mar 28, 2026
Python

sazed5055 / llmtest

Star

pytest for LLM apps - Test for grounding failures, prompt injection, safety violations, and regressions

python testing machine-learning ai ci-cd pytest developer-tools quality-assurance claude qa-automation llm prompt-engineering chatgpt ai-testing anthropic llm-testing llm-validation llmtesting ai-agent-testing

Updated Mar 18, 2026
Python

shyinlim / open_ai_with_pytest_simple_version

Star

Integration of OpenAI with Pytest to automate API test generation.

artificial-intelligence pytest openai api-testing software-testing automated-testing open-ai automation-testing ai-testing llm-agents ai-test-case-generator

Updated Jun 11, 2025
Python

superU-ai / voice-agent-QA

Star

A unified benchmarking framework for evaluating Voice AI agents across conversational quality, audio realism, latency metrics, and safety guardrails with scalable multi-language stress testing.

benchmarking text-to-speech webrtc stress-testing speech-recognition ai-safety conversational-ai vapi voice-ai latency-testing retell livekit speech-ai ai-testing llm-evaluation real-time-ai asr-tts qa-framework voice-ai-benchmarking

Updated Feb 26, 2026
Python

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ai-testing

Here are 59 public repositories matching this topic...

Giskard-AI / giskard-oss

PacificAI / langtest

Addepto / contextcheck

srvsngh99 / genai-testing-journey

kdunee / intentguard

hemangjoshi37a / claude-code-frontend-dev

onerun-ai / onerun

alepot55 / agentrial

monkscode / Natural-Language-to-Robot-Framework

adhit-r / fairmind

jhd3197 / Prompture

Harshit-J004 / toolguard

ChiufungLee / RAG_TestCases_Generator

shivalimittal123 / astraforge.io

nfodor / mcp-chromium-arm64

yukincom / llm-SugarScape

isagawa-qa / platform-selenium

sazed5055 / llmtest

shyinlim / open_ai_with_pytest_simple_version

superU-ai / voice-agent-QA

Improve this page

Add this topic to your repo