[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
An agent evaluation framework with native multi-turn feedback iteration.
CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Open-source multi-agent AI debate arena: pit Claude, GPT, Gemini, Ollama & HuggingFace models against each other with frozen-context fairness, evidence-first judging, 20+ personas, code review, and PDF/Markdown reports. CLI + Web UI.
Testing how well LLMs can solve jigsaw puzzles
Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).
Comprehensive benchmark of OpenRouter free-tier LLMs for practical applications. Evaluates models for coding, Thai language, and general use.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
LiveSecBench (a dynamic safety evaluation benchmark for large language models) is a professional, dynamic, multi-dimensional benchmark for LLM safety. Through a scientific, systematic, and continuously evolving evaluation system, it aims to objectively assess and measure the safety of large models, steer LLM technology toward safer, more reliable, and more responsible development, and provide a key safety yardstick for industrial deployment and academic research.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Benchmark LLMs Spatial Reasoning with Head-to-Head Bananagrams
Claude Code skill that pits Claude, ChatGPT, and Gemini against each other, then lets them cross-judge each other blind
A decentralized, adversarial + dynamic AI evaluation protocol on Bittensor. Combats benchmark saturation by measuring genuine intelligence through dynamic, zero-shot generalization tasks.
A minimal constitutional law for tool-using AI agents centered on human dignity, clear agency, and revocable oversight.
Pick Your LLM: Intelligent, Use-Case-Aware LLM Advisor for Optimal Performance and Cost
Yes, LLMs just regurgitate the same jokes from the internet over and over again. But some are slightly funnier than others.
Is it better to run a Tiny Model (2B-4B) at High Precision (FP16/INT8), or a Large Model (8B+) at Low Precision (INT4)? This benchmark framework lets developers scientifically choose the best model for resource-constrained environments (consumer GPUs, laptops, edge devices) by measuring the trade-off between Speed and Intelligence.
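As context for that speed-vs-intelligence question, here is a minimal, hypothetical sketch of such a comparison (not taken from any repository listed above): the model names, the `load_model` stub, and the toy exact-match scoring are assumptions standing in for a real inference runtime and evaluation set.

```python
# Hypothetical sketch: compare a small high-precision model against a larger
# low-precision one on the same prompts, recording throughput (speed) and
# exact-match accuracy (a crude "intelligence" proxy).
# `load_model` and the returned callable are placeholders for whatever runtime
# you use (llama.cpp, vLLM, transformers, etc.); swap in real calls before
# drawing any conclusions.

import time
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    tokens_per_sec: float
    accuracy: float


def load_model(name: str):
    """Placeholder loader; replace with your actual inference runtime."""
    return lambda prompt: "42"  # dummy generation


def benchmark(name: str, prompts: list[str], answers: list[str]) -> Result:
    model = load_model(name)
    start = time.perf_counter()
    generated_tokens, correct = 0, 0
    for prompt, answer in zip(prompts, answers):
        output = model(prompt)
        generated_tokens += len(output.split())
        correct += int(output.strip() == answer)
    elapsed = time.perf_counter() - start
    return Result(name, generated_tokens / elapsed, correct / len(prompts))


if __name__ == "__main__":
    prompts = ["What is 6 * 7? Answer with a number only."]
    answers = ["42"]
    for config in ("tiny-2b-fp16", "large-8b-int4"):  # hypothetical model names
        r = benchmark(config, prompts, answers)
        print(f"{r.name}: {r.tokens_per_sec:.1f} tok/s, accuracy {r.accuracy:.0%}")
```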