Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Updated Jan 13, 2026 - Python
The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
Enhances construction site safety using YOLO for object detection, identifying hazards like workers without helmets or safety vests, and proximity to machinery or vehicles. HDBSCAN clusters safety cone coordinates to create monitored zones. Post-processing algorithms improve detection accuracy.
Social Media Mining Toolkit (SMMT) main repository
A system for prompted weak supervision. Alfred is a powerful tool that leverages large language models to accelerate data annotation.
Data-centric AI building blocks for computer vision applications
Uses large language models such as OpenAI's GPT-3.5 for data annotation and model enhancement. The framework combines human expertise with LLMs, employs iterative active learning for continuous improvement, and integrates CleanLab (Confident Learning) to ensure high-quality datasets and better model performance.
Lightweight self-hosted span annotation tool
A free and open-source all-in-one training tool for YOLOv8, YOLO11, and YOLO26. It automates file structure and YAML files and offers auto-labeling with SAM2, a brush system for uninterrupted labeling, a modular augmentation system where anybody can write their own filters, and training, all without having to open a terminal.
AnnoTheia is a data annotation toolkit that identifies when a person speaks in a scene and transcribes their speech, also offering flexibility to replace modules for different languages.
A tool for mapping free-text descriptions of entities to ontology terms
SuperAnnotate HTTP service for Generated Text Detection
Jaehyung Kim et al.'s ACL 2023 paper on "infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information"
The entry point for adapting, training, evaluating, and leveraging various Large Language Models (LLMs) for a wide range of Ukrainian NLP tasks.
A PointRCNN version of SAnE, which is a web-based semi-automatic annotation tool for point cloud data.
Simple Telegram bot to annotate and verify automatic speech recognition datasets
Annotate data using Jupyter notebooks
🧠 Multimodal Retrieval-Augmented Generation that "weaves" together text and images seamlessly. 🪡
Structured test tasks and model tuning scripts for multiple subjects from ZNO - the Ukrainian External Independent Evaluation (ЗНО)
Review, correct, and export ASR transcripts at scale. Web-based ASR accuracy workbench for reviewing, correcting, and exporting speech-to-text transcripts using Whisper, FFmpeg, and Flask.