The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Updated Apr 10, 2025 - Python
Refine high-quality datasets and visual AI models
A Doctor for your data
fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
[ICLR 2025] Improving Data Efficiency via Curating LLM-Driven Rating Systems
Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.
A library for detecting problematic data segments in structured and unstructured data with a few lines of code.
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
A tool for downloading from public image boards (those that allow scraping), previewing your images and tags, and editing your images and tags. Additional tabs let you download other code repositories as well as S.O.T.A. diffusion and auto-tag/caption models. Custom datasets can be added!
Client interface to Cleanlab Studio and the Trustworthy Language Model
🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation (EMNLP 2023)
Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.
Rebalancing chemical reaction
AqSolDB: A curated aqueous solubility dataset containing 9,982 unique compounds.
Data Cleaning and Data Profiling Library for Python
Reaction data exploration: a map of reagents with regions of similar reagent purpose.
tranSMART Arborist ETL toolkit
HISDAC-ES: Creating historical settlement data for Spain (1900-2020) based on cadastral building footprint data
Archaeological Map of the Czech Republic (AMCR)