
HazyResearch/data-centric-ai
📦 Open Source Project · HazyResearch
A curated collection of research, papers, and resources focused on the data-centric AI paradigm.
The data-centric AI paradigm shifts the focus from engineering model architectures to systematically improving data quality. This repository serves as a central hub for that work, offering a structured list of seminal papers, technical resources, and best practices. It covers data labeling, data cleaning, synthetic data generation, and data debugging. By emphasizing a data-first philosophy, the project helps developers understand how to build more robust, reliable, and performant AI systems. It is particularly valuable for those exploring how to handle noisy datasets, apply data programming, and use weak supervision to scale model training. It acts as a living bibliography for the evolving field of data-centric machine learning.
💡 Highlights
- Curated data-centric research
- Focus on data quality over models
- Covers weak supervision & cleaning
🎯 For
- Machine Learning Engineers
- Data Scientists
- AI Researchers