Datasets
Access comprehensive, FAIR-compliant data generated to meet the Grand Challenges and advance the future of health and behavior research.
AI/ML for Clinical Care / CHoRUS
CHoRUS Dataset
Coming Soon
About the Dataset
The CHoRUS project is developing a flagship dataset to support AI/ML research focused on team-based clinical care. The dataset has not yet been released; it is designed to support the development of responsible, real-world AI tools that enhance healthcare delivery.
Functional Genomics / CM4AI
Cell Maps for Artificial Intelligence (CM4AI) Dataset
March 2025 Beta Release
About the Dataset
The CM4AI dataset delivers rich, multimodal cellular data designed to support AI research in precision medicine and drug response.
This Beta release includes:
- Perturb-seq data in undifferentiated KOLF2.1J iPSCs
- SEC-MS data in undifferentiated KOLF2.1J iPSCs and iPSC-derived NPCs, neurons, and cardiomyocytes
- Immunofluorescence images in MDA-MB-468 breast cancer cells with and without chemotherapy (vorinostat and paclitaxel)
CM4AI datasets are packaged in RO-Crate format using the FAIRSCAPE framework, ensuring AI-readiness, traceable provenance, and rich metadata. Data will be continuously augmented through the end of the project.
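For readers new to RO-Crate, the following is a minimal sketch of how a crate's contents might be listed with the Python standard library. It assumes the crate follows the general RO-Crate convention of a top-level `ro-crate-metadata.json` file whose `@graph` contains a metadata descriptor and a root dataset with `hasPart` references. The function name `summarize_crate` and the directory path are hypothetical, and crates produced by FAIRSCAPE may carry additional or differently structured metadata.

```python
import json
from pathlib import Path


def _as_list(value):
    """Normalize a JSON-LD value that may be missing, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarize_crate(crate_dir):
    """Return (root_name, parts) for an RO-Crate directory.

    parts is a list of (id, types, name) tuples, one per entity that the
    root dataset references through hasPart.
    """
    metadata_path = Path(crate_dir) / "ro-crate-metadata.json"
    with metadata_path.open(encoding="utf-8") as fh:
        graph = json.load(fh).get("@graph", [])

    # Index every entity by its @id so references can be resolved.
    by_id = {entity["@id"]: entity for entity in graph if "@id" in entity}

    # The metadata descriptor's "about" property points at the root dataset.
    descriptor = by_id.get("ro-crate-metadata.json", {})
    root_id = descriptor.get("about", {}).get("@id", "./")
    root = by_id.get(root_id, {})

    parts = []
    for ref in _as_list(root.get("hasPart")):
        part_id = ref.get("@id")
        entity = by_id.get(part_id, {})
        parts.append((part_id, _as_list(entity.get("@type")), entity.get("name")))
    return root.get("name"), parts
```

Calling `summarize_crate("path/to/crate")` on a downloaded crate (the path is a placeholder) would return the root dataset's name and the files or nested datasets it references, which is a quick way to inspect a crate before loading any data.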
Precision Public Health / Voice
Bridge2AI-Voice Dataset
Version 2.0 Release
About the Dataset
The Bridge2AI-Voice dataset treats voice as a non-invasive, scalable biomarker linked to a wide range of health conditions, including neurological, mood, respiratory, and voice disorders. Designed to support responsible AI research, this ethically sourced dataset combines voice-derived features with detailed clinical and demographic data.
Version 2.0 includes:
- 19,271 recordings from 442 participants across five North American sites
- Derived voice features, including spectrograms (original recordings excluded for privacy)
- Rich clinical data, demographics, and validated questionnaire responses
Participants were selected for conditions known to affect vocal characteristics, enabling researchers to explore meaningful links between acoustic markers and health status. This dataset is ideal for advancing AI models in diagnostics, monitoring, and digital health.
Salutogenesis / AI-READI
Flagship Dataset of Type 2 Diabetes from the AI-READI Project
Version 2.0.0 Release
About the Dataset
The Artificial Intelligence Ready and Exploratory Atlas for Diabetes Insights (AI-READI) project aims to revolutionize how we understand and treat type 2 diabetes mellitus (T2DM) through ethically sourced, AI-optimized data. By assembling one of the most comprehensive multimodal datasets of its kind, AI-READI supports cutting-edge research into disease progression, recovery, and health-promoting (salutogenic) pathways.
Version 2.0.0 includes data from 1,067 participants, collected between July 19, 2023 and July 31, 2024. This release is part of a larger effort to build a cross-sectional dataset of 4,000 individuals, with longitudinal follow-up planned for 10% of the cohort. The study population is balanced across diabetes stages, from healthy individuals to those with insulin-dependent T2DM.
The collected data span multiple biological, physiological, and behavioral modalities and are designed to support pseudo-time manifold analysis, which lets researchers reconstruct disease trajectories and identify opportunities for intervention.
Key features:
- 1,067 participants
- 165,051 files (~2.01 TB total)
- Multimodal, de-identified data (protected health information, PHI, removed)
- No information on sex, race/ethnicity, or medications included in this release
This dataset is built to advance AI/ML research while minimizing bias and enhancing reproducibility. Future versions will continue to expand coverage, diversity, and depth.
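The figures quoted for Version 2.0.0 can be put in proportion with a little arithmetic. The sketch below derives a few ratios from the numbers stated above; the function name `release_stats` is hypothetical, and the file-size estimate assumes "TB" means decimal terabytes (10^12 bytes), which the release notes do not specify.

```python
def release_stats(participants, target_cohort, longitudinal_fraction,
                  n_files, total_tb):
    """Derive simple ratios from the published release figures."""
    total_bytes = total_tb * 10**12  # assumption: decimal terabytes
    return {
        # Share of the planned cross-sectional cohort covered by this release.
        "enrolled_fraction": participants / target_cohort,
        # Number of participants planned for longitudinal follow-up.
        "planned_longitudinal": round(target_cohort * longitudinal_fraction),
        # Average file size in megabytes (10^6 bytes).
        "mean_file_mb": total_bytes / n_files / 10**6,
        # Average number of files per participant.
        "files_per_participant": n_files / participants,
    }


stats = release_stats(
    participants=1067,
    target_cohort=4000,
    longitudinal_fraction=0.10,
    n_files=165_051,
    total_tb=2.01,
)
print(f"Enrolled so far: {stats['enrolled_fraction']:.1%} of the target cohort")
print(f"Planned longitudinal follow-up: {stats['planned_longitudinal']} participants")
print(f"Mean file size: {stats['mean_file_mb']:.1f} MB")
print(f"Files per participant: {stats['files_per_participant']:.0f}")
```

On these figures, the release covers roughly 27% of the planned 4,000-person cohort, and the 10% longitudinal follow-up corresponds to 400 participants.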