BRIDGE2AI

Datasets

Access comprehensive, FAIR-compliant data generated to meet the Grand Challenges and advance the future of health and behavior research.

AI/ML for Clinical Care CHoRUS Dataset

About the Dataset

The CHoRUS project is developing a flagship dataset to support AI/ML research focused on team-based clinical care. The dataset is designed to support the development of responsible, real-world AI tools that enhance healthcare delivery. It is available in the Mass General Brigham (MGB) Azure Enclave under registered access and through participation in challenges and programs.

For opportunities to interact with the CHoRUS dataset, please visit our Events page.

  • Future releases will include:
    • Non-medical human factors
    • Waveform EEG
    • Radiology Images
  • Over 50,000 ICU admissions from 14 hospitals across the United States, including patients with acute kidney injury (AKI), shock, sepsis, trauma, and more
  • 1.6B rows of OMOP-standardized EHR data
  • 7,642 patients with Radiology Data 
  • 23 TB of Waveform Data

Functional Genomics Cell Maps for Artificial Intelligence (CM4AI) Dataset

About the Dataset

The CM4AI dataset delivers rich, multimodal cellular data designed to support AI research in precision medicine and drug response. CM4AI datasets are packaged in RO-Crate format using the FAIRSCAPE framework, ensuring AI-readiness, traceable provenance, and rich metadata. Data will be continuously augmented through the end of the project.

  • Future releases will include:
    • AP-MS interactomes for MDA-MB-468 triple negative breast cancer (TNBC) cells
    • IF images and AP-MS interactomes for undifferentiated (parental) iPSCs
    • Perturb-seq data, IF images, and AP-MS interactomes for iPSC-derived neural progenitor cells (NPCs), neurons, and cardiomyocytes
  • New in this release:
    • Perturb-seq data in MDA-MB-468 breast cancer cells (+/- treatment)
  • Perturb-seq data in KOLF2.1J iPSCs (undifferentiated)
  • SEC-MS data in KOLF2.1J iPSCs (undifferentiated, NPC, neuron, and cardiomyocyte)
  • SEC-MS data in MDA-MB-468 breast cancer cells (+/- treatment)
  • IF images in MDA-MB-468 breast cancer cells (+/- treatment)

  • New in this release:
    • SEC-MS data in MDA-MB-468 breast cancer cells (+/- treatment)
    • RGB immunofluorescent images, corrections to RO-Crate metadata, and changes to naming conventions
  • Perturb-seq data in KOLF2.1J iPSCs (undifferentiated)
  • SEC-MS data in KOLF2.1J iPSCs (undifferentiated, NPC, neuron, and cardiomyocyte)
  • IF images in MDA-MB-468 breast cancer cells (+/- treatment)

  • Perturb-seq data in KOLF2.1J iPSCs (undifferentiated)
  • SEC-MS data in KOLF2.1J iPSCs (undifferentiated, NPC, neuron, and cardiomyocyte)
  • IF images in MDA-MB-468 breast cancer cells (+/- treatment)

Precision Public Health Bridge2AI-Voice Dataset

About the Dataset

The Bridge2AI-Voice dataset explores the power of voice as a non-invasive, scalable biomarker linked to a wide range of health conditions, including neurological, mood, respiratory, and voice disorders. Designed to support responsible AI research, this ethically sourced dataset combines voice-derived features with detailed clinical and demographic data.

  • 833 adult participants across 5 North American sites 
  • 33,041 recordings (157.5 hrs)

View Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (3.0.0)

  • 442 adult participants across 5 North American sites
  • 19,271 recordings (86.13 hrs)
  • New in this version: raw audio data

View Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (2.0.0)

Released: Jan 17, 2025

  • 307 adult participants across 5 North American sites
  • 12,523 recordings (59.48 hrs)

View Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (1.1)

Salutogenesis Flagship Dataset of Type 2 Diabetes from the AI-READI Project

About the Dataset

The Artificial Intelligence Ready and Exploratory Atlas for Diabetes Insights (AI-READI) project aims to revolutionize how we understand and treat type 2 diabetes mellitus (T2DM) through ethically sourced, AI-optimized data. By assembling one of the most comprehensive multimodal datasets of its kind, AI-READI supports cutting-edge research into disease progression, recovery, and health-promoting (salutogenic) pathways. This dataset is built to advance AI/ML research while minimizing bias and enhancing reproducibility. Future versions will continue to expand coverage, diversity, and depth. Collected data spans multiple biological, physiological, and behavioral modalities and is designed to support pseudo-time manifold analysis, enabling researchers to reconstruct disease trajectories and identify opportunities for intervention.

  • 100 participants
  • 15,793 files (179.68 GB)
  • Note: This is not a full dataset. It was created for pipeline development only and should not be used for conducting scientific investigations.

  • 2,280 participants
  • 358,999 files (~3.82 TB total)
  • Collected between July 19, 2023 and May 1, 2025

  • 1,067 participants
  • 165,051 files (~2.01 TB total)
  • Collected between July 19, 2023 and July 21, 2024

  • 204 participants
  • 21,669 files (~310 GB total)
  • Collected between July 19, 2023 and November 30, 2023
