Improving Human–AI System Sustainability in Health Via Aligned Relevance and Evaluation

About

As large language models (LLMs) become more integrated into digital health technologies, evaluating the effectiveness of human-AI systems in clinical and patient-facing contexts is increasingly critical. This project advances digital sustainability in healthcare by developing robust, context-sensitive evaluation metrics for human-AI collaboration. The work comprises two subprojects. The first constructs a benchmark that compares physician and LLM sentence-level relevance judgments on clinical vignettes, with the aim of improving diagnostic accuracy and interpretability. The second applies and validates evaluation metrics on a dataset of 3,000 human-only and AI-augmented responses generated through an online experiment with the prototype from the previous project phase. The expected outcome is a validated framework of evaluation metrics and design guidelines for measuring the value of human-AI systems. This work contributes to the responsible and sustainable integration of AI in healthcare and aligns with United Nations Sustainable Development Goals 3 and 9 by enhancing transparency, trust, and effectiveness in AI-augmented care.

Research Team

Prof. Marzyeh Ghassemi (MIT CSAIL, PI)
Prof. Dr. Ariel Stern (HPI, PI)
Aparna Balagopalan (MIT)
Danielly de Paula (HPI)
Yuexing Hao (MIT)
Vanessa Kipping (HPI)
Nikita Shishelyakin (HPI)