Improving Human–AI System Sustainability in Health Via Aligned Relevance and Evaluation
About
As large language models (LLMs) become more integrated into digital health technologies, evaluating the effectiveness of human-AI systems in clinical and patient-facing contexts is increasingly critical. This project advances digital sustainability in healthcare by developing robust, context-sensitive evaluation metrics for human-AI collaboration. The work comprises two subprojects. The first constructs a benchmark that compares physician and LLM sentence-level relevance judgments in clinical vignettes, with the aim of improving diagnostic accuracy and interpretability. The second applies and validates evaluation metrics on a dataset of 3,000 human-only and AI-augmented responses generated through an online experiment with the prototype from the previous project phase. The outcome will be a validated framework of evaluation metrics and design guidelines for measuring the value of human-AI systems. This work contributes to the responsible and sustainable integration of AI in healthcare and aligns with United Nations Sustainable Development Goals 3 and 9 by enhancing transparency, trust, and effectiveness in AI-augmented care.
Principal Investigators
- Prof. Marzyeh Ghassemi (MIT CSAIL)
- Prof. Dr. Ariel Stern (HPI)