AI&DHI Innovation Spotlight: Automated Harmonization of Multi-institutional Electronic Health Records Data

 

This edition of the AI&DHI Innovation Spotlight is focused on work led by Dr. Xu Shi, Associate Professor of Biostatistics in the School of Public Health. Dr. Shi and her team are developing statistical methods and computational tools to address challenges in medical coding heterogeneity across large-scale electronic health record (EHR)–linked biobanks.


A major challenge in using EHR data for research is the inconsistency in how clinical data is recorded and labeled. Every hospital, clinic or healthcare network may use different terminologies, coding systems, formats and levels of detail when documenting the same medical condition or event. Known as medical coding heterogeneity, this lack of standardization creates obstacles in EHR-based research. For example, if the same disease is labeled in multiple ways across institutions, researchers may miss cases or classify patients incorrectly, introducing bias into their studies or even producing misleading results. 

To address this challenge, and with support from AI & Digital Health Innovation (AI&DHI), Dr. Shi and her team are creating and applying new statistical methods and computer programs to describe and reduce differences in how medical events are recorded in two large EHR-linked databases: the Michigan Genomics Initiative (MGI) and the UK Biobank.

“Biobanks like these provide a powerful resource for discovering how genes and environment influence health, but their promise relies on being able to accurately and consistently identify diseases and traits from clinical records,” said Dr. Shi. “If researchers can’t confidently determine who has a given condition due to inconsistent coding, it weakens the reliability of downstream genetic analyses.”

Looking at summary data, the team found major differences in how information is coded between MGI and the UK Biobank. The team is now exploring data-driven methods to match and harmonize the datasets, making it easier to combine data from different biobanks and helping ensure that study results are reliable.


The team’s research could dramatically reduce the time, cost and effort needed to make EHR data ready for research, enabling large-scale, multi-institutional studies that are more representative and generalizable.

“I am deeply grateful for the support from AI&DHI, which provided funding for my master’s and PhD students as well as for my collaborators,” said Dr. Shi. “This support was instrumental in enabling our team to pursue the development of statistical methods and computational tools to address challenges in medical coding heterogeneity across large-scale EHR-linked biobanks. Through AI&DHI, we were also able to leverage curated Michigan Genomics Initiative (MGI) data, which was critical for evaluating and validating our approaches.”

Dr. Shi also notes that the team’s work greatly benefited from an interdisciplinary collaboration with Dr. Lars Fritsche, Associate Research Scientist of Biostatistics, and Dr. VG Vinod Vydiswaran, Associate Professor of Learning Health Sciences and Information, whose expertise enriched the project’s methodology and broadened its impact.  

“Statistical harmonization allows us to see through the noise of fragmented coding systems and uncover the true clinical signals hiding in EHR data,” said Dr. Shi. “AI&DHI’s support not only advanced our research in this area but also strengthened the integrity and generalizability of our findings. This work highlights the value of interdisciplinary and data-driven collaboration fostered by AI&DHI in advancing biomedical informatics research.”


Publication

Xu Shi, Yuqi Zhai, Xianshi Yu, Xiaoou Li, Brian L Hazlehurst, Denis B Nyongesa, Daniel S Sapp, Brian D Williamson, David S Carrell, Luesa Healy, Kara L Cushing-Haugen, Jenna Wong, Shirley V Wang, James S Floyd, Kathleen Shattuck, Samuel McGown, Sarah Alam, José J Hernández-Muñoz, Jie Li, Yong Ma, Danijela Stojanovic, Sudha R Raman, Sharon E Davis, Tianxi Cai, Jennifer C Nelson, Patrick J Heagerty, Statistical Methods to Harmonize Electronic Health Record Data Across Healthcare Systems: Case Study and Lessons Learned, Bioinformatics, 2026, btag107, https://doi.org/10.1093/bioinformatics/btag107

About AI & Digital Health Innovation

AI & Digital Health Innovation (formerly Precision Health at U-M) is dedicated to empowering researchers at the University of Michigan to change the future of digital healthcare. It works with multidisciplinary teams of health providers, basic scientists, engineers, and administrators to tackle the most difficult research problems and help rapidly bring ideas to the bedside. For more information, visit aidhi.umich.edu.

