Research

MS-NIMBLE: an accurate and scalable framework for metabolomics data with latent confounds and non-ignorable missing observations

Metabolomics is the high-throughput study of small-molecule metabolites, and has the potential to yield new insights into the origins of human disease and drug metabolism. Unlike other high-throughput biological data, metabolomic data contain a vast amount of missing data, nearly all of which is missing not at random due to an unknown, metabolite-specific missingness mechanism. These data also contain a multitude of latent factors that can confound relationships of interest if they are not properly accounted for. Unfortunately, current methods for analyzing these data can account for either the missing data or the latent factors, but not both. We therefore developed a statistically and computationally efficient method that accounts for both non-random missing data and latent confounding factors in metabolomic experiments. We showed that our method can be used to solve several critical problems in metabolomics, including biomarker discovery, dimension reduction, and inference in metabolite genome-wide association studies.

This problem involved extensive work in both theoretical and applied statistics. Because the current state of the art in metabolomic data analysis is to impute missing data, I first proved that hypothesis tests of the relationship between metabolite levels and a covariate of interest in imputed metabolomic data are only accurate if the covariate of interest is independent of metabolite levels and all other observed covariates. As metabolomic data are influenced by a number of nuisance covariates that confound the relationship between metabolite levels and the covariate of interest, this result helped explain why it is critical to properly account for the non-random missing data in metabolomic experiments. Next, I developed and implemented new methodology. Using Dr. McKennan’s previous work on estimating metabolite-specific missingness mechanisms and latent confounders as a starting point, I constructed a full-data likelihood and proved that we could efficiently estimate the parameters of interest using Gaussian quadrature. This was critical, as it made inference with our method computationally tractable, and therefore more likely to be adopted by biological practitioners. I then devised a novel finite-sample corrected estimator for the variance of the estimated parameters, and implemented our methodology in an easy-to-use R package. I used simulated data to compare the performance of our new method with existing software, and showed that our method unequivocally outperforms all existing methods on a suite of problems, including metabolite biomarker discovery and dimension reduction. I also evaluated our method’s performance on real metabolomic data, where I showed that it identifies a greater number of plausible childhood asthma-related metabolites.
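
To give a rough sense of why Gaussian quadrature makes this kind of likelihood tractable, the sketch below integrates a missingness probability over an unobserved, normally distributed metabolite level using Gauss-Hermite quadrature. It is only an illustration, not the implementation in our R package: the probit missingness mechanism, the parameters `cutoff` and `scale`, and all function names are assumptions chosen for the example, and latent confounders are omitted entirely.

```python
import math
import numpy as np

# Elementwise error function; otypes is set so empty inputs are also handled.
_erf = np.vectorize(math.erf, otypes=[float])


def normal_cdf(z):
    """Standard normal CDF, applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))


def prob_missing_given_y(y, cutoff, scale):
    """Hypothetical probit missingness mechanism (an assumption for this sketch):
    lower metabolite levels y are more likely to be missing."""
    return normal_cdf((cutoff - np.asarray(y, dtype=float)) / scale)


def marginal_prob_missing(mu, sigma, cutoff, scale, n_nodes=40):
    """Approximate P(missing) = E[P(missing | Y)] for Y ~ N(mu, sigma^2).

    Gauss-Hermite nodes x_i and weights w_i satisfy
        integral of exp(-x^2) f(x) dx  ~=  sum_i w_i f(x_i),
    so the change of variables y = mu + sqrt(2) * sigma * x gives
        E[g(Y)]  ~=  (1 / sqrt(pi)) * sum_i w_i g(mu + sqrt(2) * sigma * x_i).
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    y = mu + math.sqrt(2.0) * sigma * nodes
    return float(np.sum(weights * prob_missing_given_y(y, cutoff, scale)) / math.sqrt(math.pi))


def observed_data_loglik(y_obs, n_missing, mu, sigma, cutoff, scale):
    """Log-likelihood for one metabolite under the assumed model.

    Each observed value contributes its normal density times P(observed | y);
    each missing value contributes the marginal P(missing) from quadrature.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    log_density = -0.5 * ((y_obs - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
    log_observed = np.log1p(-prob_missing_given_y(y_obs, cutoff, scale))
    log_missing = math.log(marginal_prob_missing(mu, sigma, cutoff, scale))
    return float(np.sum(log_density + log_observed) + n_missing * log_missing)


# Example usage with made-up numbers: three observed values, two missing.
print(observed_data_loglik([2.1, 3.0, 1.7], n_missing=2, mu=2.0, sigma=1.0, cutoff=1.0, scale=1.0))
```

Under this particular probit assumption the integral also has a closed form, Φ((cutoff − μ) / √(scale² + σ²)), which gives a convenient check on the quadrature. The quadrature route is the one that carries over to mechanisms without a closed form, since each integral is replaced by a short weighted sum.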