DISCUSSION
The GAN approach investigated here integrates AI techniques with NHANES, a large public-domain database of biomedical information on diverse populations that could prove valuable in pharmacometrics applications. Proof-of-concept computational experiments were conducted to evaluate the capability of GANs to simulate univariate distributions for a test bed of 16 diabetes-relevant biomarkers. The GAN strategy was then extended to complex joint distributions of multiple biomarkers, and finally a conditional GAN was used to model the Black, Hispanic, Other, and White race/ethnicity categories. A training-test strategy was used to evaluate GAN performance.
The GAN strategy enables robust learning and can be considered “non-parametric” because it does not need prior distributions, which are required for Bayesian approaches. While the latent space for a GAN generator is sampled from a multivariate Gaussian distribution, it serves only as a source of random noise for the generator neural network to transform. Notably, the GAN architecture is indirect because it does not conduct a head-to-head comparison of the generated data distribution vs. the training data distribution. GANs avoid direct comparison by interposing a binary classifier (the discriminator) and making judicious use of adversarial loss functions. The literature on GANs in pharmacometrics is sparse. Parikh et al. 14 used GANs to generate instances of models for cardiac mechanics in control myocytes and in myocytes treated with omecamtiv mecarbil, a new drug for treating heart failure. The GANs were used to find model parameters for fitting the data for both groups. This application of GANs to in vitro data differs qualitatively from the patient-centric problem in our research.
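The indirect comparison described above can be illustrated with a minimal sketch of the standard binary cross-entropy adversarial losses. This is an assumption for illustration only: the function names, the latent dimension, and the discriminator scores are hypothetical, and the loss actually used in the study may differ (for example, a Wasserstein-type loss).

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy with real samples labelled 1 and generated samples labelled 0."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-(np.log(d_real).mean() + np.log(1.0 - d_fake).mean()))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: small when generated samples are scored as real."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-np.log(d_fake).mean())

# Latent noise: 4 draws from an 8-dimensional standard Gaussian (dimension is arbitrary here).
rng = np.random.default_rng(0)
latent = rng.standard_normal(size=(4, 8))

# Hypothetical discriminator scores, i.e. estimated probabilities that a sample is real.
d_real = np.array([0.9, 0.8, 0.7, 0.95])
d_fake = np.array([0.2, 0.1, 0.3, 0.25])
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

Neither loss evaluates the generated and training distributions against each other directly; both depend only on the classifier's scores, which is the sense in which the comparison is indirect.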
Conditional GANs are an extension of GANs wherein the generator and discriminator networks are conditioned with additional input. Conditional GANs are particularly useful for modeling multimodal data and have been used elsewhere for tagging and annotating images15. We found that the joint distribution of biomarker profiles could be modeled using GAN architectures effective for tabular data, which can consist of multiple data types, e.g., continuous, ordinal, and categorical variables. Tabular data generation presents some unique challenges compared with GAN modeling of images because: i) columns in a row do not have local structure and, ii) continuous variables that depend on the conditioning variable are generally multimodal (i.e., the density function has several peaks). The typical GAN architectures designed for images are not particularly good at generating multimodal data because of a phenomenon termed “mode collapse”. Mode collapse reduces the diversity of output samples and occurs when the generator can only produce a single type of output or a small set of outputs that fool the discriminator 13. To simultaneously generate a mix of discrete and continuous columns, the Xu et al.12 GAN approach applies both softmax and tanh on the output. We used the PacGAN method, wherein the discriminator decision-making is guided by multiple or “packed” samples from each class 13. In PacGAN, the discriminator does not classify each generated sample individually but instead examines a “pack” of samples for a class. Thus, diversity of the generated samples becomes a criterion for the discriminator in the classification process, which helps avoid mode collapse. By implementing these enhancements12,13, we found that a conditional GAN yielded effective results for modeling race/ethnicity. The approach addresses the frequency differences between the various under-represented groups and the multimodality resulting from between-group differences in biomarker expression.
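The packing idea can be sketched as a simple reshaping step applied before the discriminator sees a batch. The helper name `pack_samples` and the pack size are hypothetical, chosen only for illustration; this is not the implementation used in the study.

```python
import numpy as np

def pack_samples(batch, pack_size):
    """Concatenate consecutive rows so each discriminator input holds `pack_size` samples.

    `batch` has shape (n, d) and must come from a single source (all real or all
    generated); the result has shape (n // pack_size, pack_size * d).
    """
    n, d = batch.shape
    if n % pack_size != 0:
        raise ValueError("batch size must be divisible by pack_size")
    return batch.reshape(n // pack_size, pack_size * d)

# Six samples with two features each, packed three at a time.
batch = np.arange(12, dtype=float).reshape(6, 2)
packed = pack_samples(batch, pack_size=3)
print(packed.shape)
```

Because the discriminator scores a whole pack at once, a generator that emits near-identical samples produces packs with visibly low within-pack variation, which the discriminator can learn to reject.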
We selected 16 diverse diabetes-relevant physiological biomarkers that reflect different organ systems and become clinically salient at different stages of diabetes progression. Alterations to plasma glucose and insulin profiles are direct consequences of diabetes and can be dysregulated early in diabetes because of decreased pancreatic β-cell function or increased insulin resistance in hepatic and peripheral tissues. Glycohemoglobin is related to the average glucose exposure over 2-3 months. In contrast, increased urinary creatinine and albumin are the result of compromised renal function during diabetes disease progression. We also included integrative biomarkers (e.g., body mass index and systolic blood pressure), metabolic biomarkers (e.g., triglycerides and cholesterol), inflammatory biomarkers (C-reactive protein and ferritin), and hepatic biomarkers (e.g., alanine aminotransferase, aspartate aminotransferase, and gamma glutamyltransferase) that are dysregulated in diabetes.
One of the strengths of NHANES as a source of “big data” for modeling under-represented groups is that, although the total sample size in a given cycle is fixed, the survey adapts its population-based sampling strategy to include adequate numbers of individuals from under-represented groups. For example, there is ongoing oversampling of Hispanics, non-Hispanic Blacks, older adults, and low-income White/Other groups, and non-Hispanic Asians have been oversampled since 2011 16. We used the RIDRETH1 variable from NHANES to derive our under-represented groups; additional race-ethnicity variables have been added to NHANES, but these variables were not available across all the datasets we used. A weakness is that the NHANES sample is limited to the non-institutionalized civilian resident population: it does not contain groups such as prisoners, military personnel, and individuals in psychiatric institutions or drug rehabilitation facilities. Interestingly, Allen et al. and Rieger et al. also leveraged NHANES data in their work on virtual patients 17,18. We have previously used NHANES as the data source in the generalized pharmacometrics modeling (GPM) approach, which integrates population models with AI techniques. GPM simulates pharmacokinetic (PK) parameters from population PK covariate models using Bayesian networks that include demographic and biomarker features identified from NHANES. The integration of external data enables GPM to facilitate modeling and simulation of drug disposition and effects for populations different from those in the underlying PK study7.
Creating virtual populations requires modeling or otherwise sampling the joint distribution of biomarkers of interest. If the biomarkers are not normally distributed or if there are multiple biomarkers of interest, covariance matrices are generally inadequate for characterizing higher-order inter-dependencies. General empirically motivated methods for producing virtual patient populations include patient selection using inclusion and exclusion criteria 19, bootstrapping similar clinical trials or patient databases20, and simulating from fitted distributions21. Simulated annealing and nested simulated annealing-based methods have been proposed for generating “plausible” populations in the context of quantitative systems pharmacology models17,18. Our GAN approach relies on neural network-based learning and is generative, i.e., it creates new sample sets. It therefore differs substantially from the non-parametric re-sampling and parametric Bayesian approaches that have been used in pharmacometrics for approximating data distributions.
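The limitation of covariance matrices noted above can be seen in a small synthetic sketch. The two variables below are hypothetical and are not NHANES biomarkers: one is a deterministic nonlinear function of the other, yet their linear correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)

# x is standard normal; y is completely determined by x but is not linearly related to it.
x = rng.standard_normal(200_000)
y = x**2 - 1.0

# The off-diagonal covariance (here reported as a correlation) is near zero...
linear_corr = np.corrcoef(x, y)[0, 1]

# ...even though y is an exact function of x, as the correlation with x**2 shows.
nonlinear_corr = np.corrcoef(x**2, y)[0, 1]

print(f"corr(x, y)    = {linear_corr:.4f}")
print(f"corr(x**2, y) = {nonlinear_corr:.4f}")
```

A sampler that reproduced only the means and covariance matrix of `x` and `y` would treat them as nearly independent, which is the kind of higher-order structure a generative model of the joint distribution is meant to capture.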
GANs are considered a deep learning (DL) method because many GANs require deep neural networks (DNN; “deep” refers to the number of network layers) for the generator and discriminator architectures. Although there is increasing interest in leveraging AI approaches, including DL, in drug discovery and development, the assessments of DL and GANs in pharmacometrics have been preliminary 22,23. Liu et al. 23 used a long short-term memory DNN (LSTM, a common neural network architecture that is effective for time series) to model simulated PK/PD data of a hypothetical drug. The plasma concentration and effect level under one dosing regimen were used to train the model, and the model was used to predict the individual PK/PD for other dosing regimens. Lu 22 included neural ordinary differential equations for forecasting PK/PD of platelet responses in a clinical dataset of 800 patients. It should be noted that, like many AI and DL methods, GAN methods can be computationally intensive; however, graphics processing units (GPUs) and high-performance computing (HPC) architectures can improve the performance of AI algorithms substantially 24,25.
Our results demonstrate the potential of the GAN approach for modeling the joint distribution of complex systems of disease-relevant biomarkers in under-represented groups. The approach may find utility for generating virtual patient populations for clinical trial simulations and pharmacometrics.