FIGURE LEGENDS
Figure 1. Schematic of the generative adversarial network (GAN) method. A GAN consists of two neural networks: the generator and the discriminator. The generator takes random variables from a latent space as input and computes generated data via its neural network. The discriminator takes the training data containing biomarkers and the generated data from the generator as inputs. The discriminator is a binary classifier whose outputs are used to compute the generator and discriminator loss functions, which are in turn used to update both neural networks via backpropagation.
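The two loss functions named in the Figure 1 legend can be illustrated with a minimal sketch. The legend does not state which loss formulation was used, so the code below assumes the standard binary cross-entropy GAN losses with a non-saturating generator loss; the function names and the `eps` stabilizer are hypothetical choices made for illustration only.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy with real samples labeled 1 and generated samples labeled 0.

    d_real, d_fake: discriminator output probabilities (values in (0, 1)).
    Assumption: standard GAN loss; the source does not specify the exact form.
    """
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: low when generated samples are scored as real."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)

# A discriminator that cannot tell the two sources apart outputs 0.5 everywhere.
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))  # approximately 2 * ln(2)
print(generator_loss([0.5, 0.5]))                  # approximately ln(2)
```

In a full training loop, the gradients of these two quantities with respect to the network weights would drive the backpropagation updates described in the legend.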
Figure 2. Comparison of the probability density histogram of a representative generated data set from the generative adversarial network (teal bars) with the probability density histogram of the test data (salmon bars) from the univariate analyses of 8 diabetes-associated biomarkers. The dark gray bars correspond to the regions of overlap between the two probability density histograms. Eight biomarkers are shown: urine albumin (Figure 2A), urine creatinine (Figure 2B), fasting glucose (Figure 2C), insulin (Figure 2D), body mass index (Figure 2E), glycohemoglobin (Figure 2F), triglyceride (Figure 2G), and total cholesterol (Figure 2H). The x-axes on all graphs are biomarker levels that are log-transformed and scaled to lie between -1 and 1. The p-values from the Kolmogorov-Smirnov test are shown on the top left.
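The preprocessing and the comparison statistic mentioned in the Figure 2 legend can be sketched as follows. This is an illustration under stated assumptions: the legend does not say which logarithm base or scaling method was used, so the natural log and min-max scaling are assumed here, and both function names are hypothetical. The code computes only the two-sample Kolmogorov-Smirnov statistic; the corresponding p-value would in practice come from a statistics library such as `scipy.stats.ks_2samp`.

```python
import bisect
import math

def log_scale(values):
    """Log-transform positive values, then min-max scale them to [-1, 1].

    Assumption: natural log and min-max scaling; the source only states that
    the data are log-transformed and scaled to lie between -1 and 1.
    """
    logs = [math.log(v) for v in values]
    lo, hi = min(logs), max(logs)
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in logs]

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

scaled = log_scale([1.0, 10.0, 100.0])
print(scaled)
print(ks_statistic([1, 2, 3], [4, 5, 6]))
```

A statistic near 0 indicates that the generated and test distributions are close, while a value near 1 indicates that they barely overlap.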
Figure 3. The t-distributed stochastic neighbor embedding (t-SNE, Figure 3A), uniform manifold approximation and projection (UMAP, Figure 3B) and principal component analysis (PCA, Figure 3C) two-dimensional projections of the 14-dimensional, diabetes-associated biomarker data. The test data results are shown in salmon circles and the GAN-generated results are in teal circles. The x-axis (t-SNE X and UMAP X) and y-axis (t-SNE Y and UMAP Y) correspond to the t-SNE and UMAP projections into two dimensions of the 14-dimensional biomarker levels, which are log-transformed and scaled to lie between -1 and 1. PC 1 and PC 2 on the x-axis and y-axis of Figure 3C correspond to the first and second principal components, respectively. Figure 3D is a pairs panel that compares the univariate and bivariate GAN-generated distributions (teal circles) to the test data (salmon circles). The diagonal contains the univariate density for the GAN-generated and test data distributions. The area of overlap is shaded dark gray. The upper triangular region contains the Spearman bivariate correlation coefficients for the test (salmon font) and GAN-generated distributions (teal font). Only 7 of the 14 variables are shown. All variables were log-transformed and scaled to lie in the range [-1, 1]. ALB: Albumin, urine; CRE: Creatinine, urine; GLU: Fasting glucose; INS: Insulin; BMI: Body mass index; GLHB: Glycohemoglobin; TG: Triglyceride.
Figure 4. The t-distributed stochastic neighbor embedding (t-SNE) two-dimensional projections of the 14-dimensional, diabetes-associated biomarker data for the Black, Hispanic, Other and White race categories. The test data results are shown in salmon circles and the GAN-generated results are in teal circles. The x-axis (t-SNE X) and y-axis (t-SNE Y) correspond to the t-SNE projections into two dimensions of the 14-dimensional biomarker levels, which are log-transformed and scaled to lie between -1 and 1.
Figure 5. Box plots of the univariate results from the 14-dimensional, diabetes-associated biomarker data for the Black, Hispanic, Other and White race categories. The test data are shown in salmon, and the GAN-generated results are in teal. The univariate results for eight of the 14 diabetes-associated biomarkers are shown: urine albumin (Figure 5A), urine creatinine (Figure 5B), fasting glucose (Figure 5C), insulin (Figure 5D), body mass index (Figure 5E), glycohemoglobin (Figure 5F), triglyceride (Figure 5G), and total cholesterol (Figure 5H). The y-axes on all graphs are biomarker levels that are log-transformed and scaled to lie between -1 and 1. The lines on each box correspond to the 25th percentile, median and 75th percentile, the error bars correspond to the median ± 1.5 times the interquartile range, and the outliers are shown as black circles.
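The box plot quantities described in the Figure 5 legend can be computed with a short sketch. It follows the legend as written, placing the error bars at the median ± 1.5 times the interquartile range; note that many plotting libraries instead measure whiskers from the quartiles. The function name, the returned keys, and the "inclusive" quantile method are assumptions made for illustration, since the source does not specify how the quartiles were estimated.

```python
import statistics

def box_summary(values, whisker=1.5):
    """Quartiles, error-bar limits, and outliers as described in the legend.

    Assumption: quartiles use Python's "inclusive" interpolation method;
    the source does not state which estimator was used.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = median - whisker * iqr   # lower error-bar limit
    high = median + whisker * iqr  # upper error-bar limit
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3,
            "low": low, "high": high, "outliers": outliers}

summary = box_summary([1, 2, 3, 4, 5, 20])
print(summary)
```

In this toy example the value 20 lies above the upper limit and would be drawn as an outlier.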