2.7. Data analysis
LFQ data were normalized by total ion current (TIC), and only proteins with valid intensity values in at least 50% of all samples were retained. Missing values were replaced with one-fifth of the minimum positive value of each variable using MetaboAnalyst 5.0 [21]. Quantified proteins with a fold change > 2 or < 0.5 and a P value < 0.05 were considered DEPs. In the figures, experimental data are shown as mean ± standard error of the mean (SEM).
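The preprocessing steps above can be illustrated with a minimal Python sketch. It is not the MetaboAnalyst implementation: the function names, the toy intensities, and the choice to scale each sample to a summed intensity of 1 are assumptions made for illustration only. P values are taken as given, because the statistical test is not specified here.

```python
import math
import statistics

def tic_normalize(sample):
    """Divide each intensity by the sample's summed intensity (None = missing)."""
    total = sum(v for v in sample.values() if v is not None)
    return {p: (None if v is None else v / total) for p, v in sample.items()}

def filter_valid(samples, min_fraction=0.5):
    """Keep proteins quantified in at least min_fraction of the samples."""
    needed = min_fraction * len(samples)
    keep = [p for p in samples[0]
            if sum(s[p] is not None for s in samples) >= needed]
    return [{p: s[p] for p in keep} for s in samples]

def impute_fifth_of_min(samples):
    """Fill missing values with 1/5 of that protein's minimum positive value."""
    filled = [dict(s) for s in samples]
    for p in samples[0]:
        positives = [s[p] for s in samples if s[p] is not None and s[p] > 0]
        if not positives:
            continue  # no positive value to derive a fill value from
        for s in filled:
            if s[p] is None:
                s[p] = min(positives) / 5
    return filled

def is_dep(fold_change, p_value):
    """Apply the stated cut-offs: fold change > 2 or < 0.5, and P < 0.05."""
    return (fold_change > 2 or fold_change < 0.5) and p_value < 0.05

def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical toy input: four samples, three proteins, None marks a missing value.
raw = [
    {"P1": 100.0, "P2": 300.0, "P3": None},
    {"P1": 200.0, "P2": None, "P3": None},
    {"P1": 50.0, "P2": 150.0, "P3": 200.0},
    {"P1": 300.0, "P2": 100.0, "P3": None},
]

# Same order as in the text: normalize, filter, then impute.
processed = impute_fifth_of_min(filter_valid([tic_normalize(s) for s in raw]))
```

In this toy input, P3 is quantified in only one of four samples and is therefore removed by the 50% filter, while the single missing P2 value is filled from the smallest normalized P2 intensity.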
Metascape [22] was used for functional enrichment and protein-protein interaction network analysis. P values for the functional enrichments were calculated with a hypergeometric test and corrected using the Benjamini-Hochberg FDR method. Cytoscape [23] was used to reorganize and visualize the interaction networks. The proportional Venn diagrams and the Sankey diagram were generated using a Bioinformatics online tool. The artwork was created with BioRender.com. MetaboAnalyst 5.0 [21] was used for the statistical analysis and biomarker discovery of DEPs, including unsupervised clustering, PCA, Pearson correlation analysis, and machine learning.
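For readers unfamiliar with the enrichment statistics, the following Python sketch shows one common way to compute an upper-tail hypergeometric P value and a Benjamini-Hochberg adjustment. It is an illustration under stated assumptions, not Metascape's code: the function names, the argument names, and the example P values are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k).

    N: background proteins, K: background proteins carrying the annotation,
    n: proteins in the query list, k: query proteins carrying the annotation.
    """
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def benjamini_hochberg(pvals):
    """Return BH-adjusted P values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: all 5 query proteins fall in a term that covers
# 5 of 10 background proteins.
p_term = hypergeom_pvalue(k=5, n=5, K=5, N=10)

# Hypothetical raw P values for four terms.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The adjustment multiplies each P value by the number of tests divided by its rank, then takes a running minimum from the largest P value downward so that adjusted values never decrease with rank.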
For machine learning, ROC curves were generated using MetaboAnalyst 5.0. Multivariate ROC curves were generated by Monte Carlo cross-validation (MCCV) with balanced sub-sampling. In each MCCV iteration, two-thirds of the samples were used to evaluate feature importance. The top 2, 3, 5, 10 …100 (max) important features were then used to build classification models, which were validated on the remaining one-third of the samples. The procedure was repeated multiple times to calculate the performance and confidence interval of each model. PLS-DA was used as the classification method, and the PLS-DA built-in method, with two latent variables, was selected for feature ranking. Feature selection was based on the ROC curve results, and the top 5, 10, 15, 25, 50, and 100 proteins were used to assess predictive accuracy.
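The MCCV loop described above can be sketched in plain Python as follows. This is a simplified illustration, not the MetaboAnalyst procedure. PLS-DA is replaced here by a centroid-difference score, and features are ranked by the absolute difference in class means, purely as stand-ins. Balanced sub-sampling is interpreted as drawing the same number of training samples from each class, which is an assumption. All names and the synthetic data are hypothetical.

```python
import random
import statistics

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_split(y, rng):
    """Draw two-thirds of the smaller class from each class for training."""
    by_class = {0: [], 1: []}
    for i, label in enumerate(y):
        by_class[label].append(i)
    n_train = (2 * min(len(v) for v in by_class.values())) // 3
    train = []
    for idx in by_class.values():
        train += rng.sample(idx, n_train)
    chosen = set(train)
    test = [i for i in range(len(y)) if i not in chosen]
    return train, test

def class_means(X, y, idx, label):
    """Per-feature mean over the training rows of one class."""
    rows = [X[i] for i in idx if y[i] == label]
    return [statistics.fmean(col) for col in zip(*rows)]

def mccv_auc(X, y, n_features, n_iter=50, seed=0):
    """Mean held-out AUC and a 95% percentile interval over n_iter splits."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_iter):
        train, test = balanced_split(y, rng)
        mu0 = class_means(X, y, train, 0)
        mu1 = class_means(X, y, train, 1)
        diff = [b - a for a, b in zip(mu0, mu1)]
        # Stand-in ranking: largest absolute difference in class means.
        top = sorted(range(len(diff)), key=lambda j: -abs(diff[j]))[:n_features]
        # Stand-in classifier: projection onto the centroid difference.
        scores = [sum((X[i][j] - (mu0[j] + mu1[j]) / 2) * diff[j] for j in top)
                  for i in test]
        aucs.append(auc(scores, [y[i] for i in test]))
    aucs.sort()
    lo = aucs[int(0.025 * (n_iter - 1))]
    hi = aucs[int(0.975 * (n_iter - 1))]
    return statistics.fmean(aucs), (lo, hi)

# Synthetic data: feature 0 separates the classes, features 1-4 are noise.
data_rng = random.Random(1)
X, y = [], []
for label in (0, 1):
    for i in range(12):
        informative = 5 * label + 0.1 * i
        X.append([informative] + [data_rng.gauss(0, 1) for _ in range(4)])
        y.append(label)

# One model per feature count, analogous to the top-k models in the text.
results = {k: mccv_auc(X, y, k) for k in (1, 2, 3)}
```

Ranking features inside each training split, rather than once on the full data set, keeps the held-out third independent of feature selection, which is the main reason for nesting the two steps.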