This project classified COVID-19 severity (Severe vs. Non-severe) from mass spectrometry proteomic data using a leakage-free machine learning pipeline. The dataset started as 101,461 peptide variants across 268 columns and was reduced to a 43-patient by 200-peptide matrix through a structured preprocessing pipeline before any model saw the data.

Preprocessing and Feature Selection Pipeline

The pipeline ran in six stages. Every stage that learns from the data (feature selection, imputation, and scaling) was fit strictly within cross-validation training folds, so the held-out patient could not influence any processing decision.

Column filtering removed 176 metadata columns and 2 lab reference channels, leaving 90 patient intensity columns. Zero replacement converted all zero intensity values to NaN, since a zero in mass spectrometry indicates a failed detection rather than a true measurement of zero abundance. The matrix was then filtered to the 43 patients in the Severe and Non-severe groups and transposed so that rows are patients and columns are peptides. Statistical feature selection reduced the 101,461 peptides to 200 by ranking them on Mann-Whitney U test p-values within each fold, using only training patients. The Mann-Whitney U test was chosen over a t-test because mass spectrometry intensity distributions are not Gaussian. No peptide survived false discovery rate correction at this sample size, so the top 200 were used as a pragmatic cutoff. Expanding to 5,000 peptides was tested and produced worse performance, attributed to overfitting at a feature-to-sample ratio of approximately 119 (5,000 peptides for 42 training patients per fold).
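The zero-replacement and fold-local selection steps can be sketched as follows. This is a minimal illustration assuming NumPy and SciPy, not the project's actual code: the function names select_top_peptides and benjamini_hochberg, the min_obs threshold, and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    qvals = np.empty(n)
    qvals[order] = np.clip(scaled, 0.0, 1.0)
    return qvals


def select_top_peptides(X_train, y_train, k=200, min_obs=3):
    """Rank peptides by Mann-Whitney U p-value using training patients only.

    min_obs is an assumed safeguard: peptides with too few detections in
    either group keep p = 1 and therefore rank last.
    """
    X = np.where(X_train == 0, np.nan, X_train)  # zero = failed detection
    pvals = np.ones(X.shape[1])
    for j in range(X.shape[1]):
        a = X[y_train == 1, j]
        b = X[y_train == 0, j]
        a, b = a[~np.isnan(a)], b[~np.isnan(b)]
        if a.size >= min_obs and b.size >= min_obs:
            pvals[j] = mannwhitneyu(a, b, alternative="two-sided").pvalue
    top = np.argsort(pvals)[:k]
    return top, pvals, benjamini_hochberg(pvals)


# Synthetic stand-in for one training fold (42 patients, 500 peptides).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=10, sigma=1, size=(42, 500))
y = np.array([1] * 18 + [0] * 24)
X[y == 1, 0] *= 20                     # one peptide with a planted shift
X[rng.random(X.shape) < 0.2] = 0.0     # simulate failed detections

top, pvals, qvals = select_top_peptides(X, y, k=20)
print("selected peptide indices:", top)
print("peptides with q < 0.05:", int((qvals < 0.05).sum()))
```

In a cross-validated run, a function like this would be called once per fold with the held-out patient removed from X_train and y_train. The q-values show whether any peptide survives false discovery rate correction, and the top-k cutoff is applied regardless.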

Missing value imputation compared median imputation against zero imputation (filling every missing value with 0). Zero imputation was adopted because it produced a higher leave-one-out cross-validation (LOOCV) ROC-AUC (0.967 vs. 0.944), consistent with the interpretation that a missing detection carries real biological signal. All imputation statistics were computed on training data only. Finally, features were standardized to zero mean and unit variance using a scaler fit on training data. Scaling was applied only to the logistic regression models, since tree-based models are invariant to scaling.
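A short sketch of fitting imputation and scaling on training patients only, assuming scikit-learn. The helper make_preprocessor and the random data are illustrative assumptions. The point is that the held-out row is transformed with statistics learned without it.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(43, 6))
X[rng.random(X.shape) < 0.25] = np.nan      # missing detections
X_train, X_heldout = X[:-1], X[-1:]         # one leave-one-out fold


def make_preprocessor(strategy):
    """Build an imputer + scaler pair; strategy is 'zero' or 'median'."""
    if strategy == "zero":
        imputer = SimpleImputer(strategy="constant", fill_value=0.0)
    else:
        imputer = SimpleImputer(strategy="median")
    return make_pipeline(imputer, StandardScaler())


# Both preprocessors see the 42 training patients only.
prep_zero = make_preprocessor("zero").fit(X_train)
prep_median = make_preprocessor("median").fit(X_train)

# The held-out patient is transformed with training-derived statistics.
Z_heldout = prep_zero.transform(X_heldout)
train_medians = prep_median.named_steps["simpleimputer"].statistics_
scaler_means = prep_zero.named_steps["standardscaler"].mean_
print("held-out row after zero imputation and scaling:", Z_heldout.round(2))
```

Zero imputation has no statistics to learn, but the scaler that follows it still does, so the pair is fit per fold in either case.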

Model Evaluation

Three classifiers were evaluated under Leave-One-Out Cross-Validation across all 43 patients: L2 logistic regression, L1 logistic regression, and random forest. LOOCV was chosen over a fixed train-test split because, with only 43 patients, a 9-patient held-out set means a single misclassification shifts accuracy by roughly 11 percentage points (1/9), making any fixed split unreliable.
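The evaluation loop can be sketched as below, assuming scikit-learn and SciPy. The MannWhitneyTopK transformer, the build helper, the hyperparameters, and the synthetic data are hypothetical stand-ins. The sketch only shows how selection, imputation, and scaling can live inside a pipeline so that each leave-one-out fold refits them without the held-out patient.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class MannWhitneyTopK(TransformerMixin, BaseEstimator):
    """Hypothetical NaN-aware selector: keep the k smallest p-values."""

    def __init__(self, k=200):
        self.k = k

    def fit(self, X, y):
        y = np.asarray(y)
        pvals = np.ones(X.shape[1])
        for j in range(X.shape[1]):
            a, b = X[y == 1, j], X[y == 0, j]
            a, b = a[~np.isnan(a)], b[~np.isnan(b)]
            if a.size >= 3 and b.size >= 3:
                pvals[j] = mannwhitneyu(a, b, alternative="two-sided").pvalue
        self.keep_ = np.argsort(pvals)[: self.k]
        return self

    def transform(self, X):
        return X[:, self.keep_]


def build(model, scale):
    """Selection -> zero imputation -> optional scaling -> classifier."""
    steps = [("select", MannWhitneyTopK(k=10)),
             ("impute", SimpleImputer(strategy="constant", fill_value=0.0))]
    if scale:  # scaling only for the logistic regression models
        steps.append(("scale", StandardScaler()))
    steps.append(("model", model))
    return Pipeline(steps)


models = {
    "L2 logistic": build(LogisticRegression(max_iter=1000), scale=True),
    "L1 logistic": build(LogisticRegression(penalty="l1", solver="liblinear"), scale=True),
    "random forest": build(RandomForestClassifier(n_estimators=50, random_state=0), scale=False),
}

# Small synthetic data set: 30 patients, 60 peptides, 3 informative.
rng = np.random.default_rng(0)
y = np.array([1] * 13 + [0] * 17)
X = rng.normal(size=(30, 60))
X[y == 1, :3] += 2.0
X[rng.random(X.shape) < 0.15] = np.nan

aucs = {}
for name, pipe in models.items():
    proba = cross_val_predict(pipe, X, y, cv=LeaveOneOut(), method="predict_proba")[:, 1]
    aucs[name] = roc_auc_score(y, proba)  # pooled over all held-out patients
    print(f"{name}: LOOCV ROC-AUC = {aucs[name]:.3f}")

# Accuracy granularity: one error in a 9-patient test set vs. 43 pooled predictions.
step_fixed, step_loocv = 1 / 9, 1 / 43
print(f"one misclassification: {step_fixed:.1%} (fixed split) vs {step_loocv:.1%} (LOOCV)")
```

Pooling the 43 held-out predictions before computing ROC-AUC is one common convention for LOOCV, since a single held-out patient cannot yield an AUC on its own.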

L2 logistic regression achieved 90.7% accuracy, an F1 of 0.889, and a ROC-AUC of 0.967, and was selected as the primary classification model. L1 logistic regression (ROC-AUC 0.927) was retained as a secondary model for its sparsity: it reduced the 200 input peptides to 6 with nonzero coefficients, providing a direct, interpretable mapping from peptide abundance to disease severity. Random forest had the lowest recall at 61.1%, identifying 11 of 18 Severe patients and missing 7. Because a missed Severe patient is the most clinically costly failure mode, it was not selected.
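The sparsity contrast between the two logistic regression penalties can be illustrated with a small sketch, assuming scikit-learn. The data are synthetic and the regularization strengths (C values) are arbitrary assumptions, so the number of surviving coefficients here will not match the 6 reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_patients, n_peptides = 43, 200
y = np.array([1] * 18 + [0] * 25)
X = rng.normal(size=(n_patients, n_peptides))
X[y == 1, :5] += 1.5                        # five planted informative peptides
X = StandardScaler().fit_transform(X)       # final-model fit, no evaluation here

# L2 shrinks coefficients toward zero; L1 sets many of them exactly to zero.
l2 = LogisticRegression(C=1.0, max_iter=2000).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.2, solver="liblinear").fit(X, y)

n_l2 = int(np.count_nonzero(l2.coef_))
n_l1 = int(np.count_nonzero(l1.coef_))
selected = np.flatnonzero(l1.coef_[0])      # indices of surviving peptides

print(f"nonzero coefficients: L2 = {n_l2}, L1 = {n_l1}")
print("peptides kept by L1:", selected)
```

The sign of each surviving L1 coefficient indicates whether higher standardized abundance of that peptide pushes the prediction toward Severe or Non-severe, which is what makes the sparse model easy to read.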

The original pipeline ran feature selection across all 43 patients before splitting, which produced a perfect ROC-AUC of 1.000. Moving selection inside the training data lowered ROC-AUC to 0.950 on a fixed test set, a drop of 0.050 that directly quantifies how much the leak inflated results.
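The effect of this kind of leak can be reproduced on pure noise, where no honest model should beat chance. The sketch below assumes scikit-learn and SciPy. The mwu_score and loocv_auc helpers and the data dimensions are illustrative choices, and the resulting numbers are unrelated to the 1.000 and 0.950 reported above.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def mwu_score(X, y):
    """Score function for SelectKBest: higher score = smaller p-value."""
    p = mannwhitneyu(X[y == 1], X[y == 0], axis=0).pvalue
    return -np.log10(p), p


def loocv_auc(estimator, X, y):
    """ROC-AUC over pooled leave-one-out predictions."""
    proba = cross_val_predict(estimator, X, y, cv=LeaveOneOut(),
                              method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))           # pure noise: labels carry no signal
y = np.array([1] * 20 + [0] * 20)

# Leaky: select features using all patients, then cross-validate the classifier.
X_leaky = SelectKBest(mwu_score, k=20).fit_transform(X, y)
leaky_auc = loocv_auc(make_pipeline(StandardScaler(), LogisticRegression()), X_leaky, y)

# Leak-free: selection is refit inside every training fold.
honest = make_pipeline(SelectKBest(mwu_score, k=20), StandardScaler(), LogisticRegression())
honest_auc = loocv_auc(honest, X, y)

print(f"leaky ROC-AUC:     {leaky_auc:.3f}")
print(f"leak-free ROC-AUC: {honest_auc:.3f}")
```

Because the features are noise, any gap between the two scores comes entirely from letting held-out patients influence which features were chosen.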