Research on Feature Dimension Reduction Method of Serum Mass Spectrometry Data Based on Machine Learning

Mass spectrometry data describe the mass-to-charge ratio and relative intensity of compounds, and are characterised by high dimensionality, high noise, and high computational cost in processing. Serum mass spectrometry data can provide key information for the diagnosis and treatment of colon cancer, so reasonable dimensionality reduction of such data is of great significance. Feature dimensionality reduction methods currently fall into two main categories, feature extraction and feature selection, each with its own advantages and disadvantages. In this paper, we apply four methods, PCA-SVM, PCA-RF, RFE-SVM and RFE-RF, to reduce the dimensionality of preprocessed serum mass spectrometry data and classify it, where PCA is Principal Component Analysis, RFE is Recursive Feature Elimination, SVM is support vector machine, and RF is random forest. The classification accuracy of each model is obtained after optimising its parameters. The classification results are reported as confusion matrices, which show how samples of each category are classified. The results show that the models based on Recursive Feature Elimination classify better than those based on Principal Component Analysis, which can help medical practitioners diagnose patients more efficiently.
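
The four model combinations named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: it relies on scikit-learn, replaces the serum spectra with synthetic data from `make_classification`, and uses fixed, hypothetical settings (`N_KEEP`, the linear kernel, `C`, `n_estimators`, the RFE `step`) where the paper optimises parameters.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic stand-in for preprocessed spectra
# (200 samples, 300 intensity features, 2 classes); not real serum data.
X, y = make_classification(n_samples=200, n_features=300,
                           n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

N_KEEP = 20  # hypothetical number of components / selected features


def new_svm():
    # A linear kernel exposes coef_, which RFE needs to rank features.
    return SVC(kernel="linear", C=1.0)


def new_rf():
    # Random forests expose feature_importances_, which RFE can also use.
    return RandomForestClassifier(n_estimators=100, random_state=0)


models = {
    # Feature extraction: project onto N_KEEP principal components.
    "PCA-SVM": make_pipeline(StandardScaler(), PCA(n_components=N_KEEP), new_svm()),
    "PCA-RF": make_pipeline(StandardScaler(), PCA(n_components=N_KEEP), new_rf()),
    # Feature selection: recursively drop 20% of the remaining features
    # per round until N_KEEP of the original features are left.
    "RFE-SVM": make_pipeline(
        StandardScaler(),
        RFE(new_svm(), n_features_to_select=N_KEEP, step=0.2),
        new_svm()),
    "RFE-RF": make_pipeline(
        StandardScaler(),
        RFE(new_rf(), n_features_to_select=N_KEEP, step=0.2),
        new_rf()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Rows are true classes, columns are predicted classes.
        "confusion": confusion_matrix(y_test, y_pred),
    }

for name, res in results.items():
    print(name, f"accuracy={res['accuracy']:.3f}")
    print(res["confusion"])
```

The accuracies printed on synthetic data say nothing about the comparison reported in this paper. To mirror the parameter optimisation step, each pipeline could be wrapped in scikit-learn's `GridSearchCV` over, for example, the number of retained features and the classifier hyperparameters.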