RT info:eu-repo/semantics/article
T1 A Comparative Study on Feature Selection for a Risk Prediction Model for Colorectal Cancer
A1 Cueto López, Nahúm
A1 García Ordás, María Teresa
A1 Dávila Batista, Verónica
A1 Aragonés, Nuria
A1 Alaiz Rodríguez, Rocío
A1 Moreno, Víctor
A2 Ingenieria de Sistemas y Automatica
K1 Mathematics
K1 Medicine. Health
K1 Colorectal cancer
K1 Risk prediction model
K1 Feature selection
K1 Stability
K1 2404 Biomathematics
AB [EN] Background and objective: Risk prediction models aim at identifying people at higher risk of developing a target disease. Feature selection is particularly important to improve the prediction model performance, avoiding overfitting, and to identify the leading cancer risk (and protective) factors. Assessing the stability of feature selection/ranking algorithms becomes an important issue when the aim is to analyze the features with more prediction power. Methods: This work is focused on colorectal cancer, assessing several feature ranking algorithms in terms of performance for a set of risk prediction models (Neural Networks, Support Vector Machines (SVM), Logistic Regression, k-Nearest Neighbors and Boosted Trees). Additionally, their robustness is evaluated following a conventional approach with scalar stability metrics and a visual approach proposed in this work to study both similarity among feature ranking techniques as well as their individual stability. A comparative analysis is carried out between the most relevant features found in this study and features provided by the experts according to the state-of-the-art knowledge. Results: The two best performance results in terms of Area Under the ROC Curve (AUC) are achieved with an SVM classifier using the top-41 features selected by the SVM wrapper approach (AUC=0.693) and Logistic Regression with the top-40 features selected by the Pearson correlation coefficient (AUC=0.689). Experiments showed that performing feature selection contributes to classification performance with a 3.9% and 1.9% improvement in AUC for the SVM and Logistic Regression classifier, respectively, with respect to the results using the full feature set. The visual approach proposed in this work shows that the Neural Network-based wrapper ranking is the most unstable while the Random Forest is the most stable. Conclusions: This study demonstrates that stability and model performance should be studied jointly, as Random Forest turned out to be the most stable algorithm but was outperformed by others in terms of model performance, while SVM wrapper and the Pearson correlation coefficient are moderately stable while achieving good model performance. © 2019 Elsevier B.V. All rights reserved.
PB Elsevier
SN 0169-2607
LK https://hdl.handle.net/10612/17654
UL https://hdl.handle.net/10612/17654
NO Cueto-López, N., García-Ordás, M. T., Dávila-Batista, V., Moreno, V., Aragonés, N., & Alaiz-Rodríguez, R. (2019). A comparative study on feature selection for a risk prediction model for colorectal cancer. Computer Methods and Programs in Biomedicine, 177, 219-229. https://doi.org/10.1016/J.CMPB.2019.06.001
DS BULERIA. Repositorio Institucional de la Universidad de León
RD 17-may-2024