RT info:eu-repo/semantics/article
T1 A Comparative Study on Feature Selection for a Risk Prediction Model for Colorectal Cancer
A1 Cueto López, Nahúm
A1 García Ordás, María Teresa
A1 Dávila Batista, Verónica
A1 Aragonés, Nuria
A1 Alaiz Rodríguez, Rocío
A1 Moreno, Víctor
A2 Ingenieria de Sistemas y Automatica
K1 Mathematics
K1 Medicine. Health
K1 Colorectal cancer
K1 Risk prediction model
K1 Feature selection
K1 Stability
K1 2404 Biomathematics
AB [EN] Background and objective: Risk prediction models aim at identifying people at higher risk of developing a target disease. Feature selection is particularly important to improve the prediction model performance, avoiding overfitting, and to identify the leading cancer risk (and protective) factors. Assessing the stability of feature selection/ranking algorithms becomes an important issue when the aim is to analyze the features with more prediction power. Methods: This work is focused on colorectal cancer, assessing several feature ranking algorithms in terms of performance for a set of risk prediction models (Neural Networks, Support Vector Machines (SVM), Logistic Regression, k-Nearest Neighbors and Boosted Trees). Additionally, their robustness is evaluated following a conventional approach with scalar stability metrics and a visual approach proposed in this work to study both similarity among feature ranking techniques as well as their individual stability. A comparative analysis is carried out between the most relevant features found in this study and features provided by the experts according to the state-of-the-art knowledge. Results: The two best performance results in terms of Area Under the ROC Curve (AUC) are achieved with an SVM classifier using the top-41 features selected by the SVM wrapper approach (AUC=0.693) and Logistic Regression with the top-40 features selected by the Pearson correlation coefficient (AUC=0.689). Experiments showed that performing feature selection contributes to classification performance with a 3.9% and 1.9% improvement in AUC for the SVM and Logistic Regression classifier, respectively, with respect to the results using the full feature set. The visual approach proposed in this work shows that the Neural Network-based wrapper ranking is the most unstable while the Random Forest is the most stable. Conclusions: This study demonstrates that stability and model performance should be studied jointly, as Random Forest turned out to be the most stable algorithm but was outperformed by others in terms of model performance, while SVM wrapper and the Pearson correlation coefficient are moderately stable while achieving good model performance. © 2019 Elsevier B.V. All rights reserved
PB Elsevier
SN 0169-2607
LK https://hdl.handle.net/10612/17654
UL https://hdl.handle.net/10612/17654
NO Cueto-López, N., García-Ordás, M. T., Dávila-Batista, V., Moreno, V., Aragonés, N., & Alaiz-Rodríguez, R. (2019). A comparative study on feature selection for a risk prediction model for colorectal cancer. Computer Methods and Programs in Biomedicine, 177, 219-229. https://doi.org/10.1016/J.CMPB.2019.06.001
DS BULERIA. Repositorio Institucional de la Universidad de León
RD 17-may-2024