Title
A Comparative Study on Feature Selection for a Risk Prediction Model for Colorectal Cancer
Author
Cueto-López, N.; García-Ordás, M. T.; Dávila-Batista, V.; Moreno, V.; Aragonés, N.; Alaiz-Rodríguez, R.
Journal title
Computer Methods and Programs in Biomedicine
Bibliographic citation
Cueto-López, N., García-Ordás, M. T., Dávila-Batista, V., Moreno, V., Aragonés, N., & Alaiz-Rodríguez, R. (2019). A comparative study on feature selection for a risk prediction model for colorectal cancer. Computer Methods and Programs in Biomedicine, 177, 219-229. https://doi.org/10.1016/J.CMPB.2019.06.001
Publisher
Elsevier
Date
2019-06-02
ISSN
0169-2607
Abstract
Background and objective: Risk prediction models aim to identify people at higher risk of developing a target disease. Feature selection is particularly important both to improve prediction model performance while avoiding overfitting and to identify the leading cancer risk (and protective) factors. Assessing the stability of feature selection/ranking algorithms becomes an important issue when the aim is to analyze the features with the most predictive power.
Methods: This work focuses on colorectal cancer, assessing several feature ranking algorithms in terms of performance for a set of risk prediction models (Neural Networks, Support Vector Machines (SVM), Logistic Regression, k-Nearest Neighbors and Boosted Trees). Additionally, their robustness is evaluated following a conventional approach with scalar stability metrics and a visual approach proposed in this work to study both the similarity among feature ranking techniques and their individual stability. A comparative analysis is carried out between the most relevant features identified in this study and the features provided by experts according to state-of-the-art knowledge.
Results: The two best results in terms of Area Under the ROC Curve (AUC) are achieved with an SVM classifier using the top-41 features selected by the SVM wrapper approach (AUC=0.693) and Logistic Regression with the top-40 features selected by the Pearson correlation coefficient (AUC=0.689). Experiments showed that performing feature selection contributes to classification performance, with a 3.9% and 1.9% improvement in AUC for the SVM and Logistic Regression classifiers, respectively, with respect to the results using the full feature set. The visual approach proposed in this work shows that the Neural Network-based wrapper ranking is the most unstable while the Random Forest ranking is the most stable.
Conclusions: This study demonstrates that stability and model performance should be studied jointly: Random Forest turned out to be the most stable algorithm but was outperformed by others in terms of model performance, while the SVM wrapper and the Pearson correlation coefficient are moderately stable and achieve good model performance.
© 2019 Elsevier B.V. All rights reserved.
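Illustrative note: the abstract refers to scalar stability metrics for feature ranking but does not name them. The following is a minimal sketch, not the authors' method, of one common way such a score might be computed: features are ranked by absolute Pearson correlation with the class label on several bootstrap resamples, and stability is taken as the mean pairwise Jaccard index of the top-k subsets. The choice of the Jaccard index, the synthetic data, and all function names (`pearson`, `rank_features`, `jaccard`, `top_k_stability`, `make_toy_data`) are assumptions introduced here for illustration.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no ranking information
    return cov / (var_x * var_y) ** 0.5


def rank_features(X, y):
    """Feature indices sorted by decreasing |Pearson correlation| with y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def jaccard(a, b):
    """Jaccard index of two feature subsets (1.0 means identical subsets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def top_k_stability(X, y, k, n_resamples=20, seed=0):
    """Mean pairwise Jaccard index of top-k subsets over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        subsets.append(rank_features(Xb, yb)[:k])
    return mean(jaccard(a, b) for a, b in combinations(subsets, 2))


def make_toy_data(n_samples=200, n_features=10, n_informative=3, seed=1):
    """Hypothetical synthetic data: the first n_informative columns depend on y."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_samples)]  # balanced binary labels
    X = [
        [
            rng.gauss(2.0 * label if j < n_informative else 0.0, 1.0)
            for j in range(n_features)
        ]
        for label in y
    ]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("top-3 ranking:", rank_features(X, y)[:3])
    print("top-3 stability:", top_k_stability(X, y, k=3))
```

A score close to 1 indicates that nearly the same top-k features are selected on every resample, and a score close to 0 indicates that the selection changes almost entirely from one resample to the next. As the conclusions above stress, such a score is only informative when read alongside model performance.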
Peer review
Yes
DOI
https://doi.org/10.1016/J.CMPB.2019.06.001
Appears in collections
- Articles [4665]
Files in this item
Size:
1.358 MB
Format:
Adobe PDF