A Feature Selection Method Based on Graph Theory for Cancer Classification


Abstract

Objective: Gene expression profiles are a valuable data source for studying tumors, but the data are high-dimensional and highly redundant. Gene selection is therefore an important step in microarray data classification.

Method: This paper proposes a feature selection method based on the maximal information coefficient (MIC) and graph theory. Each feature (gene) in the expression data is treated as a vertex, and the MIC between each pair of genes measures the relationship between the corresponding vertices, which yields an undirected graph. Core and coritivity theory is then applied to this graph to determine the feature subset.
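The graph construction and the core search described in the method can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: absolute Pearson correlation stands in for the pairwise dependence measure, the edge threshold is a hypothetical parameter, a vertex cut is taken to be any vertex set whose removal increases the number of connected components, and the coritivity is taken as the maximum of (components of G - S) minus |S| over vertex cuts S, with the core being a set S that attains it. The brute-force search is only practical for very small graphs, and all function names are invented for this sketch.

```python
from itertools import combinations
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation; an assumed stand-in for the paper's measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant feature carries no linear association
    return abs(cov / sqrt(vx * vy))


def build_graph(features, threshold):
    """One vertex per feature; join two features whose score reaches threshold."""
    adj = {i: set() for i in range(len(features))}
    for i, j in combinations(range(len(features)), 2):
        if abs_pearson(features[i], features[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def count_components(adj, removed):
    """Number of connected components left after deleting the `removed` vertices."""
    seen = set(removed)
    count = 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return count


def core_and_coritivity(adj):
    """Brute-force search for a cut S maximising components(G - S) - |S|.

    Returns (core, coritivity), or (None, None) if the graph has no vertex cut.
    """
    base = count_components(adj, ())
    vertices = sorted(adj)
    best_core, best_value = None, None
    # Keep at least two vertices so that G - S can still be disconnected.
    for size in range(1, len(vertices) - 1):
        for subset in combinations(vertices, size):
            parts = count_components(adj, subset)
            if parts <= base:
                continue  # removing this set does not disconnect anything
            value = parts - size
            if best_value is None or value > best_value:
                best_core, best_value = set(subset), value
    return best_core, best_value
```

In a star-shaped graph, for example, removing the hub isolates every leaf, so the hub alone forms the core. How the core is mapped to the final gene subset is not specified in the abstract, so the sketch stops at returning it.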

Results: To avoid reliance on any single classifier or evaluation metric, we evaluated classification performance with three different classification models and three evaluation metrics: accuracy, F1-score, and AUC. Experimental results on six different types of genetic data show that the proposed algorithm achieves high accuracy and robustness compared with other advanced feature selection methods.
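For reference, the three metrics named in the results can be computed as in the sketch below. It assumes the binary case with labels 0 and 1, which the abstract does not state, and it computes AUC as the fraction of positive-negative pairs ranked correctly, counting ties as one half. The function names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Probability that a random positive scores above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Reporting all three guards against a classifier that looks strong on accuracy alone, which can happen when the classes are imbalanced.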

Conclusion: The method considers the importance of features and the correlation between them at the same time, and thereby addresses the problem of gene selection in microarray data classification.

About the authors

Kai Zhou

School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science

Email: info@benthamscience.net

Zhixiang Yin

School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science

Author for correspondence.
Email: info@benthamscience.net

Jiaying Gu

School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science

Email: info@benthamscience.net

Zhiliang Zeng

School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science

Email: info@benthamscience.net

Copyright (c) 2024 Bentham Science Publishers