
Semi-supervised feature selection using maximum mutual information and minimum correlated feature set retrieved by augmented learning
Arghya Kusum Das, Saptarsi Goswami, Amlan Chakrabarti, Basabi Chakraborty
Arghya Kusum Das
Department of Computer Science & Engineering, Techno International New Town

Corresponding Author: [email protected]

Saptarsi Goswami
Department of Computer Science, Bangabasi Morning College, University of Calcutta
Amlan Chakrabarti
A.K. Choudhury School of Information Technology, University of Calcutta
Basabi Chakraborty
Iwate Prefectural University

Abstract

Feature selection is a critical pre-processing step in machine learning. In supervised problems, class labels are used to identify important features; however, labeling data is labor-intensive and hence costly, so unlabeled data is abundant while labeled data is scarce. This makes semi-supervised learning highly pertinent, and the problem of feature selection is equally relevant in that setting. In this work, a new semi-supervised feature selection method is proposed. First, a gradient boosting classifier is used to label the unlabeled portion of the data and thereby augment the training set. Repeated sampling from the unlabeled portion generates multiple augmented training sets. From each augmented training set, the top-k features are selected based on mutual information, with a correlation coefficient threshold used to ensure the selected features are not redundant. A voting-based approach then combines these multiple feature sets. The proposed method is compared with (a) supervised feature selection on the full dataset (the benchmark) and (b) supervised feature selection on the labeled portion alone. Comparing these three methods across 18 datasets, semi-supervised feature selection outperforms the supervised model on the labeled portion in terms of F1 score by 2.78% and 2.63% in two different configurations, and it also outperforms the benchmark by 0.36% and 1.12%.
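The abstract outlines a multi-step pipeline: pseudo-labeling, repeated augmentation, correlation-filtered mutual-information ranking, and voting. The following is a minimal Python sketch of that pipeline using scikit-learn. The sampling fraction, k, correlation threshold, number of rounds, and classifier settings are illustrative assumptions, not the configuration reported in the paper.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import mutual_info_classif


def select_one_round(X_lab, y_lab, X_unlab, k, corr_thresh, rng):
    """One augmentation round: pseudo-label a random sample of the
    unlabeled pool, then greedily pick the top-k mutual-information
    features, skipping any feature that correlates strongly with one
    already selected."""
    # Pseudo-label a random half of the unlabeled pool (fraction is an assumption).
    idx = rng.choice(len(X_unlab), size=len(X_unlab) // 2, replace=False)
    clf = GradientBoostingClassifier(random_state=0).fit(X_lab, y_lab)
    X_aug = np.vstack([X_lab, X_unlab[idx]])
    y_aug = np.concatenate([y_lab, clf.predict(X_unlab[idx])])

    # Rank features by mutual information with the augmented labels,
    # and precompute the feature-feature correlation matrix.
    mi = mutual_info_classif(X_aug, y_aug, random_state=0)
    corr = np.abs(np.corrcoef(X_aug, rowvar=False))

    selected = []
    for f in np.argsort(mi)[::-1]:
        if all(corr[f, s] < corr_thresh for s in selected):
            selected.append(f)
        if len(selected) == k:
            break
    return selected


def semi_supervised_select(X_lab, y_lab, X_unlab, rounds=5, k=10, corr_thresh=0.8):
    """Repeat the augmentation round and combine the per-round
    feature sets by majority voting."""
    rng = np.random.default_rng(42)
    votes = Counter()
    for _ in range(rounds):
        votes.update(select_one_round(X_lab, y_lab, X_unlab, k, corr_thresh, rng))
    # Keep the k features selected most often across rounds.
    return [f for f, _ in votes.most_common(k)]
```

The vote simply keeps the k most frequently selected feature indices; the exact set size and tie-breaking rule are further design choices the abstract does not pin down.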
Submitted to TechRxiv: 10 May 2024
Published in TechRxiv: 17 May 2024