A new metric for feature selection on short text datasets

Access
info:eu-repo/semantics/closedAccess
Date
2022
Citation
Cekik, R., & Uysal, A. K. (2022). A new metric for feature selection on short text datasets. Concurrency and Computation: Practice and Experience, e6909.
Abstract
In recent years, short texts have become ubiquitous, especially on social media networks. Short text classification is an essential task for various applications that operate on short text documents. In many cases, using the entire feature set causes a high-dimensionality problem in short text data. This problem makes processing time-consuming and negatively impacts the performance of classifiers. This study presents an effective feature selection algorithm called the XY method, which represents the features on the XY line and calculates the distance of each feature to the XY line. In addition, a value named λ is calculated. According to this value, the terms are divided into different regions, such as negative, positive, and third, to determine their discrimination capability. The novel XY method aims to select as few terms as possible in the negative region. The proposed method is evaluated on four different short text datasets using the Macro-F1 success measure. Comparisons with existing feature selection algorithms such as chi-square, information gain, deviation from Poisson distribution, the recently proposed max-min ratio, and distinguishing feature selector demonstrate that the XY method achieves either better or competitive performance at various significantly reduced feature sizes.
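
For readers unfamiliar with the Macro-F1 measure named in the abstract, the following is a minimal sketch of the standard definition: the unweighted mean of the per-class F1 scores. It is not taken from the paper. The function name `macro_f1`, the toy labels, and the choice to average over every class that appears in either the true or the predicted labels are assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative sketch).

    Assumption: the class set is the union of labels seen in y_true and
    y_pred, and a class with no true or predicted instances scores 0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical labels
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the average, Macro-F1 does not let frequent classes dominate the score, which is one common reason it is used when comparing feature selection methods on imbalanced text datasets.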