Construção de evidências para classificação automática de textos

Fabio Soares Figueiredo

Since the popularization of digital documents, automatic text classification is considered an important research topic. Despite the research efforts, there is still a demand for improving the performance of classifiers. Most of the research in automatic text classification focus on the algorithmic side, but there are few efforts focused on enhancing the datasets used for training the automatic text classifiers, which is the focus of this paper. We propose a data treatment strategy, based on feature extraction, that precedes the classification task, in order to enhance documents with discriminative features of each class capable of increasing the classification effectiveness. Our strategy is based on term co-occurrences to generate new discriminative features, called compound-features (or c-features), that can be incorporated to documents to help the classification task. The idea is that, when used in conjunction with single-features, the ambiguity and noise inherent to c-features components are reduced, therefore making them more helpful to separate classes into more homogeneous partitions. However, the computational cost of feature extaction may make the method unfeasible. In this paper, we devise a set of mechanisms that make the strategy computationally feasible while improving the classifier effectiveness. We test this approach with several classification algorithms and standard text collections. Experimental results demonstrated gains in almost all evaluated scenarios, from the simplest algorithms such as k-Nearest Neighbors (kNN) (46% gain in micro average F1 in the 20 Newsgroups 18828 collection) to the most complex one, the state of the art Support Vector Machine (SVM) (10,7% gain in macro average F1 in the collection OHSUMED).

Construção de evidências para classificação automática de textos

AUTOR(ES)

DATA DE PUBLICAÇÃO

RESUMO

ASSUNTO(S)

ACESSO AO ARTIGO

Documentos Relacionados