Construção de evidências para classificação automática de textos
AUTHOR(S)
Fabio Soares Figueiredo
PUBLICATION DATE
2008
ABSTRACT
Since the popularization of digital documents, automatic text classification has been considered an important research topic. Despite these research efforts, there is still demand for improving the performance of classifiers. Most research in automatic text classification focuses on the algorithmic side, and few efforts aim at enhancing the datasets used to train the classifiers, which is the focus of this paper. We propose a data treatment strategy, based on feature extraction, that precedes the classification task and enhances documents with discriminative features of each class capable of increasing classification effectiveness. Our strategy uses term co-occurrences to generate new discriminative features, called compound-features (or c-features), that can be incorporated into documents to help the classification task. The idea is that, when used in conjunction with single-features, the ambiguity and noise inherent to the components of c-features are reduced, making them more helpful in separating classes into more homogeneous partitions. However, the computational cost of feature extraction may make the method unfeasible. In this paper, we devise a set of mechanisms that make the strategy computationally feasible while improving classifier effectiveness. We test this approach with several classification algorithms and standard text collections. Experimental results demonstrated gains in almost all evaluated scenarios, from the simplest algorithms such as k-Nearest Neighbors (kNN) (46% gain in micro-averaged F1 on the 20 Newsgroups 18828 collection) to the most complex one, the state-of-the-art Support Vector Machine (SVM) (10.7% gain in macro-averaged F1 on the OHSUMED collection).
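The idea of compound-features can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not state how c-features are selected, so the function names (`extract_c_features`, `augment`), the restriction to term pairs, and the `min_support` and `min_dominance` thresholds are all assumptions made for illustration. The sketch counts term pairs that co-occur in the same document, keeps those that appear almost exclusively in one class, and appends them to documents as extra tokens.

```python
from collections import Counter, defaultdict
from itertools import combinations


def extract_c_features(docs, labels, min_support=2, min_dominance=0.9):
    """Return term pairs that co-occur mostly within a single class.

    docs: list of token lists; labels: one class label per document.
    min_support and min_dominance are hypothetical thresholds, not
    values taken from the source.
    """
    pair_class_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        # Count each unordered pair at most once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            pair_class_counts[pair][label] += 1

    selected = {}
    for pair, counts in pair_class_counts.items():
        total = sum(counts.values())
        top_class, top_count = counts.most_common(1)[0]
        # Keep pairs that are frequent enough and dominated by one class.
        if total >= min_support and top_count / total >= min_dominance:
            selected[pair] = top_class
    return selected


def augment(tokens, c_features):
    """Append one synthetic token per selected pair present in the document."""
    present = set(tokens)
    extra = [f"{a}+{b}" for (a, b) in sorted(c_features)
             if a in present and b in present]
    return list(tokens) + extra


# Toy collection: "apple" alone is ambiguous between the two classes.
docs = [
    ["apple", "pie", "recipe"],
    ["apple", "pie", "oven"],
    ["apple", "stock", "market"],
    ["apple", "stock", "shares"],
]
labels = ["cooking", "cooking", "finance", "finance"]

c_features = extract_c_features(docs, labels)
augmented = [augment(d, c_features) for d in docs]
```

In this toy collection the single term "apple" occurs in both classes, while the pairs "apple+pie" and "apple+stock" each occur in only one, which is the kind of disambiguation the abstract describes. Enumerating every pair grows quadratically with the number of distinct terms per document, which hints at why the abstract treats computational cost as a concern.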
SUBJECT(S)
computing theses. classification theses. world wide web (information retrieval system) theses. natural language processing (computing) theses.
ARTICLE ACCESS
http://hdl.handle.net/1843/RVMR-7L3NSY
Related Documents
- OntoLP: construção semi-automática de ontologias a partir de textos da lingua portuguesa
- A STUDY OF MULTILABEL TEXT CLASSIFICATION ALGORITHMS USING NAIVE-BAYES
- Tipologia de traços linguísticos de textos do português do Brasil dos séculos XVI, XVII, XVIII e XIX: uma proposta para a classificação automática de gêneros textuais
- Uso de Seleção de Características da Wikipedia na Classificaçao Automatica de Textos