Hierarchical multi-label classification of text documents (Classificação multi-rótulo hierárquica de documentos textuais)

AUTHOR(S)
SOURCE

IBICT - Instituto Brasileiro de Informação em Ciência e Tecnologia

PUBLICATION DATE

29/07/2009

ABSTRACT

The amount of information stored in text databases is steadily increasing, and so is the demand for automated techniques to organize this data. In this context, academic and industry research has focused on automatic text classification. Most work on text classification develops techniques that assume a limited number of classes with no significant dependencies between them. In several relevant application scenarios, these assumptions do not hold. To address such problems, a new research topic, Hierarchical Multi-label Classification (HMC), has received growing attention but still represents a major challenge for the area. In HMC problems, the set of classes tends to be much larger and is therefore organized hierarchically. Classic methods, in addition to ignoring the knowledge encoded in this structure, see their performance degrade when the number of classes is too large or when the classes are interdependent. In this work we perform an extensive literature study, present MASSIFICA, a framework for the development and analysis of HMC algorithms, and propose a lazy rule-based classification algorithm suitable for HMC problems. MASSIFICA was used as a benchmark to evaluate the performance of the proposed algorithm against well-known base classifiers built on both flat and structured (top-down) architectures. We also present results from a real application scenario: the classification of companies' economic activities. Finally, we discuss the challenges involved and how different solutions respond to them. We conclude that the new algorithm, despite lower performance at the first levels of the hierarchy, performs competitively, particularly at the deeper levels, where classes are generally rare and less information is available.

SUBJECT(S)

computing theses. data mining (computing) theses. information retrieval systems theses.
