Algoritmo de aprendizado supervisionado - baseado em máquinas de vetores de suporte - uma contribuição para o reconhecimento de dados desbalanceados / Supervised Learning Algorithm - Based on Support Vector Machines - A Contribution to the Recognition of Unbalanced Data

Hugo Leonardo Pereira Rufino

Machine learning on datasets with unbalanced classes has received considerable attention in the scientific community because traditional classification algorithms do not perform satisfactorily on them. This poor performance can be explained by the fact that traditional machine learning techniques assume that each class in the dataset has approximately the same number of instances. However, most real datasets have an unbalanced class distribution, in which one class is overrepresented in comparison with the others. This gives rise to classifiers with high accuracy when predicting the majority class and low accuracy when predicting the minority class, so the minority class is effectively ignored by the classifier. This bias toward the majority class occurs because classifiers are designed to maximize accuracy on the dataset used for training. Training also assumes that unseen data follow the same distribution as the training data, which limits the classifier's ability to recognize examples of the minority class. Several improvements to traditional classification algorithms have been proposed in the literature, at the data level and at the algorithm level. Data-level approaches use various forms of resampling, such as oversampling the minority class, undersampling the majority class, or a combination of both. Algorithm-level approaches adapt existing classification algorithms to improve performance on the minority class, for example by assigning different costs to minority and majority class examples or by changing kernels. Several ensemble algorithms have also been reported as meta-techniques for working with unbalanced classes.
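The accuracy problem described above can be illustrated with a small sketch. The label counts (95 majority, 5 minority) and the always-majority "classifier" are hypothetical choices made for illustration only; they are not taken from the thesis.

```python
# Hypothetical toy labels: 95 majority-class (0) and 5 minority-class (1) examples.
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all examples that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the examples of class `label` that were predicted correctly."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in pairs) / len(pairs)


overall_accuracy = accuracy(y_true, y_pred)
majority_recall = recall(y_true, y_pred, 0)
minority_recall = recall(y_true, y_pred, 1)

print(f"accuracy:        {overall_accuracy:.2f}")
print(f"majority recall: {majority_recall:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

Under these assumptions the overall accuracy is high even though no minority example is recognized, which is why per-class measures are usually reported alongside accuracy for unbalanced data.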
This thesis studies the main algorithms that deal with unbalanced classes, highlighting their main features: the generation of new synthetic examples during oversampling, instead of random replication of data; the use of different penalties for misclassifying the minority and majority classes; and the use of ensembles so that the generated classifiers generalize better. After assessing the contribution of each algorithm, the thesis investigated whether the characteristics of each could be exploited further. The algorithm that generates new synthetic examples was modified to reduce the possibility of generating new elements in the incorrect region. Because generating synthetic elements is not enough to balance highly unbalanced datasets, a new algorithm was developed to undersample the majority class examples. To enhance the generalization ability of the generated classifier, an ensemble algorithm was also modified. Combining these three steps yielded a composite algorithm whose classification hit rate is better than that of the algorithms on which it was based.
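The two resampling steps mentioned above can be sketched in their generic form. This is not the modified algorithm developed in the thesis, whose details the abstract does not give: the sketch assumes plain interpolation between a minority example and one of its nearest minority neighbours for oversampling, and simple random selection for undersampling. The function names, the toy points, and the parameter `k` are all hypothetical.

```python
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sorted(others, key=lambda o: sq_dist(point, o))[:k]


def synthesize(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each lying on the segment between a
    randomly chosen minority example and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbor = rng.choice(nearest_neighbors(base, others, k))
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


def random_undersample(majority, n_keep, rng=None):
    """Keep n_keep majority examples chosen at random, without replacement."""
    if rng is None:
        rng = random.Random(0)
    return rng.sample(majority, n_keep)


# Hypothetical toy data: 4 minority points and 100 majority points in 2-D.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(5.0 + i % 10, 5.0 + i // 10) for i in range(100)]

new_minority = minority + synthesize(minority, n_new=16)
kept_majority = random_undersample(majority, n_keep=20)

print(len(new_minority), len(kept_majority))
```

Because each synthetic point is interpolated between two existing minority examples, it stays inside the region those examples span. Plain interpolation can still place points where the classes overlap, which is the kind of incorrect-region generation the thesis modification is meant to reduce.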
