Comparison of methodologies applied to cluster analysis in the presence of categorical and continuous variables

Renata Assis de Matos

Cluster analysis is the name given to a family of algorithms that organize objects into groups according to the proximity between them. Objects in the same group are as similar as possible to each other (internal cohesion) and as dissimilar as possible to objects in other groups (external isolation). Clustering procedures rest on two components: the proximity measure and the algorithm. Despite the wide applicability of these methods, most studies published in the literature focus on continuous variables. More recently, attention has been given to new algorithms that can incorporate the information in categorical variables. However, recent papers do not compare these new methods properly, and the range of available options makes it difficult to choose the best method. In this dissertation a comparative study is performed. Five algorithms applicable to categorical variables and three applicable to both types of variables are examined. Among the latter three, the extension of ROCK, which allows objects to be clustered using both types of variables, is a new proposal of this dissertation. In addition, the study evaluates the influence of cluster overlap, the number of groups, variables, and categories, the correlation between the continuous variables, and the choice of weights in the combined proximity measure, which is used when objects are clustered using both types of variables. The results show that as the number of groups increases, regardless of their structure, the performance of the clustering algorithms decreases. The effect of increasing the number of variables and categories depends on the internal structure of the clusters. It was also observed that the correlation between the continuous variables has no effect on the percentage of correct classification, and that the clustering methods perform better when the combined proximity measure gives more weight to the continuous variables. In terms of efficiency, the ROCK algorithm performed best in all simulation studies of this dissertation.

Keywords: cluster analysis, categorical variables, continuous variables, Average Linkage, ROCK, k-Modes, k-Prototypes, Fuzzy c-Modes, k-Populations
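
To make the idea of a weighted combined proximity measure concrete, here is a minimal sketch in Python. The abstract does not state the formula used in the dissertation, so everything below is an assumption for illustration only: the continuous part is a Euclidean distance divided by a user-supplied scale (`cont_scale`) and capped at 1, the categorical part is the simple matching dissimilarity, and the two are mixed with a single weight `w_cont`. All function and parameter names are hypothetical.

```python
import math


def simple_matching(cat_a, cat_b):
    """Fraction of categorical attributes on which the two objects differ."""
    return sum(x != y for x, y in zip(cat_a, cat_b)) / len(cat_a)


def euclidean(cont_a, cont_b):
    """Plain Euclidean distance between two continuous vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(cont_a, cont_b)))


def combined_dissimilarity(cont_a, cat_a, cont_b, cat_b, w_cont=0.5, cont_scale=1.0):
    """Weighted mix of a continuous and a categorical dissimilarity.

    w_cont     -- weight of the continuous part, in [0, 1] (assumed form)
    cont_scale -- assumed normalizing constant that brings the Euclidean
                  distance onto roughly the same [0, 1] range as the
                  categorical part
    """
    if not 0.0 <= w_cont <= 1.0:
        raise ValueError("w_cont must lie in [0, 1]")
    d_cont = min(euclidean(cont_a, cont_b) / cont_scale, 1.0)
    d_cat = simple_matching(cat_a, cat_b)
    return w_cont * d_cont + (1.0 - w_cont) * d_cat
```

Under these assumptions, raising `w_cont` toward 1 makes the clustering depend mostly on the continuous variables, which is the setting the abstract reports as giving better results. The resulting pairwise dissimilarities could then be passed to any algorithm that accepts a precomputed proximity matrix.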
