Dissimilarity functions analysis based on dynamic clustering for symbolic data

AUTHOR(S)
PUBLICATION DATE

2005

ABSTRACT

Symbolic Data Analysis (SDA) is a new domain in the area of knowledge discovery that aims to provide suitable methods for data described by multi-valued variables, where the cells of the data table hold sets of categories, intervals, or weight (probability) distributions. These new variables make it possible to represent the variability and uncertainty present in data. To extend statistical and machine learning techniques to symbolic data, it is essential to define suitable distance measures that can handle this kind of data. Several such distance measures have been proposed in the literature. However, no comparative study of their applicability to problems involving both boolean and modal symbolic data has been performed. The main contribution of this dissertation is a comparative analysis and empirical evaluation of dissimilarity functions for symbolic data, an issue that, despite its importance, has hardly been addressed in the literature. Moreover, this work introduces new dissimilarity functions that can be applied in dynamic clustering of symbolic data. Dynamic clustering algorithms aim to obtain both a single partition into a fixed number of clusters and a suitable representation, or prototype, for each cluster by locally optimizing a criterion that measures the fit between the clusters and their corresponding representations. The experiments were carried out on benchmark data sets and on two artificial interval data sets with different degrees of clustering difficulty, in order to compare the usefulness of the functions evaluated. The accuracy of the results was assessed by an external clustering index applied within an unsupervised cross-validation framework for the real data sets, and by a Monte Carlo experiment for the artificial data sets.
The results make it possible to verify the usefulness of the dissimilarity functions for the different types of symbolic data (multivalued, multivalued ordinal, interval, and modal data with the same support and with different supports), as well as to identify the best function configuration. Statistical tests provided support for the conclusions drawn.
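
The alternating scheme behind dynamic clustering, as described in the abstract, can be illustrated with a minimal sketch for interval data. The sketch is only an illustration under stated assumptions: the squared Euclidean distance on interval bounds, the mean-of-bounds prototype, and the names `interval_distance`, `prototype`, and `dynamic_clustering` are all hypothetical choices made here, not the dissimilarity functions or the implementation evaluated in the dissertation.

```python
import random


def interval_distance(x, g):
    """Squared Euclidean distance between two interval vectors.

    Each vector is a list of (lower, upper) pairs, one pair per variable.
    This distance is an assumption chosen for illustration only.
    """
    return sum((xl - gl) ** 2 + (xu - gu) ** 2
               for (xl, xu), (gl, gu) in zip(x, g))


def prototype(members):
    """Interval prototype of a cluster: per-variable mean of the lower
    bounds and mean of the upper bounds. For the squared distance above,
    this choice minimizes the summed distance to the cluster members."""
    n = len(members)
    p = len(members[0])
    return [(sum(m[j][0] for m in members) / n,
             sum(m[j][1] for m in members) / n) for j in range(p)]


def dynamic_clustering(data, k, max_iter=100, seed=0):
    """Alternate allocation and representation steps until the partition
    stops changing; return (labels, prototypes, criterion)."""
    rng = random.Random(seed)
    prototypes = rng.sample(data, k)  # initial prototypes drawn from the data
    labels = [None] * len(data)
    for _ in range(max_iter):
        # Allocation step: assign each object to its closest prototype.
        new_labels = [min(range(k),
                          key=lambda c: interval_distance(x, prototypes[c]))
                      for x in data]
        if new_labels == labels:
            break  # partition is stable: a local optimum has been reached
        labels = new_labels
        # Representation step: recompute the prototype of each cluster.
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:  # keep the old prototype if a cluster is empty
                prototypes[c] = prototype(members)
    # Adequacy criterion: total distance between objects and prototypes.
    criterion = sum(interval_distance(x, prototypes[lab])
                    for x, lab in zip(data, labels))
    return labels, prototypes, criterion


# Toy example: two well-separated groups of two-variable interval objects.
data = [
    [(0.0, 1.0), (0.0, 1.0)],
    [(0.1, 1.2), (0.0, 0.9)],
    [(10.0, 11.0), (10.0, 12.0)],
    [(10.5, 11.5), (9.8, 12.1)],
]
labels, prototypes, criterion = dynamic_clustering(data, k=2)
```

Because each step can only decrease the criterion or leave it unchanged, the procedure converges to a local optimum that depends on the initial prototypes, which is why such algorithms are usually run from several random initializations.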

SUBJECT(S)

computer science; symbolic data analysis; dynamic clustering; dissimilarity functions
