A FRAMEWORK FOR THE CONSTRUCTION OF MEDIATORS OFFERING DEDUPLICATION / UM FRAMEWORK PARA A CONSTRUÇÃO DE MEDIADORES OFERECENDO ELIMINAÇÃO DE DUPLICATAS

AUTOR(ES)
DATA DE PUBLICAÇÃO

2010

RESUMO

As Web applications that obtain data from different sources (Mashups) grow in importance, timely solutions to the duplicate detection problem become central. Most existing techniques, however, are based on machine learning algorithms, that heavily rely on the use of relevant, manually labeled, training datasets. Such solutions are not adequate when talking about data sources on the Deep Web, as there is often little information regarding the size, volatility and hardly any access to relevant samples to be used for training. In this thesis we propose a strategy to aid in the extraction (scraping), duplicate detection and integration of data that resulted from querying Deep Web resources. Our approach does not require the use of pre-defined training sets , but rather uses a combination of a Vector Space Model classifier with similarity functions, in order to provide a viable solution. To illustrate our approach, we present a case study where the proposed framework was instantiated for an application in the wine industry domain.

ASSUNTO(S)

web web integracao de dados framework data integration framework

Documentos Relacionados