MINERAÇÃO DE TEXTOS NA COLETA INTELIGENTE DE DADOS NA WEB / TEXT MINING AT THE INTELLIGENT WEB CRAWLING PROCESS

AUTOR(ES)
DATA DE PUBLICAÇÃO

2008

RESUMO

This dissertation presents a study about the application of Text Mining as part of the intelligent Web crawling process. The most usual way of gathering data in Web consists of the utilization of web crawlers. Web crawlers are softwares that, once provided with an initial set of URLs (seeds), start the methodical proceeding of visiting a site, store it in disk and extract its hyperlinks that will be used for the next visits. But seeking for content in this way is an expensive and exhausting task. An intelligent web crawling process, more than collecting and storing any web document available, analyses its available crawling possibilities for finding links that, probably, will provide high relevant content to a topic defined a priori. In the approach suggested in this work, topics are not defined by words, but rather by the employment of text documents as examples. Next, pre-processing techniques used in Text Mining, including the use of a Thesaurus, analyze semantically the document submitted as example. Based on this analysis, the web crawler thus constructed will be guided toward its objective: retrieve relevant information to the document. Starting from seeds or querying through available search engines, the crawler analyzes, exactly as in the previous step, every document retrieved in Web. the similarity level between them is obtained, the retrieved document`s hyperlinks are analysed, queued and, later, will be dequeued according to each one`s probable degree of importance. By the end of the gathering data process, another Text Mining technique is applied, with the propose of selecting the most representative document among the collected texts: Document Clustering. The implementation of a tool incorporating all the researched heuristics allowed to achieve results, making possible to evaluate the performance of the developed techniques and compare all obtained results with others means of retrieving data in Web. The present work shows that the use of Text Mining is a track worthy to be exploited in the process of retrieving relevant information in Web.

ASSUNTO(S)

data mining coleta de dados data retrieval web crawling mineracao de dados recuperacao de informacao web crawling information retrieval

Documentos Relacionados