MINERAÇÃO DE TEXTOS NA COLETA INTELIGENTE DE DADOS NA WEB / TEXT MINING AT THE INTELLIGENT WEB CRAWLING PROCESS
AUTHOR(S)
FABIO DE AZEVEDO SOARES
PUBLICATION DATE
2008
ABSTRACT
This dissertation studies the application of Text Mining as part of the intelligent Web crawling process. The most common way of gathering data on the Web is to use web crawlers. A web crawler is a program that, given an initial set of URLs (seeds), methodically visits each site, stores it on disk, and extracts its hyperlinks, which are used for the next visits. Searching for content this way, however, is expensive and exhausting. Rather than collecting and storing every available web document, an intelligent web crawling process analyzes the available crawling options to find links that are likely to lead to content highly relevant to a topic defined a priori. In the approach proposed in this work, topics are defined not by keywords but by text documents supplied as examples. Pre-processing techniques from Text Mining, including the use of a thesaurus, then analyze the example document semantically. Based on this analysis, the resulting web crawler is guided toward its objective: retrieving information relevant to the example document. Starting from seeds or from queries to available search engines, the crawler analyzes every document retrieved from the Web in exactly the same way as the example document. The similarity between the two is computed, and the retrieved document's hyperlinks are analyzed and queued, to be dequeued later according to each one's probable degree of importance. At the end of the data gathering process, another Text Mining technique, Document Clustering, is applied to select the most representative document among the collected texts. A tool implementing all the heuristics studied made it possible to evaluate the performance of the developed techniques and to compare the results with those of other means of retrieving data from the Web.
The present work shows that the use of Text Mining is a path worth exploring in the process of retrieving relevant information from the Web.
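The crawling loop described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the bag-of-words cosine similarity stands in for the thesaurus-based semantic analysis, the rule that a link inherits the similarity score of the page it was found on is only one plausible ordering heuristic, and the names `term_vector`, `focused_crawl`, `fetch`, and `TOY_WEB` are hypothetical. The final Document Clustering step is omitted, and a small in-memory dictionary replaces real HTTP requests.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term frequencies (no thesaurus or stemming)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def focused_crawl(example_text, seeds, fetch, max_pages=10):
    """Best-first crawl guided by similarity to an example document.

    fetch(url) is a hypothetical callable returning (text, links) or None.
    """
    target = term_vector(example_text)
    # heapq is a min-heap, so priorities are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    results = []
    while frontier and len(results) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        score = cosine_similarity(target, term_vector(text))
        results.append((url, score))
        # Assumed heuristic: a link is as promising as the page that contains it.
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy in-memory "web" standing in for real pages: url -> (text, outgoing links).
TOY_WEB = {
    "a": ("text mining and web crawling", ["b", "c"]),
    "b": ("cooking recipes for pasta", ["d"]),
    "c": ("text mining for intelligent web crawling", ["e"]),
    "d": ("more pasta recipes", []),
    "e": ("web crawling with text mining techniques", []),
}

if __name__ == "__main__":
    for url, score in focused_crawl("text mining web crawling", ["a"], TOY_WEB.get, max_pages=4):
        print(url, round(score, 3))
```

With a budget of four pages, the link found on the on-topic page `c` is dequeued before the link found on the off-topic page `b`, so the crawl reaches `e` and never spends a visit on `d`.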
SUBJECT(S)
data mining; data collection; data retrieval; web crawling; information retrieval
Related Documents
- DESENVOLVIMENTO DE UMA METODOLOGIA PARA MINERAÇÃO DE TEXTOS
- Análise de dados por meio de agrupamento fuzzy semi-supervisionado e mineração de textos
- Instrumentação inteligente via web services.
- SISTEMAS INTELIGENTES PARA TEXTOS DA WEB
- Avaliação de métodos não-supervisionados de seleção de atributos para mineração de textos