Semi-Automated Extraction of Targeted Data from Web Pages
Atlanta, Georgia April 03-April 07
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDEW.2006.135
22nd International Conference on Data ...
Fabrice Estievenart, CETIC, Belgium
Jean-Roch Meurisse, University of Namur, Belgium
Jean-Luc Hainaut, University of Namur, Belgium
Philippe Thiran, University of Namur, Belgium
The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, while the main publication standard on the Web (HTML) is well suited to human reading, its poor semantics make it difficult for computers to process and use embedded data in a smart and automated way.

In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents, together with the location of that information in those documents. Relying on these rules, HTML-embedded data can be extracted into a more computable format. The definition of mapping rules is based on direct user input, mainly for the interpretation part, and on automatic computation for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.
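
To make the idea of a mapping rule more concrete, the following is a minimal sketch in Python, not Retrozilla's actual rule format. It assumes that a rule can be reduced to a user-supplied concept name plus a path locating the data in the document tree. The names `MappingRule`, `extract`, `PAGE` and `RULES` are hypothetical, and the sample page is assumed to be well-formed XHTML so that the standard-library XML parser can read it; real-world HTML would typically need a more tolerant parser.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MappingRule:
    """Hypothetical rule: a semantic label plus a location in the tree."""
    concept: str  # interpretation, supplied by the user
    path: str     # location, in ElementTree's limited XPath syntax


def extract(page: str, rules: List[MappingRule]) -> Dict[str, List[str]]:
    """Apply every rule to one page and collect the text it locates."""
    root = ET.fromstring(page)  # assumes well-formed XHTML
    return {
        rule.concept: [(node.text or "").strip() for node in root.findall(rule.path)]
        for rule in rules
    }


# Illustrative page from an imagined cluster of similar product pages.
PAGE = """<html><body>
<h1>Blue Kettle</h1>
<div class="offer"><span class="price">19.90</span></div>
</body></html>"""

RULES = [
    MappingRule("product_name", "./body/h1"),
    MappingRule("price", "./body/div/span[@class='price']"),
]

record = extract(PAGE, RULES)
print(record)
```

Because the rules are kept separate from the page, the same rule set could in principle be reused on every page of the cluster that shares this layout, which is the property the abstract relies on.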

Citation:
Fabrice Estievenart, Jean-Roch Meurisse, Jean-Luc Hainaut, Philippe Thiran, "Semi-Automated Extraction of Targeted Data from Web Pages," icdew, pp.48, 22nd International Conference on Data Engineering Workshops (ICDEW'06), 2006