ClusTex: Information Extraction from HTML Pages
Niagara Falls, Ontario, Canada May 21-May 23
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/AINAW.2007.119
21st International Conference on Adva ...
Fatima Ashraf, University of Calgary, Canada
Reda Alhajj, University of Calgary, Canada; Global University, Lebanon
This paper proposes ClusTex, a system that employs clustering techniques for automatic information extraction from HTML documents containing semi-structured data. Using domain-specific information provided by the user, ClusTex parses and tokenizes the data from an HTML document, partitions it into clusters containing similar elements, and estimates an extraction rule based on the pattern of occurrence of data tokens. The extraction rule is then used to refine the clusters, and finally the output is reported. To demonstrate its effectiveness, the approach is tested through experiments on the University of Calgary web site; the results prove comparable to those reported in the literature.
Index Terms:
information extraction, clustering, web pages, HTML documents.
Citation:
Fatima Ashraf, Reda Alhajj, "ClusTex: Information Extraction from HTML Pages," ainaw, vol. 1, pp. 355-360, 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07), 2007
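
Illustrative sketch:
The pipeline outlined in the abstract (parse and tokenize, cluster, estimate an extraction rule, refine) can be sketched roughly as follows. This is not the authors' implementation: the abstract does not specify the clustering method or the form of the rule, so the sketch assumes that the user's domain-specific information is a set of regular expressions, that a cluster is simply the set of tokens matching one expression, and that the rule is the most frequent run of cluster labels of a fixed length. All function names, the patterns, and the sample page are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextTokenizer(HTMLParser):
    """Collects the non-empty text nodes of an HTML document as tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(text)


def tokenize(html):
    parser = TextTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def cluster(tokens, domain_patterns):
    """Label each token with the first pattern it fully matches, else 'other'.

    Assumption: regex matching stands in for the paper's clustering step.
    """
    labels = []
    for tok in tokens:
        for label, pattern in domain_patterns.items():
            if re.fullmatch(pattern, tok):
                labels.append(label)
                break
        else:
            labels.append("other")
    return labels


def estimate_rule(labels, length):
    """Return the most frequent run of `length` labels that contains no 'other'."""
    windows = Counter(
        tuple(labels[i:i + length])
        for i in range(len(labels) - length + 1)
        if "other" not in labels[i:i + length]
    )
    return windows.most_common(1)[0][0] if windows else ()


def extract(tokens, labels, rule):
    """Refinement: keep only token runs whose label sequence matches the rule."""
    n = len(rule)
    records = []
    i = 0
    while n and i <= len(tokens) - n:
        if tuple(labels[i:i + n]) == rule:
            records.append(dict(zip(rule, tokens[i:i + n])))
            i += n
        else:
            i += 1
    return records


# Made-up sample page and user-supplied patterns (hypothetical data).
PAGE = """
<html><body><h1>Course list</h1>
<table>
<tr><td>CPSC 231</td><td>Introduction to Computer Science I</td><td>Fall</td></tr>
<tr><td>CPSC 331</td><td>Data Structures and Algorithms</td><td>Winter</td></tr>
</table><p>Contact the department office.</p></body></html>
"""
PATTERNS = {
    "code": r"[A-Z]{4} \d{3}",
    "term": r"Fall|Winter|Spring|Summer",
    "title": r"[A-Z][A-Za-z ,&]+",
}

tokens = tokenize(PAGE)
labels = cluster(tokens, PATTERNS)
rule = estimate_rule(labels, length=3)
records = extract(tokens, labels, rule)
print(rule)
for record in records:
    print(record)
```

On this sample the estimated rule is the label sequence code, title, term, and two records are extracted. The heading "Course list" initially lands in the title cluster but is dropped during refinement because it does not occur in the position the rule expects, which is the kind of cleanup the refinement step is meant to perform.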