Webpage Genre Identification Using Variable-Length Character n-Grams
Paris, France, October 29-31, 2007
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICTAI.2007.10719th IEEE International Conference on ...
An important factor for discriminating between webpages is their genre (e.g., blogs, personal homepages, e-shops, online newspapers, etc.). Webpage genre identification has great potential in information retrieval, since users of search engines can combine genre-based and traditional topic-based queries to improve the quality of the results. So far, various features have been proposed to quantify the style of webpages, including word and HTML-tag frequencies. In this paper, we propose a low-level representation for this problem based on character n-grams. Using an existing approach, we produce feature sets of variable-length character n-grams and combine this representation with information about the most frequent HTML tags. Based on two benchmark corpora, we present webpage genre identification experiments and improve the best reported results in both cases.
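To make the representation concrete, the following is a minimal sketch of counting character n-grams of several lengths alongside HTML tag frequencies. It is an illustration under stated assumptions, not the authors' implementation: the function names, the n-gram lengths (3 to 5), and the regex-based tag handling are all hypothetical, and the sketch simply keeps every n-gram rather than applying the feature selection approach the abstract refers to.

```python
import re
from collections import Counter

# Assumption: a simple regex is enough to spot opening tags in this sketch.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def char_ngrams(text, n_values=(3, 4, 5)):
    """Count every character n-gram for each length in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def tag_counts(html):
    """Count opening HTML tags by lowercase tag name."""
    return Counter(name.lower() for name in TAG_RE.findall(html))


def page_features(html):
    """Combine n-gram counts of the visible text with tag counts.

    Assumption: markup is stripped with a naive regex before n-gram
    extraction; a real pipeline would likely use an HTML parser.
    """
    text = re.sub(r"<[^>]+>", " ", html)
    return {"ngrams": char_ngrams(text), "tags": tag_counts(html)}


if __name__ == "__main__":
    feats = page_features("<html><body><p>my blog</p></body></html>")
    print(feats["tags"].most_common(3))
    print(feats["ngrams"].most_common(5))
```

In practice the two count dictionaries would be truncated to the most frequent entries and normalized before being passed to a classifier; those steps are omitted here because the abstract does not specify them.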
Citation:
Ioannis Kanaris, Efstathios Stamatatos, "Webpage Genre Identification Using Variable-Length Character n-Grams," ICTAI, vol. 2, pp. 3-10, 19th IEEE International Conference on Tools with Artificial Intelligence - Vol. 2 (ICTAI 2007), 2007