This paper proposes a learning approach for discovering the semantic structure of web pages. The task consists of partitioning the text on a web page into information blocks and identifying the semantic category of each block. We employed two machine learning techniques, AdaBoost and SVMs, to learn from a labeled web page corpus. We evaluated our approach on general web pages from the World Wide Web and obtained encouraging results. This work can benefit a number of web-driven applications, such as search engines, web-based question answering, web-based data mining, and voice-enabled web navigation.
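To make the block-classification idea concrete, the following is a minimal sketch of AdaBoost with decision stumps applied to toy text blocks. It is not the authors' implementation: the two features (word count and link ratio), the binary "navigation" versus "content" labels, the toy data, and all function names are assumptions chosen for illustration, and the abstract does not say which features or categories the paper uses.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature_index, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # Predict `sign` when the feature is at or above the threshold.
                preds = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def adaboost(X, y, rounds=5):
    """Train a weighted ensemble of stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the log below stays finite on separable data.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified blocks, then renormalize.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy corpus: (word_count, link_ratio) per text block.
# Label +1 = navigation block, -1 = content block (assumed categories).
X = [(5, 0.9), (8, 0.8), (6, 0.7), (120, 0.1), (200, 0.05), (90, 0.2)]
y = [1, 1, 1, -1, -1, -1]

ensemble = adaboost(X, y)
print(predict(ensemble, (7, 0.85)))   # short, link-heavy block
print(predict(ensemble, (150, 0.1)))  # long, text-heavy block
```

The task described above is multi-class, since each block receives one of several semantic categories; a sketch like this one would need a one-vs-rest scheme or a multi-class boosting variant to cover that case.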
Citation:
Junlan Feng, Patrick Haffner, Mazin Gilbert, "A Learning Approach to Discovering Web Page Semantic Structures," Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pp. 1055-1059, 2005.