Towards a Canonical and Structured Representation of PDF Documents through Reverse Engineering
Seoul, Korea, August 31-September 01
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDAR.2005.242
Eighth International Conference on Do ...
Maurizio Rigamonti, DIVA group, University of Fribourg, Switzerland
Jean-Luc Bloechle, DIVA group, University of Fribourg, Switzerland
Karim Hadjar, DIVA group, University of Fribourg, Switzerland
Denis Lalanne, DIVA group, University of Fribourg, Switzerland
Rolf Ingold, DIVA group, University of Fribourg, Switzerland
This article presents Xed, a reverse engineering tool for PDF documents that extracts the original document layout structure. Xed combines electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in a hierarchical canonical form, i.e. a form that is universal and independent of the document type. The article first reviews the major traps and tricks of the PDF format. It then introduces the architecture of Xed and its main modules, in particular the algorithm that extracts the document's physical structure. A canonical format is then proposed and discussed with an example. Finally, the results of a practical evaluation are presented, followed by an outline of future work on logical structure extraction.
Citation:
Maurizio Rigamonti, Jean-Luc Bloechle, Karim Hadjar, Denis Lalanne, Rolf Ingold, "Towards a Canonical and Structured Representation of PDF Documents through Reverse Engineering," Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pp. 1050-1055, 2005