Evaluating OCR and Non-OCR Text Representations for Learning Document Classifiers
Ulm, Germany, August 18-August 20
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDAR.1997.620671
Fourth International Conference Docum ...
Markus Junker, German Research Center for Artificial Intelligence GmbH, Germany
Rainer Hoch, SAP AG, Basis Systems and Services, Germany
In the literature, many feature types and learning algorithms have been proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been carried out. To investigate different text representations for document classification, we have developed a tool that transforms documents into feature-value representations suitable for standard learning algorithms. In this paper we investigate seven document representations for German texts based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that n-grams are an attractive technique, competitive even with techniques that rely on morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
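As an illustration of why n-gram features might tolerate OCR noise better than whole-word features, the following is a minimal sketch, not the authors' tool. The function names, the choice of character trigrams, the `_` boundary padding, and the sample OCR error are all assumptions made for this example; the abstract does not specify the seven representations it evaluates.

```python
from collections import Counter
import re


def word_features(text: str) -> Counter:
    """Feature-value representation based on single (lowercased) words."""
    return Counter(re.findall(r"\w+", text.lower()))


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Feature-value representation based on character n-grams per word."""
    counts = Counter()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"_{word}_"  # boundary markers: an assumption of this sketch
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def overlap(a: Counter, b: Counter) -> float:
    """Fraction of the features in `a` that also occur in `b`."""
    total = sum(a.values())
    return sum((a & b).values()) / total if total else 0.0


clean = "Rechnung"  # German for "invoice"
ocr = "Rechnnng"    # hypothetical OCR error: "u" misread as "n"

# The misrecognized word no longer matches as a whole word ...
print(overlap(word_features(clean), word_features(ocr)))  # 0.0
# ... but most of its character trigrams survive the error.
print(overlap(char_ngrams(clean), char_ngrams(ocr)))      # 0.625
```

Either `Counter` can be passed to a standard learning algorithm as a sparse feature vector. A single character error destroys the word feature entirely but only the trigrams that span the damaged position.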
Citation:
Markus Junker, Rainer Hoch, "Evaluating OCR and Non-OCR Text Representations for Learning Document Classifiers," icdar, pp.1060, Fourth International Conference Document Analysis and Recognition (ICDAR'97), 1997