A Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques
Seoul, Korea August 31-September 01
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDAR.2005.6
Eighth International Conference on Do ...
Stoyan Mihov, IPP - Bulgarian Academy of Sciences, Sofia
Klaus U. Schulz, CIS, University of Munich
Christoph Ringlstetter, CIS, University of Munich
Veselka Dojchinova, IPP - Bulgarian Academy of Sciences, Sofia
Vanja Nakova, IPP - Bulgarian Academy of Sciences, Sofia
We describe a new corpus collected for the comparative evaluation of OCR software and postcorrection techniques. The corpus is freely available to academic groups and for academic use. The major part of the corpus (2306 files) consists of Bulgarian documents, many of which contain both Cyrillic and Latin symbols. A smaller corpus of German documents has been added. All original documents are real-life paper documents collected from enterprises and organizations. Most genres of written language and various document types are covered. The corpus contains the corresponding image files, rich meta-data, text files obtained via OCR, ground truth data for hundreds of example pages, and alignment software for experiments.
Index Terms:
Optical character recognition, postcorrection of OCR results, public corpora, comparative evaluation, ground truth data, Cyrillic documents, mixed-alphabet documents, meta-data.
Citation:
Stoyan Mihov, Klaus U. Schulz, Christoph Ringlstetter, Veselka Dojchinova, Vanja Nakova, "A Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques," Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pp. 162-166, 2005