OCR with No Shape Training
Barcelona, Spain September 03-September 08
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICPR.2000.902858
15th International Conference on Patt ...
Tin Kam Ho, Bell Labs, Lucent Technologies
George Nagy, Rensselaer Polytechnic Institute
We present a document-specific OCR system and apply it to a corpus of faxed business letters. Unsupervised classification of the segmented character bitmaps on each page, using a “clump” metric, typically yields several hundred clusters with highly skewed populations. Letter identities are assigned to each cluster by maximizing matches with a lexicon of English words. We found that for two-thirds of the pages, we can identify almost 80% of the words included in the lexicon without any shape training. Residual errors are caused by mis-segmentation, including missed lines and punctuation. This research differs from earlier attempts to apply cipher decoding to OCR in (1) using real data, (2) using a more appropriate clustering algorithm, and (3) decoding a many-to-many instead of a one-to-one mapping between clusters and letters.
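The idea of assigning letters to clusters from a lexicon alone can be illustrated with a small sketch. It is not the authors' decoding procedure: it swaps in plain constraint propagation for their match maximization, assumes perfect word segmentation, and handles only the simpler case where several clusters may share one letter (many-to-one), not the many-to-many mapping described above. The names `compatible` and `decode`, the toy lexicon, and the cluster IDs are all hypothetical.

```python
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def compatible(cluster_word, lex_word, allowed):
    """Return True if lex_word is a possible reading of cluster_word.

    cluster_word is a tuple of cluster IDs, one per character bitmap.
    A cluster must always read as the same letter, but two different
    clusters may read as the same letter (many-to-one assumption).
    """
    if len(cluster_word) != len(lex_word):
        return False
    seen = {}
    for cid, ch in zip(cluster_word, lex_word):
        if ch not in allowed[cid]:
            return False
        if seen.setdefault(cid, ch) != ch:
            return False
    return True

def decode(cluster_words, lexicon):
    """Narrow each cluster's candidate letters until nothing changes."""
    allowed = defaultdict(lambda: set(ALPHABET))
    changed = True
    while changed:
        changed = False
        for cw in cluster_words:
            matches = [w for w in lexicon if compatible(cw, w, allowed)]
            if not matches:
                continue  # assumed out-of-lexicon or mis-segmented; skip
            for pos, cid in enumerate(cw):
                narrowed = allowed[cid] & {w[pos] for w in matches}
                if narrowed != allowed[cid]:
                    allowed[cid] = narrowed
                    changed = True
    # Report only the clusters whose letter is now unambiguous.
    return {cid: next(iter(s)) for cid, s in allowed.items() if len(s) == 1}

# Toy page: clusters 1 and 6 are two different shapes of the letter "t".
lexicon = ["the", "that", "cat", "hat", "act", "at"]
cluster_words = [(1, 2, 3), (1, 2, 4, 6), (5, 4, 1), (2, 4, 6), (4, 6), (4, 5, 1)]
mapping = decode(cluster_words, lexicon)
print(sorted(mapping.items()))
```

On this toy input every cluster resolves to a single letter, with clusters 1 and 6 both read as "t". On real pages, clusters that remain ambiguous are simply left out of the returned mapping.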
Citation:
Tin Kam Ho, George Nagy, "OCR with No Shape Training," icpr, vol. 4, pp. 4027, 15th International Conference on Pattern Recognition (ICPR'00) - Volume 4, 2000