Categorization of imaged documents is a useful technique for building digital libraries based on document images. This paper investigates techniques for improving categorization accuracy on OCR text, particularly that of imaged biomedical documents. Experiments with different feature selection methods were run to explore their effect on categorization performance. The results show that document frequency is a good feature selection method for eliminating OCR errors. Furthermore, our categorization scheme, IMP, which combines OCR text and electronic abstracts, shows consistent improvement in accuracy compared to categorizing on either abstracts or OCR text alone.
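As a rough illustration of why document-frequency selection can filter OCR noise: a misrecognized token tends to occur in very few documents, so a minimum document-frequency threshold drops it. The Python sketch below is not the authors' implementation; the function name `select_by_document_frequency`, the threshold `min_df`, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def select_by_document_frequency(docs, min_df=2):
    """Keep terms that occur in at least `min_df` documents.

    `docs` is a list of token lists. The threshold value is a
    hypothetical choice, not one taken from the paper.
    """
    df = Counter()
    for tokens in docs:
        # Count each term at most once per document.
        df.update(set(tokens))
    return {term for term, count in df.items() if count >= min_df}


# Toy corpus (assumed data): the second document simulates OCR errors.
docs = [
    "protein binding in the cell".split(),
    "protein bindlng in the ceII".split(),
    "cell membrane protein".split(),
]

vocabulary = select_by_document_frequency(docs, min_df=2)
print(sorted(vocabulary))
```

In this toy corpus the misrecognized tokens `bindlng` and `ceII` each appear in a single document and are removed. The threshold also removes correctly spelled but rare terms such as `binding`, so in practice it would need tuning against categorization accuracy.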
Citation:
Linlin Li, Chew Lim Tan, "Improving OCR Text Categorization Accuracy with Electronic Abstracts," Second International Conference on Document Image Analysis for Libraries (DIAL'06), pp. 82-87, 2006.