D. Doermann, Institute for Advanced Computer Studies, University of Maryland, College Park
H. Li, Institute for Advanced Computer Studies, University of Maryland, College Park
O. Kia, Institute for Advanced Computer Studies, University of Maryland, College Park
In this paper we propose and implement a method for detecting duplicate documents in very large image databases. The method is based on a robust "signature" extracted from each document image, which is used to index into a table of previously processed documents. The approach has a number of advantages over OCR and other recognition-based methods, including speed and robustness to imaging distortions. To justify the approach and test its scalability, we have developed a simulator that allows us to change system parameters and examine performance for millions of document signatures. A complete system is implemented and evaluated on a collection of technical articles and memos.
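The idea of using a signature to index into a table of previously processed documents can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the signature is computed or matched, so the sketch assumes a signature is already available as a string of symbols, and it substitutes a generic technique (an inverted index over character n-grams with a Jaccard overlap threshold) so that a few corrupted symbols do not prevent a match. The names `SignatureIndex`, `ngrams`, `min_overlap`, and the document ids are all hypothetical.

```python
from collections import defaultdict


def ngrams(signature, n=3):
    """Return the set of overlapping n-grams of a signature string."""
    return {signature[i:i + n] for i in range(len(signature) - n + 1)}


class SignatureIndex:
    """Hypothetical table of document signatures, keyed by n-gram."""

    def __init__(self, n=3, min_overlap=0.6):
        self.n = n
        self.min_overlap = min_overlap   # assumed Jaccard threshold
        self.table = defaultdict(set)    # n-gram -> ids of documents containing it
        self.sizes = {}                  # doc id -> number of distinct n-grams

    def add(self, doc_id, signature):
        """Register a processed document under each n-gram of its signature."""
        grams = ngrams(signature, self.n)
        self.sizes[doc_id] = len(grams)
        for gram in grams:
            self.table[gram].add(doc_id)

    def find_duplicates(self, signature):
        """Return ids of stored documents whose n-gram overlap passes the threshold."""
        grams = ngrams(signature, self.n)
        votes = defaultdict(int)
        for gram in grams:
            for doc_id in self.table.get(gram, ()):
                votes[doc_id] += 1
        matches = []
        for doc_id, shared in votes.items():
            union = len(grams) + self.sizes[doc_id] - shared
            if union and shared / union >= self.min_overlap:
                matches.append(doc_id)
        return sorted(matches)


index = SignatureIndex()
index.add("memo-1", "abcdefghijklmnop")
index.add("article-7", "zzzyyyxxxwwwvvvu")

# The query differs from memo-1 in one symbol, standing in for imaging noise.
print(index.find_duplicates("abcdefghijxlmnop"))  # ['memo-1']
```

Because each lookup touches only the table entries for the query's own n-grams, the cost depends on how many documents share those n-grams rather than on the total size of the collection, which is the property that makes a table-based approach attractive at scale.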
Citation:
D. Doermann, H. Li, O. Kia, "The Detection of Duplicates in Document Image Databases," Fourth International Conference on Document Analysis and Recognition (ICDAR'97), p. 314, 1997