In this paper we describe a method for expanding training sets composed of XY trees that represent page layout. The approach is appropriate for page classification based on MXY tree page representations. The basic idea is to use tree grammars to model the variations in the tree that are caused by segmentation algorithms. A set of general grammatical rules is defined and used to expand the training set. Pages are classified with a k-NN approach in which the distance between pages is computed by means of tree-edit distance.
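The pipeline outlined above (expand the training set with grammar-like rewriting rules, then classify by nearest neighbours under a tree distance) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the node labels ("H" and "V" for horizontal and vertical cuts, "T" for text, "I" for image), the single rewriting rule (a text leaf may be over-segmented into a horizontal cut with two text leaves), the helper names, and the distance itself, which is a simplified top-down (Selkow-style) tree-edit distance rather than whichever tree-edit distance the authors use.

```python
from collections import Counter
from functools import lru_cache

# A tree is a nested tuple: (label, (child, child, ...)).
# Labels are hypothetical: "H"/"V" = cuts, "T" = text, "I" = image.
def node(label, *children):
    return (label, tuple(children))

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in tree[1])

@lru_cache(maxsize=None)
def tree_distance(a, b):
    """Simplified top-down (Selkow-style) edit distance, unit costs."""
    relabel = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    # Sequence edit distance over the two child lists; inserting or
    # deleting a child costs the size of its whole subtree.
    d = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, len(cb) + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),
                d[i][j - 1] + size(cb[j - 1]),
                d[i - 1][j - 1] + tree_distance(ca[i - 1], cb[j - 1]),
            )
    return relabel + d[len(ca)][len(cb)]

def expand(tree):
    """Yield variants of `tree` with the hypothetical rule
    T -> H(T, T) applied at exactly one text leaf."""
    label, children = tree
    if not children and label == "T":
        yield node("H", node("T"), node("T"))
    for i, child in enumerate(children):
        for variant in expand(child):
            yield (label, children[:i] + (variant,) + children[i + 1:])

def knn_classify(query, training, k=3):
    """Majority vote among the k training trees closest to `query`."""
    nearest = sorted(training, key=lambda item: tree_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set with one labelled tree per class (made-up data).
letter = node("V", node("T"), node("T"))
form = node("H", node("T"), node("I"))
training = [(letter, "letter"), (form, "form")]

# Training set expansion: add every one-step variant with its class label.
expanded = training + [(v, lbl) for t, lbl in training for v in expand(t)]

# A page whose first text block was over-segmented.
query = node("V", node("H", node("T"), node("T")), node("T"))
predicted = knn_classify(query, expanded, k=1)
```

In this toy setting the over-segmented query coincides with one of the generated variants, which is the intuition behind expanding the training set: variations that segmentation is likely to produce are already present among the labelled examples before classification.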
Citation:
Stefano Baldi, Simone Marinai, Giovanni Soda, "Using tree-grammars for training set expansion in page classification," Seventh International Conference on Document Analysis and Recognition (ICDAR'03), Volume 2, p. 829, 2003.