Language identification of on-line documents using word shapes
Ulm, GERMANY August 18-August 20
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDAR.1997.619852
Fourth International Conference Docum ...
N. Nobile, Centre for Pattern Recognition & Machine Intelligence, Concordia Univ., Montreal, Que., Canada
S. Bergler, Centre for Pattern Recognition & Machine Intelligence, Concordia Univ., Montreal, Que., Canada
C.Y. Suen, Centre for Pattern Recognition & Machine Intelligence, Concordia Univ., Montreal, Que., Canada
S. Khoury, Centre for Pattern Recognition & Machine Intelligence, Concordia Univ., Montreal, Que., Canada
The authors have extended existing methods to identify the language of an on-line document after the characters have been coded using 10 character classes based on visual characteristics. In particular, they exploit word bigrams and trigrams in both a linear combination of score values and an expert system approach. Knowledge about each language was acquired from a large number of on-line texts. Using a small set of rules, the expert system outperforms the linear combination in accuracy and shows more stability when parameter settings are varied.
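As a rough illustration of the linear-combination idea described above, the following Python sketch codes each word by the visual shape of its characters, collects bigrams and trigrams of the coded words, and combines the two match rates with fixed weights. It is only a sketch under stated assumptions: the five shape classes, the function names (shape_code, train, score, identify), the match-rate scoring, and the weights w2 and w3 are all hypothetical, since the abstract does not list the paper's 10 character classes or its score formula, and the expert system is not modelled at all.

```python
from collections import Counter

# Hypothetical, simplified shape classes. The paper uses 10 classes that the
# abstract does not enumerate, so this mapping is an assumption.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")
DOTTED = set("ij")


def shape_code(word):
    """Map each character of a word to a coarse visual shape class."""
    out = []
    for ch in word:
        if ch.isupper() or ch in ASCENDERS:
            out.append("A")   # tall: capitals and ascenders
        elif ch in DOTTED:
            out.append("i")   # dotted letters
        elif ch in DESCENDERS:
            out.append("g")   # descenders
        elif ch.isalpha():
            out.append("x")   # x-height letters
        else:
            out.append("?")   # digits, punctuation, anything else
    return "".join(out)


def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train(text):
    """Build a language profile: counts of shape-coded word bigrams and trigrams."""
    codes = [shape_code(w) for w in text.split()]
    return {n: Counter(word_ngrams(codes, n)) for n in (2, 3)}


def score(profile, text, w2=0.5, w3=0.5):
    """Linear combination of bigram and trigram match rates (assumed scoring)."""
    codes = [shape_code(w) for w in text.split()]
    rates = {}
    for n in (2, 3):
        grams = word_ngrams(codes, n)
        hits = sum(1 for g in grams if g in profile[n])
        rates[n] = hits / len(grams) if grams else 0.0
    return w2 * rates[2] + w3 * rates[3]


def identify(profiles, text):
    """Pick the language whose profile gives the highest combined score."""
    return max(profiles, key=lambda lang: score(profiles[lang], text))
```

A profile would be trained per language from a large body of text, for example profiles = {"en": train(english_text), "fr": train(french_text)}, after which identify(profiles, document) returns the best-scoring language key. Varying w2 and w3 corresponds to the kind of parameter settings whose effect on stability the abstract mentions.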
Index Terms:
identification; language identification; on-line documents; word shapes; coded characters; character classes; visual characteristics; word bigrams; word trigrams; linear score value combination; expert system; knowledge acquisition; on-line texts; rules; accuracy; stability; varied parameter settings
Citation:
N. Nobile, S. Bergler, C.Y. Suen, S. Khoury, "Language identification of on-line documents using word shapes," Fourth International Conference Document Analysis and Recognition (ICDAR'97), p. 258, 1997