Word segmentation is the most fundamental and important process in Japanese and Chinese language processing. Because words in these languages are not separated by delimiters, the character sequence must first be split into words. For this problem, approaches based on probabilistic language models are known to be highly effective, and this has been demonstrated in practice. Separately, the word-valued source (WVS) has recently been proposed as a new class of source model for the source coding problem. This model can be expected to reflect more of the probabilistic structure of natural languages. A Japanese or Chinese sentence may be regarded as a sequence emitted from a non-prefix-free WVS. In this paper, as a first step toward applying the WVS to natural language processing, we formulate the word segmentation problem for sequences emitted from a non-prefix-free WVS. We then examine the segmentation performance for these models through numerical computation.
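To make the segmentation problem concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes a simplified source that emits words independently from a known distribution and concatenates them without delimiters. The word set is non-prefix-free (here "a" is a prefix of "ab"), so an observed sequence can have several valid parses, and the sketch picks the most probable one by dynamic programming. The names `segment` and `EXAMPLE_PROBS`, and the probabilities themselves, are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical word distribution; "a" is a prefix of "ab", so the set is not prefix-free.
EXAMPLE_PROBS = {"a": 0.4, "ab": 0.3, "b": 0.2, "ba": 0.1}


def segment(sequence, word_probs):
    """Return (words, log_prob) for the most probable parse of `sequence`.

    Assumes words are emitted independently with probabilities `word_probs`.
    Returns (None, -inf) when no parse into known words exists.
    """
    n = len(sequence)
    max_len = max(len(w) for w in word_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of sequence[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that parse
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_probs.get(sequence[start:end])
            if p is None or best[start] == -math.inf:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        return None, -math.inf

    # Follow the back-pointers from the end of the sequence to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(sequence[back[pos]:pos])
        pos = back[pos]
    words.reverse()
    return words, best[n]
```

For the sequence "aba" there are three parses under this distribution: a|b|a with probability 0.032, a|ba with probability 0.04, and ab|a with probability 0.12, so `segment("aba", EXAMPLE_PROBS)` selects the parse ab|a.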
Citation:
Takashi Ishida, Toshiyasu Matsushima, Shigeichi Hirasawa, "Word Segmentation for the Sequences Emitted from a Word-Valued Source," cit, pp.662-661, 7th IEEE International Conference on Computer and Information Technology (CIT 2007), 2007