Abstract: Statistical language models should improve as the size of the n-grams increases from 3 to 5 or higher.
However, the number of parameters, the amount of computation, and the storage requirement all increase very rapidly if we
attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-gram approach
previously developed by O'Boyle (1993) can be applied. A reduced n-gram language model can store an entire
corpus’s phrase-history length within feasible storage limits. Another theoretical advantage of reduced n-grams is
that they are closer to being semantically complete than traditional models, which include all n-grams. In our
experiments, the reduced n-gram Zipf curves are first presented and compared with those of conventional n-grams for
Irish, Chinese, and English. The reduced n-gram model is then applied to large Irish, Chinese, and English
corpora. For Irish, we reduce the model size by a factor of 15.1 relative to the traditional 7-gram model for a
7-million-word corpus, while obtaining a 41.63% perplexity improvement; for English, we reduce
the model sizes by factors of 14.6 for a 40-million-word corpus and 11.0 for a 500-million-word corpus, while
obtaining 5.8% and 4.2% perplexity improvements; and for Chinese, we obtain a 16.9% perplexity reduction and
reduce the model size by a factor greater than 11.2. This paper is a step towards the modeling of Irish, Chinese,
and English using semantically complete phrases in an n-gram model.
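To make the Zipf-curve comparison concrete, the following minimal Python sketch (not taken from the paper) counts overlapping n-grams in a tokenized corpus and lists their frequencies by rank, which is the rank-frequency data a Zipf curve plots; the corpus file name, the whitespace tokenization, and the choice of n = 3 are illustrative assumptions.

    from collections import Counter

    def ngram_counts(tokens, n):
        # Count every overlapping n-gram in the token sequence.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def zipf_curve(counts):
        # Rank n-grams by frequency (rank 1 = most frequent), ready for a log-log Zipf plot.
        freqs = sorted(counts.values(), reverse=True)
        return list(enumerate(freqs, start=1))

    if __name__ == "__main__":
        # "corpus.txt" and whitespace tokenization are hypothetical placeholders.
        with open("corpus.txt", encoding="utf-8") as f:
            tokens = f.read().split()
        for rank, freq in zipf_curve(ngram_counts(tokens, 3))[:10]:
            print(rank, freq)

Plotting rank against frequency on log-log axes for both the conventional and the reduced n-gram counts gives the kind of curves the experiments compare.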
Keywords: Reduced n-grams, Overlapping n-grams, Weighted average (WA) model, Katz back-off, Zipf’s law.
ACM Classification Keywords: I. Computing Methodologies - I.2 ARTIFICIAL INTELLIGENCE - I.2.7 Natural
Language Processing - Speech recognition and synthesis
Link: http://www.foibg.com/ijitk/ijitk-vol04/ijitk04-2-p07.pdf
MULTILINGUAL REDUCED N-GRAM MODELS
Tran Thi Thu Van and Le Quan Ha