Arabic preprocessing for Statistical Machine Translation
Nizar Habash and Fatiha Sadat
Abstract
Arabic is a morphologically rich language, which poses problems for statistical machine translation (SMT) approaches. In this chapter, we study the effect of different Arabic word-level preprocessing schemes and techniques on the quality of phrase-based SMT. We also present and evaluate different methods for combining preprocessing schemes. Our results show that, given large training data sets, splitting off only proclitics performs best. For small training data sets, however, it is best to apply English-like tokenization using part-of-speech tags together with sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing scheme produces a significant increase in BLEU score when there is a change in genre between training and test data. We also found that combining different preprocessing schemes leads to improved translation quality.
Chapters in this book
- Prelim pages
- Table of contents
- Preface
- Introduction
- Linguistic resources for Arabic machine translation
- Using morphology to improve Example-Based Machine Translation
- Using semantic equivalents for Arabic-to-English
- Arabic preprocessing for Statistical Machine Translation
- Preprocessing for English-to-Arabic Statistical Machine Translation
- Lexical syntax for Arabic SMT
- Automatic rule induction in Arabic to English machine translation framework
- Index