John Benjamins Publishing Company
Automatic extraction of terminological translation lexicon from Czech-English parallel texts
Abstract
We present experimental results of the automatic extraction of a Czech-English translation dictionary. Two bilingual corpora were created: a computer-oriented corpus of 119,886 sentence pairs and a journalistic corpus of 58,137 sentence pairs. For sentence alignment we used the length-based statistical method (Gale and Church 1991); for dictionary extraction we used a noun phrase marker based on a regular grammar, together with a probabilistic model (Brown et al. 1993). Each resulting dictionary contains about 6,000 entries. After significance filtering, weighted precision is 86.4% for the computer-oriented and 70.7% for the journalistic Czech-English dictionary.
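As an illustration of the length-based idea behind the sentence alignment step, the following is a minimal sketch of the length-match cost for one candidate sentence pair. It is not the authors' implementation. The constants `C` and `S2` are the values Gale and Church reported for their own language pairs and are assumed here; the abstract gives no Czech-English parameters. The function names are hypothetical, and the sketch leaves out the priors over match types (1-1, 1-2, and so on) and the dynamic programming search that a full aligner needs.

```python
import math

# Assumed constants (reported by Gale and Church for their own data,
# not taken from this chapter).
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of the length ratio per source character


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def length_delta(len_src, len_tgt, c=C, s2=S2):
    """Normalized length difference between two sentences (in characters)."""
    # Guard against a zero-length source sentence to avoid division by zero.
    return (len_tgt - len_src * c) / math.sqrt(max(len_src, 1) * s2)


def match_cost(len_src, len_tgt):
    """Negative log of the two-sided tail probability of the length difference.

    Lower cost means the two lengths are more compatible.
    """
    delta = length_delta(len_src, len_tgt)
    p = 2.0 * (1.0 - normal_cdf(abs(delta)))
    # Clamp to avoid log(0) for extreme length mismatches.
    return -math.log(max(p, 1e-300))


if __name__ == "__main__":
    # A pair of similar length should cost less than a mismatched pair.
    print(match_cost(100, 104))
    print(match_cost(100, 160))
```

In a full aligner, costs of this kind would be combined with match-type priors and minimized over the whole text by dynamic programming.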
Chapters in this book
- Prelim pages i
- Table of contents v
- Preface vii
- Automatic extraction of terminological translation lexicon from Czech-English parallel texts 1
- Words from Bononia Legal Corpus 11
- Hybrid approaches for automatic segmentation and annotation of a Chinese text corpus 31
- Distance between languages as measured by the minimal-entropy model 39
- The importance of the syntagmatic dimension in the multilingual lexical database 49
- Compiling parallel text corpora 59
- Data-derived multilingual lexicons 69
- Bridge dictionaries as bridges between languages 83
- Procedures in building the Croatian-English parallel corpus 93
- Corpus linguistics and lexicography* 109
- Analysing the fluency of translators 135
- Equivalence and non-equivalence in parallel corpora* 147
- Index 157