Chapter 11. Word alignment in the Russian-Chinese parallel corpus
Anastasia Politova
Abstract
The Russian-Chinese parallel corpus (RuZhCorp) was created in 2016 by sinologists and computational linguists. It now contains 1,074 texts and over 4.6 million words aligned at the sentence level. To produce word alignment for the entire corpus, we used deep neural networks trained both on the whole of RuZhCorp and on a gold dataset manually aligned at the word level. Drawing on principles presented in previous publications, we compiled the first word-to-word alignment guideline for the Russian-Chinese language pair, which makes manual alignment less ambiguous and more consistent. Joint fine-tuning of the LaBSE deep learning model on RuZhCorp and the gold dataset achieved the best alignment error rate (AER), 18.9%.
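For readers unfamiliar with the metric, the sketch below shows how an alignment error rate is commonly computed from a predicted alignment and a gold alignment. It assumes the widely used formulation AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted link set, S the sure gold links and P the possible gold links (with S contained in P). The abstract does not state which formulation the chapter uses, and the function name and toy links here are purely illustrative, not taken from RuZhCorp.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Return AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    Each link is a (source_index, target_index) pair. Sure links are
    treated as a subset of the possible links, as is conventional.
    """
    predicted = set(predicted)
    sure = set(sure)
    possible = set(possible or ()) | sure  # enforce S within P

    # Convention assumed here: two empty alignments agree perfectly.
    if not predicted and not sure:
        return 0.0

    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))


# Hypothetical toy sentence pair: indices stand for word positions.
sure_links = {(0, 0), (1, 1), (2, 2)}
possible_links = {(3, 2)}
predicted_links = {(0, 0), (1, 1), (3, 2), (3, 3)}

aer = alignment_error_rate(predicted_links, sure_links, possible_links)
print(f"AER = {aer:.3f}")
```

Lower values are better: a prediction identical to the sure links scores 0, while missed sure links and links outside the possible set both raise the rate.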
Chapters in this book
- Prelim pages i
- Table of contents v
- Cross-linguistic research and corpora 1
- Chapter 1. Light Verb Constructions as a testing ground for the Gravitational Pull Hypothesis 12
- Chapter 2. Light Verb Constructions in English-Spanish translation 34
- Chapter 3. Reporting direct speech in Spanish and German 51
- Chapter 4. “Ich bekomme es erklärt” 67
- Chapter 5. Exploring near-synonyms through translation corpora 91
- Chapter 6. run away! 108
- Chapter 7. Film dialogue synchronization and statistical dubbese 124
- Chapter 8. Opera audio description in the spoken-written language continuum 142
- Chapter 9. Using a multilingual parallel corpus for Journalistic Translation Research 157
- Chapter 10. Domain-adapting and evaluating machine translation for institutional German in South Tyrol 179
- Chapter 11. Word alignment in the Russian-Chinese parallel corpus 195
- Chapter 12. Building corpus-based writing aids from Spanish into English 216
- Index 235