Strategies for building high quality bilingual lexicons from comparable corpora
Pablo Gamallo Otero
Abstract
This chapter outlines two strategies to automatically build bilingual dictionaries: One is based on the use of a pivot language and existing bilingual dictionaries, while the other relies on string similarity and cognate extraction. Both strategies have in common the use of translation equivalents extracted from comparable corpora to filter out odd bilingual pairs and validate the correct ones. The correctness of the entries validated with comparable corpora is very high, close to that achieved by using parallel corpora. The chapter reports several case studies describing how to build new high-quality bilingual lexicons, namely English-Galician, English-Portuguese, and Portuguese-Spanish dictionaries with more than 90% precision. This outperforms state-of-the-art systems on bilingual extraction from comparable corpora, whose best scores hardly reach 70 or 80%.
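The two strategies and the shared validation step can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the similarity threshold, and the toy dictionary entries are all assumptions made for illustration, and the set of corpus-derived equivalents is simply taken as given rather than extracted from a comparable corpus.

```python
from difflib import SequenceMatcher


def pivot_candidates(src_to_pivot, pivot_to_tgt):
    """Strategy 1 (sketch): compose two existing dictionaries via a pivot language."""
    pairs = set()
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, ()):
                pairs.add((src, tgt))
    return pairs


def cognate_candidates(src_words, tgt_words, threshold=0.8):
    """Strategy 2 (sketch): pair words whose spelling similarity reaches a threshold.

    The threshold value is an arbitrary assumption, not a figure from the chapter.
    """
    return {
        (s, t)
        for s in src_words
        for t in tgt_words
        if SequenceMatcher(None, s, t).ratio() >= threshold
    }


def validate(candidates, corpus_equivalents):
    """Keep only candidate pairs that also appear among the translation
    equivalents obtained from comparable corpora (here: a precomputed set)."""
    return candidates & corpus_equivalents


# Toy entries, invented for illustration only.
en_to_pivot = {"spring": ["primavera", "muelle"]}
pivot_to_gl = {"primavera": ["primavera"], "muelle": ["peirao", "resorte"]}
corpus_equivalents = {("spring", "primavera"), ("spring", "resorte")}

pivot_pairs = pivot_candidates(en_to_pivot, pivot_to_gl)
validated_pairs = validate(pivot_pairs, corpus_equivalents)

cognate_pairs = cognate_candidates(
    ["universidade", "janela"], ["universidad", "ventana"]
)
```

In this toy run the ambiguous pivot word yields one odd pair, ("spring", "peirao"), which the validation step discards because it is absent from the corpus-derived set. That is the filtering role the abstract assigns to comparable corpora.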
Chapters in this book
- Prelim pages i
- Table of contents v
- Acknowledgments ix
- Parallel corpora in focus 1
Part I. Parallel corpora
- Comparable parallel corpora 19
- Living with parallel corpora 39
- Working with parallel corpora 57
- Innovations in parallel corpus alignment and retrieval 79
Part II. Parallel corpora
- InterCorp 93
- Corpus PaGeS 103
- Building EPTIC 123
- Enriching parallel corpora with multimedia and lexical semantics 141
- Discourse annotation in the MULTINOT corpus 159
- PEST 183
- Indexation and analysis of a parallel corpus using CQPweb 197
- P-ACTRES 2.0 215
- An overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus 233
Part III. Parallel corpora
- Strategies for building high quality bilingual lexicons from comparable corpora 251
- Discovering bilingual collocations in parallel corpora 267
- Normalization of shorthand forms in French text messages using word embedding and machine translation 281
- Index 299