John Benjamins Publishing Company
Working with parallel corpora
Abstract
Although parallel corpora are vital for cross-linguistic and natural language processing (NLP) research, most have been designed for just one particular purpose, which may unnecessarily restrict their usefulness and usability. I argue that existing parallel corpora become considerably more useful when the data they yield are combined with data from comparable and/or monolingual corpora. Usability criteria, such as the choice of processing tools and adherence to international standards, also affect corpus usefulness. This chapter proposes courses of action to improve the recycling and reprocessing of available resources. It also presents a corpus-based post-editing and quality assessment application as an illustration of the many uses parallel corpora may serve.
Chapters in this book
- Prelim pages i
- Table of contents v
- Acknowledgments ix
- Parallel corpora in focus 1
Part I. Parallel corpora
- Comparable parallel corpora 19
- Living with parallel corpora 39
- Working with parallel corpora 57
- Innovations in parallel corpus alignment and retrieval 79
Part II. Parallel corpora
- InterCorp 93
- Corpus PaGeS 103
- Building EPTIC 123
- Enriching parallel corpora with multimedia and lexical semantics 141
- Discourse annotation in the MULTINOT corpus 159
- PEST 183
- Indexation and analysis of a parallel corpus using CQPweb 197
- P-ACTRES 2.0 215
- An overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus 233
Part III. Parallel corpora
- Strategies for building high quality bilingual lexicons from comparable corpora 251
- Discovering bilingual collocations in parallel corpora 267
- Normalization of shorthand forms in French text messages using word embedding and machine translation 281
- Index 299