Chapter 4. Semantic textual similarity based on deep learning
Tharindu Ranasinghe, Ruslan Mitkov, Constantin Orăsan and Rocío Caro Quintana
Abstract
This study proposes an original methodology to underpin new-generation Translation Memory (TM) systems, in which the translations retrieved from the TM database are matched not by Levenshtein (edit) distance but by Natural Language Processing (NLP) and Deep Learning (DL) techniques. We experimented with three DL sentence encoders to retrieve TM matches for English-Spanish sentence pairs from the DGT TM dataset. Each sentence encoder was compared with Okapi, which uses edit distance to retrieve the best match. The automatic evaluation shows the benefit of DL technology for TM matching and holds promise for the implementation of the TM tool itself, which is our next project.
Chapters in this book
- Prelim pages i
- Table of contents v
- Corpus resources and tools 1
Part I. Corpus resources and tools
- Chapter 1. Now what ? 23
- Chapter 2. ZHEN 49
- Chapter 3. Word alignment in a parallel corpus of Old English prose 75
- Chapter 4. Semantic textual similarity based on deep learning 101
- Chapter 5. TAligner 3.0 125
- Chapter 6. Developing a corpus-informed tool for Spanish professionals writing specialised texts in English 147
Part II. Corpus-based studies and explorations
- Chapter 7. English and Spanish discourse markers in translation 177
- Chapter 8. The discourse markers well and so and their equivalents in the Portuguese and Turkish subparts of the TED-MDB corpus 209
- Chapter 9. Variation of evidential values in discourse domains 233
- Chapter 10. The translation for dubbing of Westerns in Spain 257
- Chapter 11. Generic analysis of mobile application reviews in English and Spanish 283
- Chapter 12. Exploring variation in translation with probabilistic language models 307
- Chapter 13. Binomial adverbs in Germanic and Romance Languages 325
- Index 343