Chapter 2. ReGap

A text-preprocessing algorithm to enhance MWE‑aware neural machine translation systems
  • Carlos Manuel Hidalgo-Ternero and Gloria Corpas Pastor
Published by John Benjamins Publishing Company

Abstract

This research presents ReGap, a text-preprocessing algorithm that automatically identifies discontinuous multiword expressions (MWEs) on a token basis and converts them into their canonical, i.e. continuous, form in order to optimise neural machine translation (NMT) systems. To this end, an experiment with flexible verb-noun idiomatic constructions (VNICs) assesses to what extent ReGap can enhance the performance of DeepL, which the authors regard as the most robust NMT system to date, when it faces MWE discontinuity in the Spanish-into-English and Spanish-into-German directions. The promising results obtained for VNICs point to new avenues for enhancing MWE‑aware NMT systems.
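To make the idea of converting a discontinuous MWE into its continuous form concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the inflected verb forms and the noun component of a VNIC are supplied by hand, that at most `max_gap` tokens may intervene, and that the intervening tokens are simply moved to the position right after the noun component. The function name `close_gap`, its parameters, and the example sentence are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set


def close_gap(tokens: List[str], verb_forms: Set[str],
              noun_part: Sequence[str], max_gap: int = 3) -> List[str]:
    """Return a copy of tokens in which material inserted between a verb
    form and the noun component of an idiom is moved after the noun component.

    Illustrative assumption: the idiom is described by a set of lower-cased
    verb forms and a fixed noun component such as ["el", "pelo"].
    """
    noun = [t.lower() for t in noun_part]
    n = len(noun)
    for i, tok in enumerate(tokens):
        if tok.lower() not in verb_forms:
            continue
        # Look for the noun component between 1 and max_gap tokens after the verb.
        for gap in range(1, max_gap + 1):
            start = i + 1 + gap
            window = [t.lower() for t in tokens[start:start + n]]
            if window == noun:
                inserted = tokens[i + 1:start]  # the tokens that break up the idiom
                return (tokens[:i + 1] + tokens[start:start + n]
                        + inserted + tokens[start + n:])
    # No discontinuous occurrence found: leave the sentence unchanged.
    return list(tokens)


# Spanish idiom "tomar el pelo" with an adverb inserted after the verb.
sentence = "Le tomaba siempre el pelo a su hermano".split()
result = close_gap(sentence, {"toma", "tomaba", "tomó"}, ["el", "pelo"])
print(" ".join(result))  # Le tomaba el pelo siempre a su hermano
```

In this sketch the rewritten sentence would then be passed to the NMT system in place of the original. A realistic system would need lemmatisation or morphological analysis rather than a hand-listed set of verb forms.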

Downloaded on 13.9.2025 from https://www.degruyterbrill.com/document/doi/10.1075/cilt.366.02hid/pdf