Chapter

Dutch compound splitting for bilingual terminology extraction

  • Lieve Macken and Arda Tezcan
Published by John Benjamins Publishing Company

Abstract

As compounds pose a problem for applications that rely on precise word alignments, we developed a state-of-the-art compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists.
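To illustrate the general idea of frequency-based splitting with combined frequency lists, here is a minimal sketch. It is not the authors' splitter: the linking elements in `LINKERS`, the linear interpolation in `interpolate` with its `weight` parameter, the geometric-mean scoring, the restriction to two-part splits, and all function names are assumptions chosen for illustration.

```python
# Assumed set of linking elements that may appear between compound parts.
LINKERS = ("", "s", "en", "e")


def interpolate(in_domain, out_domain, weight=0.7):
    """Linearly interpolate relative frequencies from two count lists.

    `weight` is the (assumed) share given to the in-domain list.
    """
    n_in = sum(in_domain.values()) or 1
    n_out = sum(out_domain.values()) or 1
    words = set(in_domain) | set(out_domain)
    return {
        w: weight * in_domain.get(w, 0) / n_in
        + (1 - weight) * out_domain.get(w, 0) / n_out
        for w in words
    }


def split_compound(word, freq, min_len=3):
    """Return the best two-part split of `word`, or [word] if none wins.

    A split is scored by the geometric mean of its parts' frequencies and
    is accepted only if it beats the frequency of the unsplit word.
    """
    best_parts = [word]
    best_score = freq.get(word, 0.0)
    for i in range(min_len, len(word) - min_len + 1):
        left, head = word[:i], word[i:]
        if head not in freq:
            continue
        for linker in LINKERS:
            if linker and not left.endswith(linker):
                continue
            # Strip the linking element from the left-hand part.
            modifier = left[: len(left) - len(linker)]
            if len(modifier) < min_len or modifier not in freq:
                continue
            score = (freq[modifier] * freq[head]) ** 0.5
            if score > best_score:
                best_score, best_parts = score, [modifier, head]
    return best_parts


# Toy counts, invented for this example only.
in_domain = {"fiets": 5, "band": 3}
out_domain = {"fiets": 50, "band": 40, "fietsband": 1, "stad": 30, "bestuur": 20}
freq = interpolate(in_domain, out_domain)
```

With these toy counts, `split_compound("fietsband", freq)` yields the parts `fiets` and `band`, and `split_compound("stadsbestuur", freq)` strips the linking `s` to give `stad` and `bestuur`. A real splitter would also need to handle compounds with more than two parts and apply linguistic constraints on valid components.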

As compounds are not always translated compositionally, we developed a novel methodology for word alignment. We train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts.
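The data preparation for such a two-pass setup might look like the sketch below. The abstract does not say how the two alignment runs are combined, so the projection of links back onto the original token positions is an assumption, as are the helper names `split_sentence` and `project_links` and the toy splitter. The word aligner itself is left out.

```python
def split_sentence(tokens, splitter):
    """Build the second-pass version of a sentence.

    Returns the tokens with compounds replaced by their parts, plus a list
    mapping each new token position to its position in the original sentence.
    """
    split_tokens, origin = [], []
    for idx, token in enumerate(tokens):
        for part in splitter(token):
            split_tokens.append(part)
            origin.append(idx)
    return split_tokens, origin


def project_links(links, origin):
    """Map (split_source_index, target_index) links to original positions.

    This projection step is an assumption made for illustration.
    """
    return sorted({(origin[src], tgt) for src, tgt in links})


# Hypothetical splitter standing in for a real compound splitter.
def toy_splitter(word):
    return {"fietsband": ["fiets", "band"]}.get(word, [word])


original = ["de", "fietsband", "lekt"]
split_tokens, origin = split_sentence(original, toy_splitter)

# Links as a second-pass aligner might return them (invented for the example).
second_pass_links = [(0, 0), (1, 1), (2, 2), (3, 3)]
projected = project_links(second_pass_links, origin)
```

Here the first pass would be trained on `original` and the second on `split_tokens`. Both component links for `fiets` and `band` are projected onto the single source position of `fietsband`.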

Experiments show that the compound splitter combined with the novel word alignment technique considerably improves bilingual terminology extraction results.


Downloaded on 7.9.2025 from https://www.degruyterbrill.com/document/doi/10.1075/cilt.341.07mac/html