Dutch compound splitting for bilingual terminology extraction

  • Lieve Macken and Arda Tezcan
Published by John Benjamins Publishing Company

Abstract

As compounds pose a problem for applications that rely on precise word alignments, we developed a state-of-the-art compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists.
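
The abstract does not spell out the splitting algorithm, so the following is only a minimal sketch of one common frequency-based approach, not the authors' implementation. It assumes that a split is scored by the geometric mean of its parts' frequencies, that in-domain and out-of-domain relative frequencies are combined by linear interpolation with a hypothetical weight `lam`, and that the Dutch linking elements "s" and "en" may be dropped from the modifier. The frequency lists, the names `combined_freq` and `split_compound`, and all thresholds are illustrative assumptions, and only binary splits are considered.

```python
from math import sqrt

# Hypothetical toy frequency lists (raw counts); real lists would be
# compiled from a large general corpus and from the in-domain data.
OUT_OF_DOMAIN = {"deur": 900, "klink": 40, "verkeer": 500, "regel": 700}
IN_DOMAIN = {"verkeer": 30, "regel": 20}
OUT_TOTAL = sum(OUT_OF_DOMAIN.values())
IN_TOTAL = sum(IN_DOMAIN.values())

LINKERS = ("", "s", "en")  # assumed linking elements between the parts
MIN_PART_LEN = 3           # assumed minimum length of a compound part


def combined_freq(word, lam=0.5):
    """Interpolate in-domain and out-of-domain relative frequencies."""
    p_in = IN_DOMAIN.get(word, 0) / IN_TOTAL
    p_out = OUT_OF_DOMAIN.get(word, 0) / OUT_TOTAL
    return lam * p_in + (1 - lam) * p_out


def split_compound(word, lam=0.5):
    """Return the best binary split, or [word] if no split scores higher."""
    best_parts, best_score = [word], combined_freq(word, lam)
    for i in range(MIN_PART_LEN, len(word) - MIN_PART_LEN + 1):
        left, head = word[:i], word[i:]
        for linker in LINKERS:
            if linker and not left.endswith(linker):
                continue
            # Strip the linking element from the modifier, if any.
            modifier = left[: len(left) - len(linker)] if linker else left
            if len(modifier) < MIN_PART_LEN:
                continue
            # Geometric mean of the two part frequencies.
            score = sqrt(combined_freq(modifier, lam) * combined_freq(head, lam))
            if score > best_score:
                best_parts, best_score = [modifier, head], score
    return best_parts


print(split_compound("deurklink"))
print(split_compound("verkeersregel"))
print(split_compound("tafel"))
```

With these toy lists, "verkeersregel" is split only when the linking "s" is stripped, because "verkeers" itself has no frequency. A word with no attested parts, such as "tafel" here, is left unsplit.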

As compounds are not always translated compositionally, we developed a novel methodology for word alignment. We train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts.
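
To make the second training pass concrete, here is a small sketch of how the split data set might be prepared so that alignment links found on it can be related back to the original tokens. The abstract does not describe how the two sets of alignments are combined, so the projection step, the function names `split_sentence` and `project_links`, the toy splitter, and the example links are all assumptions made for illustration. The word aligner itself is not shown.

```python
def split_sentence(tokens, splitter):
    """Split every token and record, for each resulting token,
    the index of the original token it came from."""
    split_tokens, origin = [], []
    for idx, token in enumerate(tokens):
        for part in splitter(token):
            split_tokens.append(part)
            origin.append(idx)
    return split_tokens, origin


def project_links(links, origin):
    """Map (split_source_index, target_index) links back to
    (original_source_index, target_index) links."""
    return sorted({(origin[src], tgt) for src, tgt in links})


def toy_splitter(token):
    # Hypothetical stand-in for a real compound splitter.
    table = {"deurklink": ["deur", "klink"]}
    return table.get(token, [token])


dutch = ["de", "deurklink", "is", "kapot"]
split_dutch, origin = split_sentence(dutch, toy_splitter)

# Hypothetical links an aligner might produce on the split data for the
# English sentence "the door handle is broken".
links_on_split = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
projected = project_links(links_on_split, origin)

print(split_dutch)
print(projected)
```

In this toy case the compound "deurklink" ends up linked to both "door" and "handle", which is the kind of one-to-many correspondence that is hard to recover from the unsplit data alone.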

Experiments show that the compound splitter combined with the novel word alignment technique considerably improves bilingual terminology extraction results.

Downloaded on 7 September 2025 from https://www.degruyterbrill.com/document/doi/10.1075/cilt.341.07mac/pdf