Home General Interest Multiword expressions in multilingual information extraction
Chapter
Licensed
Unlicensed Requires Authentication

Multiword expressions in multilingual information extraction

  • Gregor Thurmair
View more publications by John Benjamins Publishing Company

Abstract

Multilingual Information Extraction requires significant Multiword Expressions (MWE) processing as many such items are multiwords. The lexical representation of MWEs supports large bilingual lexicons (for Persian, Pashto, Turkish, Arabic); multiwords are represented like single words, extended by two annotations: MWE head, and lemma plus part of speech for the MWE parts. In text analysis, MWEs are recognised as part of the parsing process, mot as pre- or post-processing components. The analysis design extends the X-bar scheme by a level for multiword rules. In transfer, MWEs are translated as elementary nodes like single word lemmata, to present key concepts for relevance judgement in Information Extraction. Evaluation shows that 90% of the MWE patterns in the lexicon can be analysed with about 150 MWE-specific rules, and that more than 90% of text document tokens are covered by the proposed integrated single and multiword processing.

Abstract

Multilingual Information Extraction requires significant Multiword Expressions (MWE) processing as many such items are multiwords. The lexical representation of MWEs supports large bilingual lexicons (for Persian, Pashto, Turkish, Arabic); multiwords are represented like single words, extended by two annotations: MWE head, and lemma plus part of speech for the MWE parts. In text analysis, MWEs are recognised as part of the parsing process, mot as pre- or post-processing components. The analysis design extends the X-bar scheme by a level for multiword rules. In transfer, MWEs are translated as elementary nodes like single word lemmata, to present key concepts for relevance judgement in Information Extraction. Evaluation shows that 90% of the MWE patterns in the lexicon can be analysed with about 150 MWE-specific rules, and that more than 90% of text document tokens are covered by the proposed integrated single and multiword processing.

Downloaded on 29.12.2025 from https://www.degruyterbrill.com/document/doi/10.1075/cilt.341.05thu/html
Scroll to top button