Multiword expressions in multilingual information extraction
Gregor Thurmair
Abstract
Multilingual Information Extraction requires significant Multiword Expressions (MWE) processing as many such items are multiwords. The lexical representation of MWEs supports large bilingual lexicons (for Persian, Pashto, Turkish, Arabic); multiwords are represented like single words, extended by two annotations: the MWE head, and the lemma plus part of speech for each of the MWE parts. In text analysis, MWEs are recognised as part of the parsing process, not as pre- or post-processing components. The analysis design extends the X-bar scheme by a level for multiword rules. In transfer, MWEs are translated as elementary nodes like single-word lemmata, to present key concepts for relevance judgement in Information Extraction. Evaluation shows that 90% of the MWE patterns in the lexicon can be analysed with about 150 MWE-specific rules, and that more than 90% of text document tokens are covered by the proposed integrated single and multiword processing.
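The lexical representation described in the abstract can be pictured with a minimal sketch. It is not the chapter's actual lexicon format: the class names `LexEntry` and `MwePart`, the field names, the part-of-speech tags, and the Turkish sample entry are all assumptions made for illustration. The only points taken from the abstract are that a multiword entry looks like a single-word entry and carries two extra annotations, the head and the lemma plus part of speech of each part.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MwePart:
    """One component of a multiword entry: lemma plus part of speech."""
    lemma: str
    pos: str


@dataclass(frozen=True)
class LexEntry:
    """A bilingual lexicon entry (hypothetical layout).

    Single words and multiwords share the same basic fields; a multiword
    additionally records its parts and which part is the head.
    """
    lemma: str
    pos: str
    translation: str
    parts: Tuple[MwePart, ...] = ()  # empty for single-word entries
    head: int = 0                    # index into `parts` (assumed encoding)

    @property
    def is_multiword(self) -> bool:
        return len(self.parts) > 1

    def head_part(self) -> MwePart:
        """Return the head part; a single word is its own head."""
        if not self.is_multiword:
            return MwePart(self.lemma, self.pos)
        return self.parts[self.head]


# Illustrative entries; the tag set ("N", "V") is an assumption.
entry = LexEntry(
    lemma="karar vermek",
    pos="V",
    translation="decide",
    parts=(MwePart("karar", "N"), MwePart("vermek", "V")),
    head=1,
)
single = LexEntry(lemma="karar", pos="N", translation="decision")
```

Because both kinds of entry share one type, a transfer step can treat a multiword as a single elementary node and look up `translation` without caring how many tokens the source expression spans.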
Chapters in this book
- Prelim pages i
- Table of contents v
- About the editors vii
- Multiword units in machine translation and translation technology 1
Part 1. Multiword units in machine translation
- Analysing linguistic information about word combinations for a Spanish-Basque rule-based machine translation system 41
- How do students cope with machine translation output of multiword units? An exploratory study 61
- Aligning verb + noun collocations to improve a French-Romanian FSMT system 81
Part 2. Multiword units in multilingual NLP applications
- Multiword expressions in multilingual information extraction 103
- A multilingual gold standard for translation spotting of German compounds and their corresponding multiword units in English, French, Italian and Spanish 125
- Dutch compound splitting for bilingual terminology extraction 147
Part 3. Identification and translation of multiword units
- A flexible framework for collocation retrieval and translation from parallel and comparable corpora 165
- On identification of bilingual lexical bundles for translation purposes 181
- The quest for Croatian idioms as multiword units 201
- Corpus analysis of Croatian constructions with the verb doći ‘to come’ 223
- Anaphora resolution, collocations and translation 243
- Index 257