Empirical variability of Italian multiword expressions as a useful feature for their categorisation
Luigi Squillante
Abstract
In contemporary linguistics, the definition of multiword expressions (MWEs) remains controversial. It is intuitively clear that some words, when they appear together, share a “special bond” in terms of meaning (e.g. black hole, mountain chain) or lexical choice (e.g. strong tea, to fill a form), unlike free combinations. Nevertheless, the great variety of features and anomalous behaviours that these expressions exhibit makes it difficult to organise them into categories and gives rise to a large body of different and sometimes overlapping terminology.
So far, most approaches in corpus linguistics have focused on automatically extracting MWEs from corpora using statistical association measures. Theoretical aspects of their definition, typology and behaviour, as they emerge from quantitative corpus-based studies, have not been widely explored, especially for languages with rich morphology and relatively free word order, such as Italian.
This contribution shows that a systematic analysis of the empirical behaviour of Italian MWEs in large corpora, with respect to parameters such as syntactic and lexical variation, is useful for categorising the expressions into homogeneous sets, which approximately correspond to what are intuitively known as multiword units (“polirematiche” in the Italian lexicographic tradition) and lexical collocations. The value of this approach is that the resulting categorisation of MWEs is grounded in empirical data rather than in intuitive and not always coherent linguistic definitions.
The variational features taken into account are (1) the possibility for the expressions to be syntactically transformed, and (2) the possibility for one of the components to be replaced with a synonym. Given an annotated corpus and a list of expressions, these features can be investigated automatically and quantitatively using purpose-built tools, whose methodology is fully explained. The kind and the magnitude of the attested variations appear highly correlated with the grammatical structure of a given phrase, indicating that the bond between the components of a multiword unit or a lexical collocation can be formed by activating different kinds of restrictions, depending on the grammatical pattern considered.
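As a rough illustration of feature (2), the share of occurrences in which one component of an expression is replaced by a synonym could be computed along the following lines. This is only a minimal sketch and not the chapter's actual tool: the function name `substitution_variability`, the synonym table and all counts are hypothetical, invented for the example, and the formula itself is an assumption about how such a score might be defined.

```python
from collections import Counter


def substitution_variability(expression, synonyms, pair_counts):
    """Share of occurrences in which one component is replaced by a synonym.

    expression  -- a (first_lemma, second_lemma) pair
    synonyms    -- dict mapping a lemma to a list of candidate synonyms
    pair_counts -- Counter of lemma pairs observed in an annotated corpus
    Hypothetical scoring: variants / (base form + variants).
    """
    first, second = expression
    base = pair_counts[(first, second)]
    variants = 0
    # Occurrences where the first component is swapped for a synonym.
    for alt in synonyms.get(first, ()):
        variants += pair_counts[(alt, second)]
    # Occurrences where the second component is swapped for a synonym.
    for alt in synonyms.get(second, ()):
        variants += pair_counts[(first, alt)]
    total = base + variants
    return variants / total if total else 0.0


# Toy counts, invented purely for illustration (not corpus data).
pair_counts = Counter({
    ("black", "hole"): 90,
    ("strong", "tea"): 30,
    ("powerful", "tea"): 10,
})
synonyms = {"black": ["dark"], "strong": ["powerful"]}

for expr in [("black", "hole"), ("strong", "tea")]:
    print(expr, substitution_variability(expr, synonyms, pair_counts))
```

Under this assumed definition, a score near zero would suggest a lexically fixed expression, while a higher score would suggest that the components tolerate substitution. An analogous count over syntactic variants could be sketched for feature (1).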
Chapters in this book
- Prelim pages i
- Table of contents v
- Foreword vii
- Introduction 1
- Monocollocable words 9
- Translation asymmetries of multiword expressions in machine translation 23
- German constructional phrasemes and their Russian counterparts 43
- Computational phraseology and translation studies 65
- Computational extraction of formulaic sequences from corpora 83
- Computational phraseology discovery in corpora with the mwetoolkit 111
- Multiword expressions in comparable corpora 135
- Collecting collocations from general and specialised corpora 151
- What matters more: The size of the corpora or their quality? 177
- Statistical significance for measures of collocation strength 189
- Verbal collocations and pronominalisation 207
- Empirical variability of Italian multiword expressions as a useful feature for their categorisation 225
- Too big to fail but big enough to pay for their mistakes 247
- Multi-word patterns and networks 273
- How context determines meaning 297
- Detecting semantic difference 311
- Index 325