
Multimodal constructions revisited. Testing the strength of association between spoken and non-spoken features of Tell me about it

  • Claudia Lehmann
Published/Copyright: July 12, 2024

Abstract

The present paper addresses the notion of multimodal constructions. It argues that Tell me about it is a multimodal construction that consists of a fixed spoken and a variable, but largely obligatory multimodality slot on the formal side of the construction. To substantiate this claim, the paper reports on an experiment that shows that, first, hearers experience difficulties in interpreting Tell me about it when it is neither sequentially nor multimodally marked as either requesting or stance-related and, second, hearers considerably rely on multimodal features when a sequential context is missing. In addition, the experiment also shows that the more features are used, the better hearers get at guessing the meaning of Tell me about it. These results suggest that, independent of the question of whether the multimodal features associated with requesting or stance-related Tell me about it are non-spoken, unimodal constructions themselves (like a raised eyebrows construction), a schematic multimodality slot might be part of the constructions.

1 Introduction

Construction Grammar, as the name suggests, is a family of theories on language-related knowledge. The members of this family agree that linguistic knowledge takes the form of constructions, but they differ significantly in how they define these. From a usage-based perspective, “constructions are understood to be emergent clusters of lossy memory traces that are aligned within our high- (hyper!) dimensional conceptual space on the basis of shared form, function, and contextual dimensions” (Goldberg 2019: 7). Put differently, language-related knowledge emerges from language use. Frequency of occurrence plays a huge role in this process of entrenchment, “by which linguistic experiences are mentally encoded and committed to memory” (Divjak 2019: 131). Defined in this way, entrenching structures is a cognitive process relating to the individual, while conventionalization is a social process “defined as the continuous mutual coordination and matching of communicative knowledge and practices” (Schmid 2015: 10). The exact relation between (cognitive) entrenchment and (social) conventionalization is subject to debate (see Divjak 2019: Ch. 5 for an overview). Still, the present paper assumes that entrenchment and conventionalization mutually influence each other: Speakers tend to use constructions they have entrenched and, by using them (frequently), the likelihood increases that they become conventionalized in a community. Vice versa, conventionalized constructions are likely to be used more frequently in a community, which makes them more likely to be entrenched by individual speakers. What is interesting to note about the wordings of the quotes above is the object of analysis that is explicated. While Divjak (2019) refers to linguistic experiences, Schmid (2015) refers to communicative knowledge and practices, and Goldberg (2019) to emergent clusters, without any specificity about the nature of these clusters. However, entrenched linguistic experiences need not be the same as communicative knowledge, since, traditionally, linguistics is concerned with verbal human language (including sign languages), while communication studies deal with animal communication systems and non-verbal communication (as a more recent example of this conceptualization, see the scope statement of the recently launched journal Language Communication, which accepts articles dealing with “insights into human communication with language, including the interface with non-verbal modalities” (Vulchanova 2024)). Even though the door seems open for usage-based construction grammar to also embrace patterns that have not been considered linguistic in the past, many constructions under study are almost exclusively verbal[1] ones (see e.g., Goldberg 2019; Hilpert 2019; Hoffmann 2022).

Spoken language, however, almost always occurs in multimodal situations, e.g., in face-to-face interactions, which is its most natural habitat (Feyaerts et al. 2017; Perniss 2018; Vigliocco et al. 2014). Following Bateman et al., multimodality is defined as “a way of characterising communicative situations (considered very broadly) which rely upon combinations of different ‘forms’ of communication to be effective” (2017: 7). More precisely, a semiotic mode (i.e., a form of communication) is a “three-way layered configuration of semiotic distinctions developed by a community of users in order to achieve some range of communicative or expressive tasks” (Bateman 2022: 68). The three layers of a semiotic mode are the material substrate (i.e., the ‘stuff’ that is manipulated to create meaning), the form (i.e., the regularities that can be found in the ‘stuff’), and the discourse semantics (i.e., the meaning). Given this definition of modes, spoken face-to-face interactions are multimodal, because meaning in these situations is not only conveyed by sounds combined into morpho-syntactic patterns, but also prosody, bodily-visual behaviour, or object manipulation, to name the most prominent modes (for a more thorough discussion of mode and why prosody is a mode different from sounds see Lehmann 2024). First language acquisition happens in such multimodal situations and children have to generalize the necessary formal features of a construction to integrate it into their constructicons.

What follows from these observations on language usage is the question of whether constructions, i.e., the lossy memory traces in speakers’ minds, can be multimodal, too. A construction can be considered amodal when its formal properties are not specified for mode. An example is the ditransitive construction. Being schematic, the ditransitive construction is not lexically specified and contains neither information on its pronunciation nor on its spelling. Only in a usage event do its slots become filled with other constructions carrying such information, and then the ditransitive construction can be spoken, written, or even gestured (e.g., when pantomiming). A construction can be considered unimodal when its formal properties carry mode-related information concerning one mode exclusively. This is the case for fixed constructions like cat, which is specified for pronunciation,[2] but neither for prosody nor any kind of bodily-visual behaviour. Likewise, emblematic gestures, i.e., gestures that “take place independently of verbal language, and which are made deliberately, with a full communicative intention” (Payrató and Clemente 2020: 48), qualify as unimodal constructions (in the sense of Goldberg 2019). An example is the thumbs up gesture for approval, which exclusively uses the gestural mode and whose meaning is based on convention (see also Hoffmann 2021). Finally, a construction can be considered multimodal when its formal properties carry mode-related information concerning several modes that are aligned with sufficiently similar functions and contextual dimensions (here: communicative situations). Hoffmann (2021) argues that the domain-general process that drives this alignment is Conceptual Blending (Fauconnier and Turner 2002; Turner 2010): at least two input spaces (here: from two different modes) are combined to create emergent meanings, which have the potential to become a multimodal construction. Ziem (2017) differentiates two kinds of multimodal constructions: inherently multimodal constructions and multimodal constructions based on highly frequent co-occurrence of spoken and non-spoken constructions. Examples of (inherently) multimodal constructions are deictic constructions like this, which need a deictic (pointing) gesture to be used meaningfully in face-to-face interactions (Levinson 2006). To identify inherently multimodal constructions, Ziem (2017) proposes a deletion test (for the two modes in question): if the non-spoken mode is erased from the construction and the construction becomes unintelligible as a result, the construction qualifies as inherently multimodal. Schoonjans (2017) reflects on multimodal constructions based on recurrence and argues that such recurrence could be found at a rather abstract level, meaning that a spoken construction could necessitate the co-occurrence of, e.g., a gesture, without specifying any gesture in particular.

Moreover, unimodal spoken constructions may interact with other unimodal (non-spoken) constructions to such an extent that they may be considered cross-modal collostructions. Collostructions are positive associations of varying strength between a simple and a more complex construction, like the association between give and the ditransitive construction (Stefanowitsch and Gries 2003). The same relationship might hold between unimodal constructions of different modes, in which case one might speak of cross-modal collostructions (Uhrig 2022) or cross-modal associations (Mittelberg 2017). If the thumbs up gesture is frequently accompanied by a spoken construction also conveying approval (like great), and vice versa, the two would form a cross-modal collostruction. In line with the original definition of collostruction, both units need to qualify as constructions (i.e., they must exhibit shared form, function and contextual dimensions). If the non-spoken aspect in a usage event does not qualify as a construction, the combination either needs to be considered a multimodal construction (provided that it occurs with sufficient frequency) or an ad-hoc combination.

While the majority of constructions might be unimodal, there is growing evidence that some spoken constructions recur with particular prosodic configurations and/or visuals, which have the potential of blurring the boundaries of purely language-related knowledge – if language is defined in a narrow sense. The present paper argues that multimodal constructions exist by discussing the evidence for two potentially multimodal constructions, i.e., requesting and stance-related Tell me about it. To this end, the paper proceeds as follows: Section 2 provides an overview of research at the (construction) grammar – multimodality interface, which shows that the majority of case studies so far use corpus-based evidence. Section 3 introduces the two constructions under investigation in the present paper and summarizes a corpus study exploring the prosodic and bodily-visual resources that recur with requesting and stance-related Tell me about it, respectively. Section 4 lays out the details of an experiment and a focus group study, which were conducted to test the predictions of the corpus study. Section 5 presents the results of these two studies and Section 6 discusses them; the results show that hearers use these resources to interpret Tell me about it as either requesting or stance-related. Finally, in Section 7, the three studies are discussed regarding what they reveal about the (potentially) cognitive reality of multimodal constructions. Section 7 argues that, while some of the bodily-visual and prosodic resources identified by the corpus study, i.e., gaze behaviour, movements in the eyebrow and mouth region, head movements, and duration, are unimodal constructions in their own right, and, thus, might form cross-modal collostructions with Tell me about it, the results of the experiment suggest that the Tell me about it constructions have an obligatory, but schematic multimodality slot that can be filled by a non-spoken feature with or without constructional status. The results of the focus group discussion further show that some language users consciously use these multimodal features, thereby providing further evidence for them being entrenched in their constructicon. Section 8 summarizes the main argument of the paper and offers further conclusions.

2 Previous research in multimodal (construction) grammar

As mentioned above, there are communicative patterns that fulfil Goldberg’s (2019) definition of construction, but which are neither spoken nor written (nor are they exclusively used by signers). The most straightforward examples are so-called emblems like the thumbs up gesture, head shakes for negation, or ‘fingers crossed’ for well-wishing. By definition, emblems work independently of spoken language and, furthermore, they also seem to work (largely) independently of sequential context. Because of their autonomy, Kendon (2004) also calls them quotable gestures. Research in the past decade has shown, though, that there are more such unimodal patterns that have not been cited as emblems before, but could be considered ‘bodily-visual constructions’, whose interpretation relies more heavily on their sequential positioning. Smiling, for example, and in particular intense smiling is a marker of humorous events and used to negotiate them (Gironzetti et al. 2019). Gaze behaviour is also highly systematic. For example, speakers tend to look at their interlocutors at the end of their turn to select the next speaker and/or to check on their availability (see Degutyte and Astell 2021 for a systematic review on eye gaze in turn-taking and evidence for gaze as a resource for next-speaker selection). On a different note, Feyaerts et al. (2022) show that the recipients of an expression of self-obviousness react to these by raising their eyebrows in response in a systematic way. And, finally, Debras and Cienki (2012) analysed the use of head tilts in conversation and found that hearers tilt their heads when listening to their conversational partners to show attentiveness. All of these findings can be reanalysed as constructions with a bodily-visual form with a particular function given a particular sequential context.

Apart from bodily-visual constructions, there is also evidence of (unimodal) prosodic constructions. A prosodic construction “is a temporal configuration of prosodic features, has a meaning, is not necessarily closely aligned with words, can be present to a greater or lesser degree, can share aspects of meaning and form with related (sister and daughter) constructions, [and] can appear superimposed with other form-meaning mappings” (Ward 2019: 108). An example of this is the Consider This construction, which formally consists of a high-pitched and loud voice in a slow tempo, followed by a flat pitch contour and another pitch peak at the end. Functionally, it provides information that may be new to the hearer and is presented to advance an ongoing discussion (Ward 2019: 5–23). Inspired by traditional assumptions, there is also research at the intonation (i.e., pitch movement) – construction grammar interface: Marandin (2006) describes French intonation contours as constructions with independent meanings that can match grammatical constructions. More recently, Gras and Elvira-García (2021) show that the Spanish insubordinate conditional construction pairs with independently existing Spanish intonational constructions as long as their meanings are compatible. These examples show that prosodic aspects of speech can constitute independent, unimodal constructions.

In face-to-face interactions, both bodily-visual and prosodic constructions often co-occur with spoken constructions and if they do so frequently and with one spoken construction in particular, this might be called a cross-modal collostruction. Unfortunately, the studies on prosodic constructions mentioned above do not offer sufficient information on the frequencies with which these prosodic constructions co-occur with particular spoken constructions. However, there is some evidence for cross-modal collostructions of bodily-visual and spoken constructions: Mittelberg (2017) argues that German es gibt (there is) is cross-modally associated with gestures of giving or holding (palm-up open hand or the bimanual palm-vertical open hand gesture, respectively), but does not provide frequency counts. Schoonjans (2018) shows that the German particle einfach is accompanied by head shakes in 24 % of all usage events in a corpus of TV data and Austrian parliamentary debates. And Uhrig (2022) shows that verbs of throwing co-occur with a gesture in 54 % of usage events considered. More often than not, this gesture was iconic, i.e., it depicted the act of throwing. To sum up, the frequency observations made in Schoonjans (2018) and Uhrig (2022) as well as the qualitative observations made in Mittelberg (2017) are too systematic to be ignored by usage-based approaches to grammar. At the same time, often enough, the spoken construction is used (and, presumably, understood) without a gesture being co-present. Cross-modal collostructions thus seem to fail Ziem’s (2017) deletion test.

In addition to cross-modal collostructions, for which only a weak association between a spoken construction and, at least, one construction of a different mode can be assumed, there is also tentative evidence for multimodal constructions. Apart from deictic this (Levinson 2006), the German deictic expression so has also been shown to be often accompanied by a deictic gesture (Ningelgen and Auer 2017). Internet memes consist of a written and a visual element and removing one of these runs the risk of making the meme unintelligible, which is why both Dancygier and Vandelanotte (2017) and Bülow et al. (2018) argue that memes are (inherently) multimodal constructions. There is also evidence for inherently multimodal constructions with prosodic elements. Sadat-Tehrani (2010), for example, describes a Persian construction with a specific intonation pattern that seems to be idiosyncratic for this grammatical construction. Likewise, Põldvere and Paradis (2020) show that the reactive What-X (What to Russia) construction has particular prosodic peculiarities (usually, what is unaccented and the construct is produced with a rise-fall intonation) and is mostly used as a request for reconfirmation. Zima (2017) also presents evidence for a non-inherently multimodal construction. She shows that the construction [all the way from X PREP Y] co-occurs with a gesture in 80 % of uses. The use of a co-speech gesture is highest when the construction is used while pointing at a physical object (86 % of cases). However, a co-speech gesture is also frequently used with the spoken construction in cases of abstract pointing (i.e., pointing at an empty space) establishing a referent in space (86 %) and time (56 %). These latter uses of [all the way from X PREP Y] are good candidates for non-inherently multimodal constructions.

With some exceptions (Feyaerts et al. 2022; Gironzetti et al. 2019; Gras and Elvira-García 2021), most of the studies reviewed above are exclusively based on corpus evidence. Corpus frequencies provide insights into the conventionalization of a construction within a speech community, but it has been argued that frequency is only one predictor of entrenchment, which is assumed to stand in a mutual relationship to conventionalization. This is because “Conventionalization does not operate in a vacuum: instead, agreement on how to solve a communicative task can only be achieved using structures that have been entrenched by individuals” (Divjak 2019: 136). Schmid (2020: Ch. 11) discusses factors other than frequency that influence entrenchment, i.e., salience, iconicity, and embodiment. Given the insight that corpus frequency can only provide one piece of the puzzle regarding constructionhood, other pieces of evidence concerning the existence of (multimodal) constructions are warranted.[3]

The present paper contributes to the debate on multimodal constructions in two significant ways. First, it puts the frequency observations made for one construction, Tell me about it, to the test. The review of corpus linguistic evidence (Section 3) shows that stance-related Tell me about it is frequently accompanied by acoustic as well as visual features that make it multimodally different from its requesting counterpart. Section 4 will report on a forced choice experiment, in which the entrenchment of these recurring features was tested. The results of this experiment show that the participants made significant use of multimodal features to identify requesting and stance-related Tell me about it out of context. Further corroborating evidence comes from a focus group, which discussed possible features that explain the results of the forced-choice experiment. The second objective of the present paper is to argue that stance-expressing Tell me about it is a multimodal construction because (a) there is still a lack of evidence that all multimodal features co-occurring with stance-related Tell me about it can be described as independent constructions and, more importantly, (b) because even if they can, the results of the experiment show the crucial role multimodal features play in identifying the construction, which makes the assumption of a multimodality slot as an obligatory part of the construction highly plausible.

3 The multimodality of Tell me about it

Tell me about it is ambiguous and can instantiate two different constructions (Lehmann 2023; Lehmann and Bergs 2021). It can be used to request information or to express a (usually negative) affective stance and claim epistemic authority. Example (1)[4] illustrates the use of Tell me about it as a request:

  (1) silk thread

    01 SBK: Quilt artist Beth Schillig is here to talk about silk thread for quilting.
    02 Wonderful.
    03 Welcome!
    04 BS: Thank you!
    05 Happy to be here.
    06 SBK: Yeah!
    07 So I’ve not really quilted with silk thread.
    08 Tell me about it.
    09 BS: Okay.
    10 I love sewing with, quilting with silk thread.
    11 I think a lot of people are intimidated by it, though,
    12 and I think it goes back to our feeling or thinking of,
    13 “Oh, silk garments, they’re for special occasions,
    14 so silk is a special occasion thread.”
    ((several lines omitted))
    15 BS: Now, first of all, it does come in different weights,
    16 and what I wanna focus on today is 100 weight silk thread.
    (2022-12-08_1500_US_KCET_Democracy_Now; 14:41–15:17)

Example (1) illustrates the usual sequential context in which requesting Tell me about it occurs. Example (1) comes from a morning show. The host, SBK, introduces a guest, BS, who is considered an expert on using silk thread for quilting (line 01). SBK admits that she has no experience with this kind of thread (line 07) and requests more information on this subject using Tell me about it (line 08), thereby initiating speaker transition. After a topical digression on why someone might be “intimidated” (line 11) to use silk thread, BS begins an elaboration (lines 15 and 16), i.e., she provides the requested information.

Example (2) illustrates stance-related Tell me about it:

  (2) things take time

    01 OY: And all these things take time.
    02 Very important issues but of course they take time.
    03 How do- How does the public stay engaged with stuff like this.
    04 when, as you know, the legal system sometimes the way it operates.
    05 tends to do so in a way that makes people tune out?
    06 DA: Tell me about it.
    07 As a prosecutor I see that all the time.
    08 People are so frustrated.
    (2022-08-23_0900_US_CNN_Early_Start_With_Christine_Romans_and_Laura_Jarrett; 5:47–6:07)

Example (2) comes from a news program and shows a snippet from an interview between OY, the news anchor, and DA, State Attorney for Palm Beach, Florida. OY criticizes the legal system for being slow (lines 01 and 02), which, from his perspective, results in the public losing their interest in legal matters (lines 03 to 05). DA, in turn, agrees with this criticism and, simultaneously, claims epistemic authority by using Tell me about it (line 06). This claim is substantiated by what follows: He reminds the audience that he is a prosecutor (line 07), i.e., an integral part of the legal system, which is, technically, not necessary, because DA has been introduced as a State Attorney already. He also draws attention to the fact that he is familiar with the observation that “people are so frustrated” (line 08), thereby expressing a (negative) affective stance towards the pace of the legal system.

In a nutshell, when compared to requesting Tell me about it, stance-expressing Tell me about it differs in the following aspects:

  1. Its speaker claims epistemic authority.

  2. It expresses an affective stance towards an entity.

  3. It necessitates neither speaker transition nor an informing sequence.

Using the multimodal archive NewsScape Library of International Television News (Steen and Turner 2013), Lehmann (2023) analysed acoustic and visual features that frequently co-occur with the two uses of Tell me about it. In this large-scale corpus study, she showed that stance-related Tell me about it also differs from its requesting counterpart along the following dimensions, which are listed here in descending order of predictive power:

  1. Speakers tend to raise their eyebrows.

  2. Speakers tend to look anywhere but at the camera and/or the recipient.

  3. Speakers tend to smile.

  4. Speakers tend to either tilt, nod, or shake their heads.

  5. Speakers tend to be slower.

The list above suggests that, overall, the bodily-visual features (raised eyebrows, gaze aversion, smiling and head movements) have stronger predictive power than the prosodic feature (slow tempo).

Some of these features are good candidates of being nonverbal constructions. These are raised eyebrows, the head movements listed above, and smiles. Raised eyebrows have been reported to indicate sarcasm (Tabacaru 2019, 2020; Tabacaru and Lemmens 2014) and assessments more generally (Dix and Groß 2021). Head movements also seem to convey meanings beyond their uses as emblems of negation or affirmation. Head tilts (often as part of shrugs) are used to show disaffiliation with a third party (Debras and Cienki 2012) or obviousness (Jehoul et al. 2017), while head shakes can be used to intensify meanings (McClave 2000), express doubt (Kendon 2002) or as an evidential marker (Kendon 2002; Schoonjans 2018). Head nods can also be used as an evidential marker (Schoonjans 2018). Finally, as mentioned above, smiles have been shown to indicate humour in naturally occurring conversations (Gironzetti et al. 2016, 2019). Based on these observations, assuming pairings of nonverbal form and function seems plausible for these features.

This is less obvious for gaze aversion and speech tempo. As mentioned above, Ward (2019) and Gras and Elvira-García (2021) identify prosody-meaning pairings and, technically, it is possible that stance-related Tell me about it forms a cross-modal collostruction with a prosodic construction. However, slow tempo alone is no prosodic construction, since speakers may have many reasons to speak slowly and it seems unlikely that a feature like speech tempo on its own is conventionalized. Still, since Lehmann (2023) in her study on the multimodality of Tell me about it only analysed pitch and speech tempo as prosodic features, there might be other aspects of the prosody of Tell me about it that turn out to form a prosodic construction. Similar conclusions can be drawn from the research on gaze aversion. Speakers may have many reasons not to look at their interlocutors. In fact, in face-to-face conversations among peers, speakers often shift their gaze, and not looking at the other participants when speaking seems common (Rossano 2013). Recipients, on the other hand, tend to look at the speaker, and systematically look away when making a bid for closure and intending to take the next turn-at-talk. Speakers’ gaze aversion seems to be particularly long when reacting to a previous stance act and when claiming epistemic authority (Haddington 2006). In addition, Colston (2020) showed that gaze aversion can be a feature of negative stance-taking. In political interviews, though, eye contact is the default, and gaze is usually systematically averted in acts of stance expression (Ekström 2012). Even though there seems to be strong functional compatibility between gaze aversion and stance-related Tell me about it, Haddington (2006) stresses that gaze aversion never functions on its own, but always needs accompanying words and a sequential context to be interpreted correctly. Thus, more systematic attention needs to be paid to such nonverbal resources to make an informed decision about their cognitive status.[5]

This section has shown that Tell me about it is a potential candidate for a multimodal construction because it frequently recurs with visual and acoustic elements that are functionally compatible. As argued in Section 2, though, frequent recurrence need not result in entrenchment since memory is selective. The present paper therefore reports on an experiment that tests whether the multimodal features listed above are stored as part of the Tell me about it constructions because, if they are, they “should be retrievable under adequate experimental conditions” (Ningelgen and Auer 2017: 1). In addition, the same experiment was also discussed with a focus group to identify strategies for interpreting Tell me about it and whether these features are “even accessible to introspection” (Ningelgen and Auer 2017: 1).

4 Methods

4.1 Experiment

4.1.1 Participants

The experiment was run twice. In the first run, 18 advanced learners of English participated. These were students of English from the University of Bremen, Germany, and participated for course credit. Their first language is German, although some of them also reported speaking Russian and/or Bosnian in addition to German.[6] To be enrolled in the English studies programme at the University of Bremen, students must prove that they know English at least at B2 level (“independent user”) of the Common European Framework of Reference for Languages (Council for Cultural Co-operation 2001). Most of them, however, reported knowing English at C1 level (“proficient user”).

In the second run, 31 native speakers of English, all residents of the USA, participated. They were recruited via Prolific Academic (Palan and Schitter 2018) and received monetary compensation. The responses of six native-speaker participants had to be removed from the dataset, because their performance was considered poor (see Results for details), leaving 25 native-speaker participants.

A first exploration of the results of the experiment (see Section 4.1.5 for details) showed that the responses by the native and non-native speaker participants were sufficiently similar for language proficiency to be a negligible factor. Thus, the responses of both participant groups were lumped together in the analyses.

4.1.2 Procedure

The experiment was created with SoSciSurvey (Leiner 2021). Before the actual experiment started, the two uses of Tell me about it were introduced to the participants, labelling them “requesting information” and “ironic rejoinder”. The label “ironic rejoinder” was preferred over the term “stance-related”, because it is the term used by the Oxford English Dictionary Online (Oxford English Dictionary 2023) and therefore considered more appropriate for participants without prior knowledge of linguistics. Since some of the participants were (advanced) learners of English, the term “ironic rejoinder” was also briefly explained. Participants were then presented with different kinds of stimuli featuring Tell me about it and were forced to choose between the two options when answering the question “What is the meaning of Tell me about it in this example?”

4.1.3 Stimuli

The stimuli for the experiment (N = 69) were instances of Tell me about it extracted from the NewsScape Library of International Television News. Table 1 summarizes the different kinds of stimuli.

Table 1:

Kinds of stimuli used in the experiment.

Condition     Anticipated interpretation     Sample stimulus (see note a)
Context       Requesting (N = 5)
              Stance-expressing (N = 5)
Multimodal    Requesting (N = 5)
              Stance-expressing (N = 4)
              Ambiguous (N = 9)
Visual        Requesting (N = 5)
              Stance-expressing (N = 5)
              Ambiguous (N = 11)
Acoustic      Requesting (N = 5)
              Stance-expressing (N = 4)
              Ambiguous (N = 11)

a To view the samples, visit the OSF project website. The sample stimuli can be found under Files: https://doi.org/10.17605/OSF.IO/K8B47.

These stimuli were presented in different conditions (see sample stimuli in Table 1), in the order provided in Table 1. In the context condition, the participants were given an instance of Tell me about it including previous and following parts of the conversation. In all other conditions, the stimuli were cut so that only Tell me about it could be perceived, without any further sequential context. In the multimodal condition, participants could both hear and see the speaker, while in the visual condition, they could only see (but not hear) them, and in the acoustic condition they could only hear (but not see) them. In the visual condition, the pace of the videos was slowed down, because some of them were so short (i.e., less than 600 ms) that hardly any feature would have been visible, especially since some video players introduce a time lag. In addition, the videos in the visual condition were edited to such an extent that only the speaker’s upper body could be seen so as to avoid participants drawing on visual information other than the speaker’s bodily behaviour (such as the news channel or the recipient’s bodily conduct). Within these conditions, items rotated.

The stimuli were further selected regarding their actual and anticipated interpretation. In the context condition, only unambiguous examples were chosen. This was done for two main reasons: First, to establish a reference level for the statistical analysis, and second, to test whether the participants understood/paid sufficient attention to the task at hand. A third reason of a more practical nature was that there were hardly any ambiguous examples since the sequential context already is quite predictive. In all other conditions, prototypical instances of both requesting and stance-expressing Tell me about it were selected, i.e., instances that the model reported in Lehmann (2023) identified as prototypical, with all features present. For these examples, the actual and the anticipated interpretation were identical. In addition, instances of both requesting and stance-expressing Tell me about it were selected that were at odds with the model prediction, i.e., they only showed some or even none of the features. Their anticipated interpretation was labelled “ambiguous”.

The unique design of the experiment, i.e., that the stimuli were selected observations from the corpus study that needed to fulfil the criteria mentioned above, necessitated some pragmatic problem-solving. For one thing, three of the observations from the corpus study had to be reused[7] in the acoustic condition. It was assumed that reusing observations in the acoustic condition from the multimodal and/or visual condition would be less noticeable to the participants than reusing observations in the visual condition from the multimodal condition (or even the context condition). Judging from sporadic, informal conversations with the participants after the experiment, this procedure was successful. All of the other stimuli occurred only once in the experiment. In addition, combining conditions and anticipated interpretations to find prototypical stimuli suitable for the experiment (while, at the same time, keeping the number of duplicated stimuli at an absolute minimum) resulted in uneven numbers of stimuli across conditions.

A detailed list of stimuli as well as the stimuli themselves can be found in the following repository: https://doi.org/10.17605/OSF.IO/K8B47.

4.1.4 Expected results

The experiment was designed to test whether the multimodal features of requesting and stance-related Tell me about it identified in Lehmann (2023) are, in fact, associated with their respective constructions. If they truly are, the following results should be obtained:

(H1)

The correctness of identifying prototypically requesting and prototypically stance-related stimuli presented in the multimodal condition should not differ from the correctness of identifying stimuli in the context condition.

(H2)

The correctness of responses is lower in both the visual and the acoustic conditions than in the multimodal and context conditions because some multimodal features are missing.

(H3)

More specifically, the model presented in Lehmann (2023) predicts that the correctness of responses in the acoustic condition should be lower than in any other condition, because the speech tempo, which is the most important indicator available in this condition, has the least predictive power in the model.

(H4)

The correctness of responses is generally lower for ambiguous examples than for prototypically requesting and prototypically stance-related stimuli because features are missing.

4.1.5 Statistical analysis

For the statistical exploration of the results of the forced choice experiment, R (R Core Team 2022) was used. The raw results were visualized using the ggplot2 package (Wickham et al. 2023). The glmer function of the lme4 package (Bates et al. 2015) was used to fit a generalized linear mixed-effects model. Correctness of the response (i.e., whether the response was in line with the meaning of this stimulus in the original usage event) was treated as the dependent variable. In the initial model, participant, language proficiency, stimulus, and construction were entered as random intercepts, while condition and anticipated interpretation were entered as fixed effects. This led to problems with convergence. An inspection of the model with the summ function of the jtools package (Long 2022) showed that language proficiency and participant were negligible effects; they were, therefore, removed from the model to simplify its complex structure. No problems with convergence occurred thereafter. The summ function was used to summarize the fitted model, including the computation of confidence intervals, and the plot_model and effect_plot functions of the sjPlot package (Lüdecke 2021) were used to visualize the model. The datasets as well as the R script can be found in the repository mentioned above.
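For illustration, the structure of the final model can be sketched in R as follows. This is a minimal sketch assuming hypothetical column names (correctness, condition, anticipated_interpretation, stimulus, construction) in a data frame responses; the authoritative version is the script deposited in the repository.

```r
# Minimal sketch of the final model described above (hypothetical column
# names; see the OSF repository for the actual script and data).
library(lme4)     # glmer
library(jtools)   # summ
library(sjPlot)   # plot_model

# Binary dependent variable: 1 if the response matched the meaning of the
# stimulus in the original usage event, 0 otherwise. Condition and
# anticipated interpretation are fixed effects; stimulus and construction
# remain as random intercepts after participant and language proficiency
# were dropped from the initial model.
fit <- glmer(
  correctness ~ condition + anticipated_interpretation +
    (1 | stimulus) + (1 | construction),
  data   = responses,
  family = binomial(link = "logit")
)

summ(fit, confint = TRUE)   # model summary with confidence intervals
plot_model(fit)             # visualization of the fitted model
```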

4.2 Focus group

The participants of the focus group were seven students (1 male, 6 female, in their mid-twenties) from the University of Potsdam. They were all advanced learners of English, comparable to the non-native speaker participants of the experiment, and attended a class on Multimodal Construction Grammar at MA level. Because of their participation in this class, they had basic experience with identifying gestures and facial expressions and they were aware that co-speech gestures can be meaningful. However, they had neither been introduced to multimodal construction grammar proper, nor to the multimodality of Tell me about it. As the focus group session happened around mid-term, the students were already familiar with the course instructor (i.e., the author of this paper) and with one another and had already established a discussion-friendly atmosphere in previous sessions. As part of the course, they did the experiment described above; however, their responses were not recorded. Thus, the participants in the experiment are not identical to the participants in the focus group. After having completed the questionnaire, they were informed about the purpose of the experiment and were asked what informed their choice. The students could respond freely, i.e., they were not forced to provide answers, but when they did, the other students were encouraged to discuss these responses. The outcome of their discussions was recorded in a collaborative writing tool: the course instructor summarized the outcome of the discussion in list format and the members of the focus group could edit the list (anonymously) whenever they saw the need to do so.

5 Results

As mentioned earlier, the responses of six participants from the native speaker cohort were removed from further analyses. The stimuli in the context condition were unambiguous and most participants, including all non-native speakers, had no problems in identifying their meanings correctly. However, six native speakers responded on chance level (with a ratio of correct responses around 0.5) and, therefore, their data was considered poor quality and was removed on these grounds. The following results are based on the tidied dataset.
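The exclusion step can be illustrated with a short R sketch; the data frame and column names (responses, participant, condition, correct) as well as the exact cut-off are assumptions for illustration, not the published procedure.

```r
# Minimal sketch, assuming a long-format data frame `responses` with one row
# per participant and stimulus, and hypothetical columns `participant`,
# `condition`, and `correct` (1 = correct, 0 = incorrect).
library(dplyr)

context_accuracy <- responses %>%
  filter(condition == "context") %>%          # unambiguous benchmark stimuli
  group_by(participant) %>%
  summarise(ratio_correct = mean(correct))

# Participants answering at roughly chance level (a ratio around 0.5) are
# treated as low-quality responders; the cut-off used here is illustrative.
excluded <- context_accuracy %>% filter(ratio_correct <= 0.6)

tidy_responses <- responses %>%
  filter(!participant %in% excluded$participant)
```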

Figure 1 illustrates the overall distribution of the ratio of correct guesses.

Figure 1: Grouped boxplot (including jittered points) showing the ratio of correct guesses per condition and anticipated interpretation. Points marked with asterisks indicate outliers.
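A plot of this kind can be reproduced with ggplot2 along the following lines; this is a sketch with a hypothetical data frame (ratios) and column names, not the plotting code used for Figure 1.

```r
# Minimal sketch, assuming a data frame `ratios` with one row per participant,
# condition, and anticipated interpretation, and a column `ratio_correct`.
library(ggplot2)

ggplot(ratios, aes(x = condition, y = ratio_correct,
                   fill = anticipated_interpretation)) +
  geom_boxplot(outlier.shape = 8) +            # outliers drawn as asterisks
  geom_point(position = position_jitterdodge(jitter.width = 0.15),
             alpha = 0.4) +                    # jittered points per group
  labs(x = "Condition", y = "Ratio of correct guesses",
       fill = "Anticipated interpretation")
```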

Table 2 summarizes the fitted model.

Table 2:

Summary of the fitted model. The context condition and requesting Tell me about it act as reference levels and, therefore, do not appear in the table.

Model info:
  Observations: 3010
  Dependent variable: correctness
  Type: mixed effects generalized linear regression
  Error distribution: binomial
  Link function: logit

Model fit:
  AIC = 1997.97, BIC = 2046.05
  Pseudo-R2 (fixed effects) = 0.36
  Pseudo-R2 (total) = 0.64

Fixed effects:
                   Est.    2.5 %   97.5 %   z val.   p
  (Intercept)      5.87    3.46     8.27     4.79    0.00
  Multimodal      −2.03   −3.97    −0.10    −2.06    0.04
  Visual          −4.09   −5.97    −2.20    −4.25    0.00
  Acoustic        −3.79   −5.69    −1.89    −3.91    0.00
  Stance-related   0.74   −0.74     2.22     0.98    0.33
  Ambiguous       −1.06   −2.16     0.04    −1.89    0.06

Random effects:
  Group          Parameter     Std. Dev.
  Stimulus       (Intercept)   1.25
  Construction   (Intercept)   0.97

Grouping variables:
  Group          # Groups   ICC
  Stimulus       70         0.27
  Construction   2          0.16

With a Pseudo-R2 value of 0.64, the fitted model summarized in Table 2 is good, but not excellent, at explaining the variation in guessing the correct meaning.[8] Figure 1 suggests that this is likely because some stimuli in the visual and acoustic conditions that were anticipated to be prototypical were guessed correctly at very low rates. These unanticipated results will be discussed in Section 6. Still, taken together, Figure 1 and Table 2 support hypotheses H1 and H2, but not H3 and H4.

H1 claims that the correctness of identifying prototypically requesting and prototypically stance-related stimuli presented in the multimodal condition should not differ from the correctness of identifying stimuli in the context condition. Table 2 shows that the stimuli presented in the multimodal condition do differ from those in the context condition, but only with borderline significance (p = 0.04). This suggests that, overall, participants had more difficulties in identifying the meaning of Tell me about it when they had no sequential context to rely on. However, the p-values for stance-related (p = 0.33) and ambiguous stimuli (p = 0.06) suggest that this is due to the ambiguous stimuli rather than the stance-related ones. Figure 1 confirms this: In the multimodal condition, the participants had no difficulties identifying the prototypically requesting stimuli (with a median ratio of correct guesses of 1) and hardly any difficulties identifying the prototypically stance-related stimuli (with a median ratio of correct guesses of more than 0.95). Thus, H1 is supported.

The first part of H2 predicts that the correctness of responses is lower in both the visual and the acoustic conditions than in the multimodal and context conditions. Figure 1 provides some initial support for this hypothesis since, overall, the median ratios of correct guesses are lower in the visual and the acoustic condition when compared to the median ratios of correct guesses in the context and the multimodal condition. In other words, participants performed worse in guessing the meaning of Tell me about it when being presented with visual or acoustic information only. The model summarized in Table 2 confirms that guessing the correct meaning of both the visual and the acoustic stimuli is significantly harder than guessing the correct meaning of stimuli presented in context (with both p < 0.01). Figure 2 further substantiates H2. It shows that, when it comes to estimates and confidence intervals, there is hardly any overlap between the context condition and either the visual or the acoustic condition, while there is only some overlap between the multimodal condition and both the visual and acoustic conditions. This suggests that the probability of correctly interpreting Tell me about it decreases when only visual or acoustic information is given. The second part of H2 further claims that the correct guesses are lower in the visual and acoustic condition because some multimodal features are, by definition, already missing (prosodic features in the visual condition and visual features in the acoustic condition). This claim is supported, too. Figure 1 reveals that this difference in guessing the correct meaning can be mainly attributed to the ambiguous stimuli, although the ambiguous stimuli only reach borderline significance (p = 0.06) in the model reported in Table 2. This is because, contrary to expectation, some of the ambiguous stimuli, apparently, did not pose any difficulties to the participants, while some of the prototypical stimuli did. These observations will be discussed in Section 6 in more detail. Still, overall, H2 is supported: if more features are missing, it becomes more difficult for hearers to guess the meaning of Tell me about it.

Figure 2: Effect plot illustrating the estimate and confidence intervals of condition and correct interpretation.

H3 claims that the correctness of responses in the acoustic condition should be lower than in any other condition. Figure 2 shows that while stimuli in the acoustic condition were guessed less often correctly than in the context and the multimodal condition, participants were slightly better in the acoustic condition than in the visual condition. Figure 1 confirms this since the median ratios of correct guesses (independent of the anticipated interpretation) are lower in the acoustic than in the context and multimodal condition. Figure 1 also shows that, in comparison to the visual condition, the participants were better at identifying requesting and stance-related Tell me about it in the acoustic condition (with more than 90 % and 80 % correct guesses, respectively), but worse at identifying the meaning of the ambiguous ones (with only more than 60 % correct guesses). All in all, H3 cannot be supported.

H4 claims that the correctness of responses is generally lower for ambiguous stimuli than for prototypically requesting and prototypically stance-related stimuli. Figure 1 partially supports this claim. It shows that all median ratios of correct guesses are lower for the ambiguous stimuli than for the prototypically requesting and stance-related ones within the same condition. However, the difference in median ratios is greatest in the acoustic condition, but less prominent in the others. The model summarized in Table 2 suggests that the overall difference is only borderline significant (p = 0.06). Figure 3 further illustrates the difference. It shows that, overall, the participants were fairly good at guessing the meaning of Tell me about it. For the ambiguous stimuli, the estimated chance of guessing correctly is lower than for the other stimuli and the confidence interval is much larger, i.e., the difference might be significant, but the estimates do not fully support H4. Returning to Figure 1, it can be assumed that the differences observed can be attributed to the acoustic condition. Thus, H4 is not (fully) supported.

Figure 3: Effect plot illustrating the estimate and confidence intervals of anticipated and correct interpretation.

In addition to the experimental results, a focus group was asked to do the experiment and discuss the features that informed their choice. Their discussion resulted in a list of (unstructured) features, which was drafted by the present author and edited (anonymously) by the members of the focus group in a collaborative writing tool. The author assigned the features of the list to categories, i.e., acoustic features, visual features, contextual features, and other, summarized here as Table 3 for convenience’s sake.

All features mentioned by the participants in the focus group were included in the original list (and in Table 3) as long as the majority did not object to them. For example, one participant reported that they often felt overwhelmed and either used their gut feeling, contextual pieces of information, or chose requesting Tell me about it in case of doubt. Some participants confirmed having used the latter strategy, too, but others reported having used prosodic and bodily-visual features throughout. None of the features presented in Table 3 were actively confirmed by all participants.

Table 3:

Features the focus group reported having used in the forced choice experiment.

                    Requesting Tell me about it                      Ironic Tell me about it
Acoustic features   “questioning” intonation, rising pitch, faster   “assertive” intonation, falling pitch, slower
Visual features     “pointing gestures” (with hand, head or chin)    “closed body language”, head turn, lifted eyebrows, rolling eyes, laughing/smiling
Context             News format, CNN news anchor                     FOX News, comedian
Other               Preferred option in case of doubt

6 Discussion

In this section, the results of the experiment regarding the hypotheses that were formulated are discussed. The implications of these findings for the notion of the multimodal construction are discussed in Section 7.

The first hypothesis, i.e., that stimuli that are multimodally marked as either requesting or stance-related can easily be identified as such, has been confirmed experimentally. This means that the participants of the experiment were able to identify Tell me about it correctly even when it was not embedded in a sequential context. Since they had no other features than the visual and acoustic ones to base their interpretation on, it can be concluded that there must be some kind of association between requesting and stance-related Tell me about it and the multimodal features they recur with, respectively. Some participants of the focus group reported that they based their interpretation partly on the speaker and the channel on which the stimulus was aired. More specifically, the participants reported that if they knew that the speaker of Tell me about it was a comedian or if the stimulus was aired on FOX News, they tended to interpret the stimulus as stance-related rather than requesting. While drawing on supposed knowledge like this is a confounding variable, its impact can be considered negligible, since, first, not all of the participants of the focus group reported having used such a strategy and, second, all of them agreed that other features seemed more important and that such a strategy was only chosen in case of doubt. Therefore, most of the responses can be assumed to be based on the multimodal display of Tell me about it and its association with these features. Unfortunately, the design of the experiment does not allow any conclusions about the exact nature of this association, but possibilities are discussed in Section 7.

The results regarding the second and the fourth hypothesis will be discussed together because similar conclusions can be drawn from them. Taken together, the two hypotheses stated that participants will perform worse the more features are missing. Regarding the participants’ (worse) performance in the visual and the acoustic conditions (H2), this could be confirmed, but regarding their performance for ambiguous stimuli in general (H4), the results are ambivalent. From these results, it can be concluded that the participants applied some form of probabilistic interpretation strategy: the more features supported their interpretation, the easier it was for them to decide. Furthermore, it can be concluded that some features were supportive enough that participants had no difficulties at all. As explained above, the nature of the experiment does not allow any definite conclusions on the kinds of features that are highly supportive, but two exceptional stimuli will be discussed in some detail here.

The first stimulus (VI04), which was presented in the visual condition, does not show all the multimodal features of stance-related Tell me about it and is, therefore, not considered prototypical, but was largely interpreted correctly. Figure 4 provides a screenshot of this stimulus. It shows that the speaker smiles and holds her head in a tilted position, which supports the interpretation of being stance-related. Other features of prototypically marked stance-related Tell me about it, however, are missing. The stimulus does not allow the participants to see the gaze direction of the speaker precisely, but her eyes at least seem to be directed at the recipient. In addition, her eyebrows stay in a neutral rest position throughout the snippet. Having been presented in the visual condition, the participants further lacked acoustic features to draw on. Even though the participants could only rely on two features, the ratio of correct responses was very high nonetheless (40 out of 43 participants identified it correctly). Given this finding, it seems that some features (here: smiling and/or a tilted head) are sufficiently good predictors of stance-related Tell me about it and make up for the missing ones.

Figure 4: Screenshot of an ambiguous stimulus presented in the visual condition.

The second stimulus (MU10), which was presented in the multimodal condition, shows features of both stance-related and requesting Tell me about it and was therefore considered ambiguous. It is, in fact, an example of stance-related Tell me about it, but the results show that many participants identified it as requesting (only 11 out of 43 participants identified this stimulus correctly). Figure 5 shows a screenshot of this stimulus illustrating some of its multimodal features. As can be seen in Figure 5, the speaker (to the right) uses two features of stance-related Tell me about it, i.e., raised eyebrows and a smile, and two features typical of requesting Tell me about it, i.e., gaze directed at the recipient and a still head. In addition, with a duration of only 407 ms, the tempo of this stimulus is rather fast and, thus, may also serve as an indicator of requesting Tell me about it. It seems that, in this case, the fast tempo plus the eye contact with the interlocutor and the still head overrode the visual features of smiling and raised eyebrows. As mentioned earlier, the experiment reported here does not allow any conclusions on the exact weighting of individual features, but regarding these two examples, it seems safe to conclude that smiling is not a sufficiently good predictor from the participants’ perspective, although it is statistically significant.

Figure 5: Screenshot of an ambiguous stimulus presented in the multimodal condition. The speaker is the person to the right.

A similar conclusion can be drawn from the third hypothesis, which claimed that the stimuli presented in the acoustic condition must be the most difficult ones to identify correctly, because the only feature to base responses on was speech tempo. However, this could not be supported. The results suggest that speech tempo alone is a good predictor of stance-related Tell me about it and that it is at least as good at predicting stance-related Tell me about it as the features used in the visual condition. Therefore, even though the model of stance-related Tell me about it presented in Lehmann (2023), which was based on frequency observations, suggests that speech tempo is a weaker predictor, the results of the experiment suggest that it is at least as good a predictor as the visual features taken together and may even outweigh them (see discussion of stimulus MU10 above). Consequently, just because a non-spoken feature is used more frequently by speakers, hearers need not rank it higher when interpreting Tell me about it.

There might be another reason, however, why the participants, overall, performed worse in the visual condition than in the acoustic condition. As mentioned above, due to technical reasons, the pace of the visual stimuli had to be slowed down. This might have influenced the participants' choices in the experiment, not least because of the speech tempo they assumed. Even though the participants were informed that the videos were slower than their originals, they might still have made assumptions about the speech tempo (which was not perceptually available to them). This might explain why they were worst at identifying prototypically requesting Tell me about it in the visual condition: Because the video seemed longer, they might have assumed the (original) utterance to be longer, too.

Apart from discussing these hypotheses, the unanticipated results regarding stimuli that were categorized as prototypical call for systematic attention. For example, the stimulus AC07, which was presented in the acoustic condition, is 792 ms long and, thus, according to the model presented in Lehmann (2023), sufficiently long in duration to be identified as stance-related. The results of the experiment show, though, that with only 24 correct responses out of 43 possible, this stimulus seemed ambiguous to the participants. Some of the other acoustically prototypical, stance-related stimuli, i.e., AC09 and AC11, are similar in duration, but were identified as stance-related. Such a finding implies that there are probably more acoustic features associated with stance-related Tell me about it than duration alone. This is also, in part, confirmed by the focus group. Given that the participants had no prior training in speech prosody, it is striking that they were able to name speech tempo as a feature. They also reported that “intonation” and pitch were different for the two constructions, but they were not able to name the differences explicitly (which is a result of their lack of prosodic training). As argued in Section 3, this possibility of further acoustic features being associated with Tell me about it increases the likelihood of the presence of a prosodic construction associated with Tell me about it.
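The reading of these raw counts can be made explicit with exact binomial tests against a 50 % guessing baseline, assuming the two-alternative choice between the requesting and the stance-related reading; this is an illustrative check of the figures discussed above (VI04, MU10, AC07), not part of the reported analysis.

    ## Illustrative only: per-stimulus accuracy against a guessing baseline of 0.5
    binom.test(x = 40, n = 43, p = 0.5)  # VI04 (visual): clearly above chance
    binom.test(x = 11, n = 43, p = 0.5)  # MU10 (multimodal): significantly below
                                         # chance, i.e., systematic misreading as
                                         # requesting rather than random guessing
    binom.test(x = 24, n = 43, p = 0.5)  # AC07 (acoustic): not distinguishable
                                         # from chance, consistent with "ambiguous"

On this reading, MU10 was not merely confusing but was systematically assigned to the other construction, whereas AC07 sits at chance level, which is what the label "ambiguous" implies.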

Apart from the results reported above, the results of the focus group offer further insights into the possible entrenchment and conventionalization of the multimodal features of Tell me about it. The participants in the focus group act as naïve informants on which aspects of the stimuli are used to interpret Tell me about it. If the participants had reported, as one of them in fact did, that they were overwhelmed by the task and only used their gut feeling, contextual information (TV personality, TV program), and other strategies, but not multimodal features, this would have presented some evidence against Tell me about it being a multimodal construction, because the multimodal features would not even have been considered relevant for the interpretation. In such a case, the multimodal features would be subsumed under “gut feelings” and only a weak association could be assumed. But not all participants reported this. Even though some of the reported observations are rather vague, e.g., “closed body language”, and did not become any more specific even after further inquiry, it is interesting to note that almost all the features that were tested were also explicitly mentioned as features for interpreting Tell me about it, i.e., speech tempo, raised (“lifted”) eyebrows, smiling and laughing, and head movements (“head turn”), by the majority of participants. This means, in turn, that the associations between the multimodal features and Tell me about it are strong enough to exceed the level of “gut feelings” and can be named explicitly. One exception is gaze aversion, which was not mentioned. Instead, the participants claimed that the speakers of stance-related Tell me about it “roll their eyes”. Since eye-rolling is one way to avert gaze, this can nonetheless be treated as a related feature. Overall, these results show that the multimodal features used with Tell me about it reach some level of consciousness for some language users. If this were not so, the participants would not have been able to name them at all. The fact that the participants of the focus group received some basic training in gesture studies certainly helped them focus on multimodality and voice their observations in detail. But they were not biased in any direction, and yet most of their observations confirm the multimodal features identified in Lehmann (2023), while also identifying new features (e.g., pointing gestures used with requesting Tell me about it).

7 General discussion

The first main takeaway from the study above is that frequency observations alone are insufficient for measuring the entrenchment and conventionalization of (possibly) multimodal constructions. The experiment shows that even if frequently recurring features of stance-related Tell me about it (like raised eyebrows) are present, hearers may find the utterance ambiguous. Vice versa, both the experiment and the results of the focus group show that some features which might be used less frequently with Tell me about it, such as (slower) speech tempo, are good predictors and reach some level of conscious acknowledgement. Moreover, both the experiment and the focus group study suggest that the multimodal features that are frequently used with Tell me about it are entrenched by some, but possibly not all, language users, since the experiment did not confirm all features of Tell me about it from the corpus study and not all participants in the focus group discussion mentioned them. What is striking nevertheless is that the participants, who need not know the speakers from the corpus study[9] (who, in turn, served as speakers in the experimental stimuli), were, often enough, able to interpret Tell me about it correctly based on multimodal features alone (provided that sufficiently many were present). This observation might speak for the claim that these multimodal features are conventionalized in the (American English) speech community. As already argued elsewhere for unimodal constructions (Gries 2003), pieces of evidence other than frequency are necessary to draw conclusions on the entrenchment of constructions.

However, the question remains whether requesting and stance-related Tell me about it are multimodal constructions. As shown above, one way to identify a multimodal construction is to identify recurring features that are not independent constructions themselves. It was also argued in Section 3 that gaze aversion and speech tempo are the only likely candidates for construction-dependent patterns. The discussion of the responses regarding stimulus MU10 has shown that either of them, or both features together, seem to be strong predictors of either construction. On the other hand, only tempo was mentioned explicitly by the focus group. As of now, the evidence speaks in favour of Tell me about it being multimodal constructions with gaze behaviour and speech tempo as largely optional, but entrenched features that are an integral part of the constructions. However, this preliminary conclusion needs to be treated with a grain of salt: Just because these two features cannot reasonably be described as independent constructions on the evidence available so far does not mean that they are not, in fact, such constructions. More empirical research on the systematicity of gaze and prosodic constructions is necessary to arrive at a final conclusion.

Regarding the other features associated with stance-related Tell me about it, i.e., head movements, raised eyebrows, and smiling, both the results of the focus group and the experiment have shown that these are (moderately) associated with the construction. When in conflict with other features, they may be outweighed, as they seemed to be in the experiment, but the focus group explicitly mentioned them as features of stance-related Tell me about it. As argued in Section 3, these features can be described as independent unimodal constructions. Therefore, it seems plausible to assume that stance-related Tell me about it forms a cross-modal collostruction with these non-spoken constructions. However, this conclusion does not exclude the view that stance-related Tell me about it is a multimodal construction. The experiment has shown that without any supportive feature, participants could not assign Tell me about it to a construction when confronted with it out of context. It also showed that the more features are used, the more confident hearers become in categorizing Tell me about it. This points in the direction of stance-related Tell me about it having a schematic slot for multimodal features. This schematic slot allows several non-spoken constructions (or other patterns) to be specified, while not all of them need to be realized on a given occasion. The slot itself, though, seems to be obligatory, since realizing none of the non-spoken features is possible only in highly supportive contexts.

8 Summary and conclusion

The present paper has argued that requesting and stance-related Tell me about it are multimodal constructions. To this end, the strength of association between the spoken form and the multimodal features that recur with both constructions was tested experimentally. The results have shown that while frequency cannot be equated with entrenchment, the features that recur with Tell me about it are both retrievable under experimental conditions and consciously accessible to (some) language users. What is more, the experiment has shown that, when these features are missing, language users cannot identify the construction unambiguously. These results provisionally point in the direction of Tell me about it being constructions with an obligatory multimodality slot and, therefore, multimodal constructions.

Apart from this main conclusion, the present paper offers further insights. First, a “deletion test” to identify multimodal constructions, as proposed in Ziem (2017), needs refinement. More often than not, previous studies on multimodal constructions revolved around the patterning of one spoken and one non-spoken form. Both the corpus and the experimental results on Tell me about it show, however, that there might be more than one non-spoken feature associated with a spoken form and that deleting one might not lead to a failure of interpretation, provided that other features are sufficiently predictive. At the same time, deleting all (multimodal and contextual) features would likely result in a communicative breakdown. Although the experiment reported here strongly suggests such a prediction, it was not tested directly. What is more, the experiment presented here involved quite a few confounds (video pace in the visual condition, other pieces of information like the TV program), which might have had an impact on the results, but they cannot be exclusively responsible for the main finding that language users also seem to rely on knowledge about non-spoken constructions. A second further conclusion is that linguistically informed research on non-verbal constructions is still warranted. There are notable works in this direction, especially regarding prosodic constructions and gestures, but the picture remains too fragmented to draw definite conclusions. And, third, the fact that a visual or prosodic construction may form a cross-modal collostruction with a spoken form does not exclude this form from being part of a multimodal construction.


Corresponding author: Claudia Lehmann, University of Potsdam, Potsdam, Germany, E-mail:

Data availability statement

The dataset, the R script and the stimuli used in the current study are openly available on OSF (see Files in the registered project entitled Multimodal constructions revisited): https://osf.io/k8b47.

References

Bateman, John. 2022. Growing theory for practice: Empirical multimodality beyond the case study. Multimodal Communication 11(1). 63–74. https://doi.org/10.1515/mc-2021-0006.

Bateman, John, Janina Wildfeuer & Tuomo Hiippala. 2017. Multimodality: Foundations, research and analysis – a problem-oriented introduction. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110479898.

Bates, Douglas, Martin Mächler, Ben Bolker & Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67(1). 1–48. https://doi.org/10.18637/jss.v067.i01.

Bülow, Lars, Marie-Luis Merten & Michael Johann. 2018. Internet-Memes als Zugang zu multimodalen Konstruktionen. Zeitschrift für Angewandte Linguistik 69. 1–32. https://doi.org/10.1515/zfal-2018-0015.

Colston, Herbert L. 2020. Eye-rolling, irony and embodiment. In Angeliki Athanasiadou & Herbert L. Colston (eds.), The diversity of irony, 211–235. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110652246-010.

Council for Cultural Co-operation; Education Committee; Modern Languages Division. 2001. Common European Framework of Reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.

Dancygier, Barbara & Lieven Vandelanotte. 2017. Internet memes as multimodal constructions. Cognitive Linguistics 28(3). 565–598. https://doi.org/10.1515/cog-2017-0074.

Debras, Camille & Alan Cienki. 2012. Some uses of head tilts and shoulder shrugs during human interaction, and their relation to stancetaking. Paper presented at the International Conference on Privacy, Security, Risk and Trust and the International Conference on Social Computing, Amsterdam, 3–5 September. https://doi.org/10.1109/SocialCom-PASSAT.2012.136.

Degutyte, Ziedune & Arlene Astell. 2021. The role of eye gaze in regulating turn taking in conversations: A systematized review of methods and findings. Frontiers in Psychology 12. 1–22. https://doi.org/10.3389/fpsyg.2021.616471.

Divjak, Dagmar. 2019. Frequency in language: Memory, attention and learning. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316084410.

Dix, Carolin & Alexandra Groß. 2021. Raising both eyebrows in interaction. Paper presented at the 17th International Pragmatics Conference, Winterthur and online, 27 June – 2 July 2021. https://event.ipra2021.exordo.com/presentation/1123/raising-both-eyebrows-in-interaction.

Ekström, Mats. 2012. Gaze work in political media interviews. Discourse & Communication 6(3). 249–271. https://doi.org/10.1177/1750481312452200.

Fauconnier, Gilles & Mark Turner. 2002. The way we think: Conceptual blending and the mind’s hidden complexities. New York: Basic Books.

Feyaerts, Kurt, Geert Brône & Bert Oben. 2017. Multimodality in interaction. In Barbara Dancygier (ed.), The Cambridge handbook of cognitive linguistics, 135–156. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316339732.010.

Feyaerts, Kurt, Christian Rominger, Helmut Karl Lackner, Geert Brône, Annelies Jehoul, Bert Oben & Ilona Papousek. 2022. In your face? Exploring multimodal response patterns involving facial responses to verbal and gestural stance-taking expressions. Journal of Pragmatics 190. 6–17. https://doi.org/10.1016/j.pragma.2022.01.002.

Gironzetti, Elisa, Salvatore Attardo & Lucy Pickering. 2016. Smiling, gaze, and humor in conversation: A pilot study. In Leonor Ruiz-Gurillo (ed.), Metapragmatics of humor: Current research trends, 235–256. Amsterdam: Benjamins. https://doi.org/10.1075/ivitra.14.12gir.

Gironzetti, Elisa, Salvatore Attardo & Lucy Pickering. 2019. Smiling and the negotiation of humor in conversation. Discourse Processes 56(7). 496–512. https://doi.org/10.1080/0163853X.2018.1512247.

Goldberg, Adele E. 2019. Explain me this: Creativity, competition, and the partial productivity of constructions. Princeton: Princeton University Press. https://doi.org/10.2307/j.ctvc772nn.

Gras, Pedro & Wendy Elvira-García. 2021. The role of intonation in Construction Grammar: On prosodic constructions. Journal of Pragmatics 180. 232–247. https://doi.org/10.1016/j.pragma.2021.05.010.

Gries, Stefan Th. 2003. Towards a corpus-based identification of prototypical instances of constructions. Annual Review of Cognitive Linguistics 1(1). 1–27. https://doi.org/10.1075/arcl.1.02gri.

Haddington, Pentti. 2006. The organization of gaze and assessments as resources for stance taking. Text & Talk 26(3). 281–328. https://doi.org/10.1515/text.2006.012.

Hilpert, Martin. 2019. Construction Grammar and its application to English, 2nd edn. Edinburgh: Edinburgh University Press. https://doi.org/10.1515/9781474433624.

Hoffmann, Thomas. 2021. Multimodal construction grammar: From multimodal constructs to multimodal constructions. In Xu Wen & John R. Taylor (eds.), The Routledge handbook of cognitive linguistics, 78–92. New York: Routledge. https://doi.org/10.4324/9781351034708-6.

Hoffmann, Thomas. 2022. Construction grammar: The structure of English. Cambridge: Cambridge University Press.

Jehoul, Annelies, Geert Brône & Kurt Feyaerts. 2017. The shrug as marker of obviousness. Linguistics Vanguard 3(s1). 1–9. https://doi.org/10.1515/lingvan-2016-0082.

Kendon, Adam. 2002. Some uses of the head shake. Gesture 2(2). 147–182. https://doi.org/10.1075/gest.2.2.03ken.

Kendon, Adam. 2004. Gesture: Visible action as utterance. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511807572.

Lehmann, Claudia. 2023. Multimodal markers of irony in televised discourse: A corpus-based approach. In Lucien Brown, Iris Hübscher & Andreas H. Jucker (eds.), Multimodal im/politeness: Signed, spoken, written. Amsterdam: Benjamins. https://doi.org/10.1075/pbns.333.09leh.

Lehmann, Claudia. 2024. What makes a multimodal construction? Evidence for a prosodic mode in spoken English. Frontiers in Communication 9. 1–15. https://doi.org/10.3389/fcomm.2024.1338844.

Lehmann, Claudia & Alexander Bergs. 2021. As if irony was in stock: The case of constructional ironies. Constructions and Frames 13(2). 309–339. https://doi.org/10.1075/cf.00053.leh.

Leiner, Dominik. 2021. SoSci Survey (Version 3.2.31) [Computer software]. https://www.soscisurvey.de.

Levinson, Stephen C. 2006. Deixis. In Laurence R. Horn & Gregory Ward (eds.), The handbook of pragmatics, 97–121. Malden: Blackwell. https://doi.org/10.1002/9780470756959.ch5.

Long, Jacob A. 2022. jtools (Version 2.2.0) [Computer software]. https://CRAN.R-project.org/package=jtools.

Lüdecke, Daniel. 2021. sjPlot (Version 2.8.10) [Computer software]. https://cran.r-project.org/package=sjPlot.

Marandin, Jean-Marie. 2006. Contours as constructions. Constructions 10(s1). 1–28. https://doi.org/10.24338/cons-448.

McClave, Evelyn Z. 2000. Linguistic functions of head movements in the context of speech. Journal of Pragmatics 32(7). 855–878. https://doi.org/10.1016/s0378-2166(99)00079-x.

McFadden, Daniel. 1979. Quantitative methods for analysing travel behaviour of individuals. In David A. Hensher & Peter R. Stopher (eds.), Behavioural travel modelling, 279–318. New York: Routledge. https://doi.org/10.4324/9781003156055-18.

Mittelberg, Irene. 2017. Multimodal existential constructions in German: Manual actions of giving as experiential substrate for grammatical and gestural patterns. Linguistics Vanguard 3(s1). 1–14. https://doi.org/10.1515/lingvan-2016-0047.

Ningelgen, Jana & Peter Auer. 2017. Is there a multimodal construction based on non-deictic so in German? Linguistics Vanguard 3(s1). 1–15. https://doi.org/10.1515/lingvan-2016-0051.

Oxford English Dictionary. 2023. Oxford: Oxford University Press.

Palan, Stefan & Christian Schitter. 2018. Prolific.ac—A subject pool for online experiments. Journal of Behavioral and Experimental Finance 17. 22–27. https://doi.org/10.1016/j.jbef.2017.12.004.

Payrató, Lluís & Ignasi Clemente. 2020. Gestures we live by: The pragmatics of emblematic gestures. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9781501509957.

Perniss, Pamela. 2018. Why we should study multimodal language. Frontiers in Psychology 9. 1–5. https://doi.org/10.3389/fpsyg.2018.01109.

Põldvere, Nele & Carita Paradis. 2020. ‘What and then a little robot brings it to you?’ The reactive what-x construction in spoken dialogue. English Language and Linguistics 24(2). 307–332. https://doi.org/10.1017/S1360674319000091.

R Core Team. 2022. R: A language and environment for statistical computing (Version 4.2.1) [Computer software]. R Foundation for Statistical Computing. https://www.r-project.org/.

Rossano, Federico. 2013. Gaze in conversation. In Jack Sidnell & Tanya Stivers (eds.), The handbook of conversation analysis, 308–329. Malden: Wiley-Blackwell. https://doi.org/10.1002/9781118325001.ch15.

Sadat-Tehrani, Nima. 2010. An intonational construction. Constructions 3. 1–13.

Schmid, Hans-Jörg. 2015. A blueprint of the entrenchment-and-conventionalization model. Yearbook of the German Cognitive Linguistics Association 3. 3–25. https://doi.org/10.1515/gcla-2015-0002.

Schmid, Hans-Jörg. 2020. The dynamics of the linguistic system: Usage, conventionalization, and entrenchment. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780198814771.001.0001.

Schoonjans, Steven. 2017. Multimodal construction grammar issues are construction grammar issues. Linguistics Vanguard 3(s1). 1–8. https://doi.org/10.1515/lingvan-2016-0050.

Schoonjans, Steven. 2018. Modalpartikeln als multimodale Konstruktionen: Eine korpusbasierte Kookkurrenzanalyse von Modalpartikeln und Gestik im Deutschen. Berlin: De Gruyter. https://doi.org/10.1515/9783110566260.

Steen, Francis & Mark Turner. 2013. Multimodal construction grammar. In Mike Borkent, Barbara Dancygier & Jennifer Hinnell (eds.), Language and the creative mind, 255–274. Stanford: CSLI Publications.

Stefanowitsch, Anatol & Stefan Th. Gries. 2003. Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8(2). 209–243. https://doi.org/10.1075/ijcl.8.2.03ste.

Tabacaru, Sabina. 2019. A multimodal study of sarcasm in interactional humor. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110629446.

Tabacaru, Sabina. 2020. Faces of sarcasm: Exploring raised eyebrows with sarcasm in French political debates. In Angeliki Athanasiadou & Herbert L. Colston (eds.), The diversity of irony, 256–277. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110652246-012.

Tabacaru, Sabina & Maarten Lemmens. 2014. Raised eyebrows as gestural triggers in humour: The case of sarcasm and hyper-understanding. European Journal of Humour Research 2(2). 11–31. https://doi.org/10.7592/EJHR2014.2.2.tabacaru.

Turner, Mark. 2010. Conceptual integration. In Dirk Geeraerts & Hubert Cuyckens (eds.), The Oxford handbook of cognitive linguistics, 377–393. Oxford: Oxford University Press.

Uhrig, Peter. 2022. Hand gestures with verbs of throwing: Collostructions, style and metaphor. Yearbook of the German Cognitive Linguistics Association 10. 99–120. https://doi.org/10.1515/gcla-2022-0006.

Vigliocco, Gabriella, Pamela Perniss & David Vinson. 2014. Language as a multimodal phenomenon: Implications for language learning, processing and evolution. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 369(1651). 20130292. https://doi.org/10.1098/rstb.2013.0292.

Vulchanova, Mila (ed.). 2024. About this section. https://www.frontiersin.org/journals/communication/sections/language-communication/about.

Ward, Nigel G. 2019. The prosodic patterns of English conversation. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316848265.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani & Dewey Dunnington. 2023. ggplot2 (Version 3.4.2) [Computer software]. https://CRAN.R-project.org/package=ggplot2.

Ziem, Alexander. 2017. Do we really need a multimodal construction grammar? Linguistics Vanguard 3(s1). 1–9. https://doi.org/10.1515/lingvan-2016-0095.

Zima, Elisabeth. 2017. On the multimodality of [all the way from X PREP Y]. Linguistics Vanguard 3(s1). 1–12. https://doi.org/10.1515/lingvan-2016-0055.

Received: 2023-08-24
Accepted: 2024-06-15
Published Online: 2024-07-12
Published in Print: 2024-08-27

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
