Abstract
The question-answer adjacency pair constitutes the most frequent sequence organization in online medical consultations (hereafter OMCs). Previous studies of OMCs have primarily examined text-based interactions. In live-streamed free online medical consultations (hereafter FOMCs), however, doctors appear on camera and display paralinguistic cues, enabling investigation of the multimodal realization of their response discourse. Drawing on a self-built small-scale multimodal corpus of live-streamed FOMCs, this study adopts the Multimodal Interaction Analysis framework (Norris 2004. Analyzing Multimodal Interaction: A Methodological Framework. Routledge) and Epistemic Gradient Theory (Heritage 2013. “Epistemics in Conversation.” In The Handbook of Conversation Analysis, 370–94. Wiley-Blackwell), and employs ELAN for quantitative annotation of approximately 9 h of video data from eight doctors. The analysis focuses on five modal layers of doctors’ nonverbal responses: gaze, head movements, vocal particles, body postures, and gestures. The findings reveal that doctors mobilize multimodal resources to accomplish three key interactional tasks: displaying uptake of patient information, managing projections in patients’ questions, and projecting properties of upcoming turns. Notably, doctors frequently employ an information maximization strategy, elaborating beyond the immediate scope of the question and supporting their talk with listing gestures or embodied demonstrations using anatomical models or their own body parts. These findings may contribute to improving communicative efficiency in online consultations and offer practical implications for developing multimodal communication training programs for online medical professionals.
1 Introduction
Linguistic research on online medical consultations (OMCs) has flourished over the past decade. Scholars have examined doctors’ professional identities and discourse strategies (Mao and Zhao 2019, 2020; Wang et al. 2021), the dynamic power relations between doctors and patients (Zhang 2021), and the structural organization of OMCs as institutional interactions in digital environments (Ren and Li 2023). In recent years, there has been a noticeable shift from studying consultations as a whole to investigating patients’ own discourse (Zhao and Yuan 2023) and identity construction (Ren and Li 2025; Zhao and Mao 2025). In other words, linguistic research on OMCs has itself become more patient-centered. This shift can be attributed to two main factors: first, technological advancements have accelerated the adoption of online medical consultations; and second, text-based corpora are relatively easy to collect and transcribe.
However, this convenience has also led scholars to overlook non-textual forms of OMCs, particularly those conducted through audio or video. Recently, several medical consultation platforms have introduced live-streamed sessions for free online medical consultations (henceforth FOMCs), which, while serving the public interest, have also opened new possibilities for investigating OMCs from a multimodal perspective. Nevertheless, due to privacy regulations, patients can only type their questions in the chat box, and only the doctor, as the live-streamer, appears on camera. Consequently, it can be challenging to determine which patient the doctor is responding to, especially when the chat box contains a high volume of consultation messages.
Through close observation of a large body of live-streamed content, we found that in some FOMCs, doctors’ assistants speak on behalf of the patients while simultaneously managing the flow of the live broadcast, for example by selecting patients’ questions and reading them aloud. Consequently, doctors communicate at once with anonymous, digitally mediated patients and with familiar, physically co-present assistants. The multiplicity, improvisation, and complexity of these interactions constitute the entry point of this study: assistant-mediated FOMCs. Given that only the doctors are visible on screen, this study focuses on their nonverbal responses in coordination with their verbal ones.
From the verbal perspective, doctors usually withhold their responses until the assistant finishes repeating a question posted by an anonymous patient. In this way, an interaction in FOMCs is achieved through turn-taking between the doctor and the assistant, and its sequential organization conforms to the “question-answer” adjacency pair. From the multimodal perspective, however, the doctor begins responding to the question through nonverbal resources almost the instant the assistant commences reading it, as shown in Extract 1 (A stands for assistant and D for doctor, similarly hereinafter):
Extract 1 (Corpus ID: OP_D_X18)

Figure 1 Gaze and Facial Expression of the Doctor

Figure 2 Line 2 of question 18
((Due to space constraints, the rest of the conversation is omitted here))
As shown in Figure 1, the doctor shifted her gaze to the camera less than 0.1 s after the assistant started speaking. She began smiling when she heard the words “the next”, which signaled that the assistant was about to repeat a question from the chat box. In the FOMC context, consultants obtain the doctor’s visual information through the screen; in other words, the camera can be regarded as the consultant’s “electronic prosthetic eye.” At the very moment the doctor consciously looked at the camera and smiled, therefore, she had already begun to respond to the consultant’s question via multimodal resources.
In Figure 2, the doctor’s nonverbal responses can also be observed while the assistant read the question verbatim. Approximately 0.2 s after the assistant read “This one says”, the doctor slightly turned her head and shifted her gaze toward the assistant, responding to the latter. When the assistant read “hello, Doctor”, she shifted her gaze back to the camera in response to the consultant’s greeting. She then nodded three times as the assistant read out the consultant’s age.
The two lines of conversation above span only the first 5 s of the consultation. During this period, the assistant spoke approximately 20 Chinese characters, while the doctor simultaneously produced about 7 responses through three types of nonverbal resources, among which gaze shifts were the most frequent. This is because the initiator and beneficiary of the consultation is the patient, yet the impromptu interlocutor is the assistant. The doctor must therefore not only answer the semantic content of the patient’s questions but also respond to the pragmatic demands of the assistant’s speech, which accounts for the complexity of this interaction.
To this end, this study addresses three questions:
What multimodal resources do doctors use as responses in FOMCs?
How are these multimodal resources coordinated with verbal responses?
For what purposes do doctors employ these multimodal resources?
2 Theoretical Framework
In a single FOMC webcast, two higher-level actions, i.e. consultations with patients and the live broadcast conducted in collaboration with the assistant, proceed simultaneously, with several instances of the former constituting the latter. This study will therefore first investigate how doctors construct their responsive turns via multimodal resources under the framework of the epistemic gradient (Heritage 2013), and then follow the multimodal interaction framework (Norris 2004) to explore the relationship between these two higher-level actions.
2.1 The Epistemic Gradient Between Doctors and Patients
When two speakers have different levels of knowledge about a domain, they form an epistemic gradient in that domain, with each party occupying a different epistemic status (Heritage 2013). Speakers adjust their actual speech based on their judgment of epistemic differences in conversation, thereby displaying their own epistemic stance. Generally, doctors possess more professional knowledge, while patients have a better understanding of their own medical history and experiences (Ma and Gao 2018). Two epistemic gradients therefore need to be balanced in doctor-patient conversation: with respect to the patient’s personal situation, doctors are less knowledgeable (K-) while patients are more knowledgeable (K+); with respect to disease analysis and treatment plans, the two parties occupy the opposite positions. Only when doctors have a good grasp of the patient’s medical history and condition can they give relatively appropriate responses. We therefore mainly explore how doctors use multimodal resources to prepare, design, and construct their responsive turns in FOMCs.
2.2 Foreground-Background Continuum
The expressions achieved through communicative modes, such as utterances, gestures, or postural shifts, are low-level actions, which are chained together to perform the higher-level actions that participants are engaged in, such as meeting with friends or chatting on the phone (Norris 2004). In multimodal interaction, modal density comprises modal intensity, the importance of a mode in the interaction, and modal complexity, the degree to which the interaction is co-constructed through multiple modes (Zhang and Wang 2016). When speakers participate in more than one higher-level action simultaneously, their attention may shift back and forth between these actions, forming a foreground-background continuum. The FOMC data show that doctors simultaneously participate in two higher-level actions, namely the consultation with patients and the co-live broadcasting with assistants. We will therefore apply the modal density foreground-background continuum to analyze which modes doctors employ to perform these two higher-level actions.
3 Data and Method
3.1 Data Collection
The data were collected from Haodaifu Zaixian (literally, ‘Good Doctor Online’, https://www.haodf.com/), one of the best-known OMC platforms in China. As of October 2025, the platform hosts approximately 290,000 real-name certified doctors from over 10,000 hospitals across China. It has launched a live-streaming section in which approximately 10–30 doctors conduct free online medical consultation live streams every day. The duration of a single live stream generally ranges from 20 min to 2 h, depending on the doctor’s work schedule and platform arrangements. The number of questions raised by patients also varies from dozens to hundreds per session.
The corpus collected in this study meets the following criteria: first, to ensure the visibility of multimodal resources, the doctor’s upper body must be clearly visible on camera; second, the live stream must be conducted jointly by the doctor and his/her assistant(s); third, the specific text of the consultants’ questions must have been preserved, and the doctor must have provided a relatively detailed answer to each question.
Based on the above three criteria and under the premise of data sampling by individual or group (Herring 2004), we first screened approximately 300 live streams over two weeks, yielding an initial selection of 32 live streams with a total duration of about 37 h. Given the difficulty of multimodal annotation, we then selected and recorded eight FOMCs conducted by different doctors, with a total duration of approximately 9 h. The video frame rate was 30 FPS, and 87 consultant questions with preserved text were identified.
3.2 Multimodal Transcription
All corpora included in the study are in Mandarin. To demonstrate the temporal coupling among the multimodal resources the doctor uses, we adopted a three-segment transcription format: Chinese Pinyin (the Chinese phonetic alphabet), word-by-word English translation, and transcriptions of nonverbal multimodal resources, with an English translation below each excerpt. The verbal transcripts follow the conversation-analytic conventions of Jefferson (2004).
The nonverbal multimodal resources were first transcribed in the ELAN annotation tool to analyze their temporal relevance. Images of annotations in ELAN and screenshots of doctors are provided when necessary (Mondada 2014a; Norris 2004). To integrate the annotations with the transcription of the verbal modality, this study further referred to and simplified[1] the multimodal transcription conventions proposed by Heath (1986; Heath et al. 2010), Mondada (2011), and Groeber and Pochon-Berger (2014). This approach aims to present, as comprehensively as possible, the synergy among all the modalities doctors demonstrate during free medical consultation conversations. When one or more multimodal resources are used, their specific manifestations are marked with abbreviations in ELAN (e.g., “N” stands for “nodding”), and these resources are transcribed based on the temporal synchronization of the modalities displayed in ELAN. Accordingly, in the textual transcription, each nonverbal response in the third segment corresponds temporally to the verbal response above it, and the category of the nonverbal response is marked at the end of each line.
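The temporal-alignment step described above can be sketched in code. The following is a minimal illustration, not the study’s actual pipeline: it assumes a simplified tab-delimited annotation export (one line per annotation, with tier name, begin time in ms, end time in ms, and value, a reduced form of what ELAN’s tab-delimited text export can produce), and it pairs each verbal annotation with the nonverbal annotations whose time spans overlap it. All tier names and codes here are hypothetical.

```python
from collections import defaultdict

def parse_export(text):
    """Parse a simplified tab-delimited annotation export.

    Each line: tier<TAB>begin_ms<TAB>end_ms<TAB>value
    (a reduced stand-in for ELAN's tab-delimited export;
    real exports may carry extra columns).
    """
    tiers = defaultdict(list)
    for line in text.strip().splitlines():
        tier, begin, end, value = line.split("\t")
        tiers[tier].append((int(begin), int(end), value))
    return tiers

def overlapping(tiers, verbal_tier, nonverbal_tier):
    """Pair each verbal annotation with the nonverbal annotations
    whose time spans overlap it."""
    pairs = []
    for vb, ve, vval in tiers[verbal_tier]:
        hits = [nval for nb, ne, nval in tiers[nonverbal_tier]
                if nb < ve and ne > vb]  # interval-overlap test
        pairs.append((vval, hits))
    return pairs

# Hypothetical example: a nod ("N") spanning the boundary of two TCUs
# read aloud by the assistant overlaps both of them.
sample = ("Assistant\t0\t2000\tTCU1\n"
          "Assistant\t2100\t4000\tTCU2\n"
          "Doctor-HM\t1900\t2300\tN")
print(overlapping(parse_export(sample), "Assistant", "Doctor-HM"))
# [('TCU1', ['N']), ('TCU2', ['N'])]
```

An alignment table of this kind is what licenses statements such as "the nod occurred at the TRP": the overlap test makes the claimed temporal coupling checkable rather than impressionistic.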
4 Three Tasks of Constructing Doctors’ Responses
According to the second corpus selection criterion, all consultations adopted in this study follow the sequence organization of patients’ questions, assistants’ verbal repetition of the questions, and doctors’ responses to the questions. As shown in Extract 1, the doctor’s responsive behavior occurs almost simultaneously with the assistant’s repetition of the question. From the doctor’s perspective, therefore, the doctor-patient interaction is initiated at the very moment the assistant reads out the question. Of the 87 question-answer pairs, 81 featured doctors making multimodal responses while listening to the questions.
Deppermann (2013) argues that participants in a conversation need to accomplish four tasks when starting to construct turns: achieve joint orientation, display uptake, deal with projections from prior talk, and project properties of the turn-in-progress. The first task is achieved as soon as the conversation between the doctor and the patient in FOMCs begins, as the patient’s health is a mutually assumed joint goal for both parties. We will therefore primarily investigate how doctors use multimodal resources to handle the remaining three tasks: how doctors demonstrate their understanding of patients’ questions, respond to them through different modal resources, and project upcoming actions.
4.1 Display Uptake: from K- to K+
While annotating doctors’ responses in ELAN, we found that doctors’ multimodal responses can be divided into two types depending on whether the patient provided adequate information in the question: some responses occurred during the reading of the question and others after it.
4.1.1 Head Movements and Vocal Particles during the Repetition of Questions
The former questions are constructed from several disjointed phrases, nouns, or mere digits, indicating that the patients may have undergone certain physical examinations or have a relatively comprehensive understanding of their own medical history; the latter usually consist of one or two short interrogative sentences depicting vague feelings or asking about a specific symptom rather than the patient’s detailed condition.
In FOMCs, doctors typically use head movements (henceforth HMs) and vocal particles (henceforth VPs) to display their uptake of the first type of question while it is being read. HMs such as nodding and VPs such as “hmm” have been defined as “general responses” that are not tied to the meaning of any particular narrative or point in a narrative (Bavelas and Gerwing 2011). In our corpus, however, these two responses often occur after the turn-construction units (henceforth TCUs) of the questions, and they are sometimes synchronized, as in Extract 2:
Extract 2 (Corpus ID: OP_L_X22)

Figure 3 HMs and VPs at TRP

Translation:
Assistant: Then, the patient says, well, “Now, I have myopia of 800°, and the physical examination showed that I have macular cleavage. Then I just want to ask if surgery is needed for this condition.”
Doctor: Ask him how old he is.
In the question above, the patient first provided two pieces of information, his myopia degree and his macular cleavage, and posed his question at the end: should he undergo surgery given these conditions? In his original text, however, the patient did not clarify who the subject of the surgery was. During the assistant’s repetition of the question, the doctor produced HM and VP responses between the units of the question. Only when the final interrogative sentence was delivered did the doctor open his lips to prepare his response.
A turn is usually composed of one or more TCUs (Sacks et al. 1974); each TCU is a coherent and complete expression, and the completion of a TCU constitutes a possible completion of the turn, projecting a transition relevance place (henceforth TRP) (Clayman 2013). The patient’s text-based question can obviously be regarded as a complete turn, but the assistant’s repetition split it into three TCUs, with the result that the doctor made multimodal responses at every TRP. Indeed, in FOMCs, the multimodal responses doctors use to display uptake stem from the pragmatic need to interact with the assistant, and this pragmatic need is closely related to the semantic and syntactic structure of the question raised by the patient. The higher-level action of consultation with patients and that of co-live broadcasting with the assistant thus influence each other.
In Extract 2, the patient had obtained physical examination results before the consultation. He was therefore in a K+ status on the epistemic gradient of his own medical condition and history, and further displayed a K+ epistemic stance through his relatively detailed question. The doctor responded to each piece of information the patient provided by nodding and saying “hmmm”, which, according to Deppermann (2013), express confirmation. The epistemic gradient between the two parties was thus gradually balanced. However, as line 4 shows, the information the patient provided was still insufficient. Instead of following the rule of the “question-answer” adjacency pair, the doctor asked a follow-up question about the patient’s age. As shown in Extract 2, before the assistant had fully repeated the question, the doctor had already opened his lips in preparation for asking about the patient’s age; at the end of line 3, the assistant’s repetition and the doctor’s vocal preparation overlapped to some degree. Occupying the K+ status on the epistemic gradient of ophthalmic health, the doctor judged that without the patient’s age it would be difficult to advise on whether to undergo surgery, and therefore instructed the assistant to request this information from the patient. The doctor established his identity as an expert (Ma and Gao 2018) and held the floor by simultaneously displaying a K- stance toward the patient’s personal information and a K+ stance in the field of ophthalmic knowledge.
4.1.2 Body Postures After the Repetition of Questions
Doctors also use body postures (henceforth BPs) and gaze (henceforth GZ) to display uptake, especially when they receive relatively short questions. The constant transition of gaze is clearly shown in Extract 1, whereas BPs usually precede or accompany doctors’ verbal responses, occupying the turn-initial position. Most doctors in FOMCs sit behind a desk on which their arms are placed, with the camera and assistant usually in front of them. If doctors lean forward while listening to a question, they may lean backward to signal their turn initiation, as shown in Extract 3:
Extract 3 (Corpus ID: GD_L_x5)

Figure 4 The change of BP

Translation:
Assistant: What are the treatment methods for ovarian adhesions?
Doctor: For ovarian adhesions, first, we need to clarify whether treatment is necessary.
As demonstrated in Extract 3, in line 1, while the assistant was reading the question aloud, the doctor maintained a forward-leaning posture with her hands placed on the table in front of her. After line 1, a silent interval of roughly 1 s occurred (line 2), during which the direction of the doctor’s gaze could not be identified. In line 3, the doctor inhaled gently and her gaze refocused on the camera, indicating the completion of her thinking. Roughly 0.5 s later, her posture shifted to a reclined position and she simultaneously initiated her responsive turn.
From the perspective of the epistemic gradient, the doctor’s posture changes index a transition from K- to K+ status: the question posted by the patient was too short and vague to be answered instantly, so the doctor needed 1.5 s to organize her thoughts. During the silence, she remained in the forward-leaning posture (LF), which corresponded to the cognitive processing phase and marked the transition from K- to K+ status. The backward-leaning posture (LB) then signaled the doctor’s entry into K+ status: her forearms left the desk and rested naturally, and with her upper body fully relaxed on camera, she established an “authoritative posture” for the subsequent professional response. This process also coordinates the two higher-level actions of co-live broadcasting and consultation: the forward lean responds to the assistant’s background action of conveying the question, while the backward lean shifts to the foregrounded consultation action with the patient. The posture freeze during the transition phase avoids abrupt action shifting and maintains the fluency of interaction in the digital context. Notably, such marked BP changes (from forward to backward leaning) mostly occur when patients pose short questions with vague information; for detailed, long questions such as that in Extract 2, doctors mostly maintain a neutral sitting posture and display uptake only through subtle HMs and VPs. This suggests that the visual intensity of body postures is negatively correlated with the information sufficiency of patients’ questions: the vaguer the information, the more marked the posture changes. Such changes help compensate for the delay in verbal uptake and ensure an effective balancing of the doctor-patient epistemic gradient.
4.2 Deal with Projections from Prior Talk: the Information Maximization Strategy
The expectations that a turn sets up for the next turn are its projections (Deppermann 2013). As shown in the prior extracts, the projections of questions in FOMCs are rather fixed, since the conversation between doctors and patients conforms to the rules of institutional discourse. The granularity of those projections, however, varies with question type. The 87 questions can be categorized into three types:
47 polar questions that project yes-or-no responses;
33 ask-for-help questions that project diagnosis or treatment advice;
9 questions without a specific request.
In two conversations the patients posed both a type-1 and a type-2 question; the three counts therefore sum to 89 rather than 87, as these two questions are counted under both types. Polar questions project responses of rather low granularity, whereas type-2 and type-3 questions project responses of rather high granularity. In FOMCs, however, doctors always provide more than enough information to answer the questions regardless of their type. Gestures can be classified into four categories: iconic gestures, metaphoric gestures, deictic gestures, and beat gestures (Norris 2004). A gesture unit is an excursion of movement by a body part, divisible into three phases: preparation, stroke, and recovery (Kendon 2004). We found that doctors in FOMCs were prone to resort to two types of gestures (henceforth GEs) when applying the information maximization strategy, namely gestures for listing and gestures for embodiment.
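The arithmetic behind these figures can be made explicit. The per-type counts (47 + 33 + 9 = 89) exceed the 87 distinct questions by exactly 2, which is consistent with two questions carrying both a type-1 and a type-2 label. The following minimal multi-label tally, using hypothetical shorthand labels rather than the study’s coding scheme, reproduces this:

```python
from collections import Counter

def tally(label_sets):
    """Multi-label tally: each question may carry more than one
    question-type label, so label totals can exceed question totals."""
    counts = Counter()
    for labels in label_sets:
        counts.update(labels)
    return counts

# Hypothetical reconstruction of the corpus: 45 purely polar questions,
# 31 purely ask-for-help, 9 with no specific request, and 2 carrying
# both labels (45 + 31 + 9 + 2 = 87 distinct questions).
questions = ([{"polar"}] * 45 + [{"help"}] * 31
             + [{"none"}] * 9 + [{"polar", "help"}] * 2)
counts = tally(questions)
print(len(questions), counts["polar"], counts["help"], counts["none"],
      sum(counts.values()))
# 87 47 33 9 89
```

Counting per label rather than per question is a deliberate choice here: a question that both requests a yes/no verdict and asks for advice projects both kinds of response, so collapsing it into a single category would misstate its projection granularity.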
4.2.1 Beat Gestures for Listing
Since doctors in FOMCs sit behind desks, their arms and heads are typical gesture resources that they can use. Sometimes both modes are used to list all the possible diagnosis and treatment in the form of beat gestures, such as in Extract 4:
Extract 4 (Corpus ID: GY_H_X10)

Figure 5 GEs and HMs in the line 2 and line 3

Figure 6 GEs and HMs in the line 4 and line 5

Figure 7 GEs and HMs in the line 6 and line 7

Figure 8 GEs and HMs in the line 8 and line 9
Translation:
A: Well, for a 37-year-old woman with an AMH level of 0.75 ng/mL, 2–3 follicles on each ovary, and normal sperm quality in the male partner, what ovarian stimulation protocol do you think is suitable?
D: It doesn’t just depend on my assessment alone; this also needs to be carried out based on her own monthly condition. Of course, you can use the antagonist protocol if you want, and it’s also acceptable to use the PPOS (Progestin-Primed Ovarian Stimulation) protocol to accumulate embryos. However, it still needs to be determined by combining her specific condition that month, her hormone levels, her height, weight, and other factors such as her comprehensive medical history.
In Extract 4, the patient first provided some physical examination information and asked the doctor which ovulation induction protocol was suitable for her. In line 2, the doctor did not respond directly to the projection of treatment advice. Instead, she provided a dispreferred answer, which differs structurally from a preferred one in that it incorporates inter-turn and turn-initial delays, prefaces, accounts, and mitigations (Lee 2013): her own assessment alone could not serve as an appropriate recommendation. In this case the dispreferred response was followed by accounts and mitigations. In line 3, the doctor gave her first account, that her diagnosis needed to be based on the patient’s specific monthly situation. At exactly this point she began to gesture (PRE1), her hands tapping each other with vertical palms (PVT1 and PVT2). The hands did not return to their home position but were frozen in the post-stroke hold phase (PH1) at the end of the turn, indicating that the doctor retained her speakership (Sacks and Schegloff 2002). In lines 4 and 5, she listed two relatively common ovulation induction protocols: the antagonist protocol and the PPOS protocol (Deng et al. 2025). According to the multimodal transcription, the utterances about these two protocols were accompanied by two mutual taps with palms downward (PDT2 and PDT3) as well as two nods (N7 and N8). These two lines served as a mitigation of the dispreferred response and also laid the groundwork for the next four lines as accounts. In lines 6, 7, 8, and 9, the doctor restated her view that a comprehensive diagnosis needed to be based on the patient’s specific personal conditions and listed these conditions: her hormone levels, height, weight, and “other factors”.
While saying these four noun phrases, the doctor performed, respectively, a mutual hands-tap with palms upward (PUT), an outward turning of both hands (HTO2), a hands-holding gesture with palms upward (PUH), and a hands-waving gesture with palms upward (PUW2), each accompanied by a nod (N15, N16, N17, N18). In the final summary stage, when the doctor said “a comprehensive medical history”, her gestures entered the recovery phase (REC).
4.2.2 Deictic Gestures for Embodiment
As the carrier of illness, the patient’s body has always been the foundation of topics in doctor-patient consultations. While patients are often in a K+ status regarding their own medical history, they are in a K- status regarding general human anatomy. To provide enough medical information, doctors not only use beat gestures to list diagnoses or treatments but also employ deictic gestures to explain recondite concepts or ideas in their verbal responses. Specifically, we found that doctors mainly rely on anatomical models or their own body parts to illustrate the site of the condition, as shown in Extract 5 and Extract 6.
Extract 5 (Corpus ID: OR_B_X5.1)

Figure 9 GEs with a model in line 1, line 2, and line 3

Figure 10 GEs with a model in line 4, line 5, and line 6
Translation:
Many fractures are often diagnosed by doctors through palpation and experience: doctors feel the location of the fracture through the skin, muscles, and ligaments, and then use some functional changes to further (a word hard to identify) assess the reduction status of the fracture. This is often a functional reduction.
Extract 5 is an excerpt from an orthopedic doctor’s consultation, which started from a patient’s question about whether surgery was needed for a left radial head fracture with anterior dislocation of the elbow joint, along with nearby bone fragments. During his explanation, the doctor introduced two treatment approaches: anatomical reduction from Western medicine and functional reduction from traditional Chinese medicine. To illustrate how traditional Chinese doctors conducted functional reduction examinations without technological tools such as CT scans in the past, he used a medical anatomical model near him. Interestingly, the model he showed was not of the elbow joint but of the spine. In the excerpt, the doctor did not mention specific body parts at all from the moment he fetched the model (MF) to the moment he returned it (MR); the model’s purpose was to demonstrate the palpation technique used in traditional Chinese medicine. In line 2, the doctor explained that, to locate patients’ fractures, traditional Chinese doctors used their hands to palpate skin, muscles, and ligaments respectively. He noted that, being attached directly to bones, ligaments lie much closer to them than the other two tissues. Therefore, when talking about the “skin” and “muscles”, he used the model as a baton to perform two beat gestures (MW1 and MW2), his left hand’s fingers pinching it tightly as in Image 3. In contrast, when referring to “ligaments”, he demarcated their range on the model (MD). This gesture, which points to the physical location of the tissue, can thus be classified as deictic (Norris 2004). Moreover, as shown in Image 4, the doctor specified the meaning of his verbal response by keeping his left hand a short distance from the model, precisely simulating the inherent thickness of the ligaments themselves (Kendon 2004).
In line 3, to emphasize the palpation method of “feeling,” the doctor once again pinched the model with his left hand’s fingers as in Image 5 (MP2).

Since the patient in Extract 5 did not provide his age, the doctor responded to him with the information maximization strategy. He not only presented two treatments, from the perspectives of traditional Chinese medicine and Western medicine, but also pointed out that the choice of surgery needed to take the patient’s age into account. He further contended that young people have greater demands on elbow joint function, making surgery the better option, whereas for elderly patients whose basic elbow function is preserved, taking the risk of surgery is not recommended. When defining the basic functions of the elbow joint, the doctor used his own right arm as a multimodal resource to give an embodied response, as shown in Extract 6:
Extract 6 (Corpus ID: OR_B_X5.2)

Figure 11 GEs by the doctor’s right arm
Translation:
D: A radial head fracture mainly affects two aspects: one is the flexion and extension of the elbow joint, and the other is the rotation function of the forearm.
As can be seen from the transcription, in line 1, the doctor’s gestures entered the preparation phase only after he finished saying “radial head fracture”. However, before mentioning the two basic functions of the elbow joint in his verbal response, the doctor had already completed a rotation movement (FR1). In line 2, when introducing the first basic function, the doctor’s elbow flexion movement (EF) occurred prior to the corresponding speech content. In lines 3 and 4, the doctor performed three more sets of forearm rotation movements (FR2, FR3, FR4) before the word “rotation” was uttered. Based on this, we can conclude that when doctors use embodied multimodal resources to respond, these resources have their own structure and often appear prior to the corresponding verbal expressions.
Moreover, embodied multimodal resources may differ from speech not only in sequence but also in pragmatic meaning. In his verbal response, the doctor pointed out that a radial head fracture can affect two basic functions of the elbow joint: the flexion and extension of the elbow, and the rotation of the forearm. In addition, the doctor made an embodied response by using his own right arm. Unlike the verbal content of the response, what the doctor demonstrated was not the functions of an elbow joint affected by a radial head fracture, but those of a healthy person. As shown in Image 6 and Image 7 respectively, the doctor’s own elbow joint possesses these two basic functions. In Extract 6, the doctor used his right arm to demonstrate the speed, force, and maximum range of angles that a normal elbow joint could present. Again, by utilizing nonverbal multimodal resources, the doctor conveyed information that was not mentioned in his verbal response.
4.3 Project Properties of the Turn-In-Progress: Interactional Patterns and Deviant Cases
Three kinds of projection are involved in the emerging turn, namely establishing a topic, projecting an upcoming action, and framing the turn (Deppermann 2013). The first refers to a topic newly established with respect to the prior talk. Given the institutional nature of FOMCs, doctors stick to the topic of the questions, so the first kind of projection does not occur in our corpus. For the second, doctors shifted their gaze from the assistant to the camera or changed their body posture (as in Extract 3) to project the upcoming action, namely a direct answer to the patient, regardless of whether the information provided in the question was sufficient. The third projection concerns the evaluative, emotional, and epistemic stances displayed at the turn-beginning. As shown in Extract 4, in the absence of the patient’s age information, doctors may showcase their K- stance by issuing a disclaimer at the beginning of the response and then resort to the information maximization strategy. This “question-answer” adjacency pair, which indexes the default identities of “patient-doctor”, constitutes the interactional pattern of FOMCs.
In some cases, however, doctors may depart from this pattern by inserting an extra question or instruction to the assistant, resulting in what Wu (2021) terms a deviant case; the instruction may be projected by gestures, as in Extract 7:
Extract 7 (Corpus ID: OP_L_X22)

Figure 12 Line 03–07

Figure 13 Line 08–10
Translation:
Assistant: Then I just want to ask if surgery is needed for this condition.
Doctor: Ask him how old he is.
Assistant: Well, he didn’t mention it.
Doctor: He didn’t say so, right?
Assistant: He didn’t mention how old he is.
Doctor: I don’t know. Well, here’s the thing: Generally speaking, based on my judgment, if you have 800 degrees of myopia, macular schisis should only occur when you’re at least 35 years old.
Extract 7 is the continuation of Extract 2, in which the doctor asked a follow-up question instead of providing a direct answer to the polar question of whether surgery was needed. This follow-up question, without any politeness marker such as “Can/Could/Would you” (Levinson 2013), is in essence an instruction to the assistant. According to the word order of the Chinese expression in line 4, however, the whole sentence can be further divided into two parts: the question “how old (is he)” and the instruction “you ask him briefly”. In other words, it is a compound imperative sentence in which the verb of the instruction is preceded by its object.
From a multimodal perspective, this instruction was carried out by the doctor’s verbal expression as well as a pointing gesture. Participants in a conversation may use deictic gestures (or deixis) to specify referents, including people or objects in the physical world as well as abstract notions (Kendon 2004), or to project self-selection and establish incipient speakership (Mondada 2007). According to the transcription of line 3, the doctor had prepared to give the instruction, as he opened his lips before the assistant finished reading the question. Almost simultaneously, with both hands clenched tightly and placed on the table, his index fingers entered the preparation phase of the pointing gesture, as shown in Image 8. The stroke of the gesture accompanied the verbal question “how old” rather than the pronouns “you” or “him”, and the hands quickly retreated to their home position before the verbal instruction ended, indicating that the gesture was not only a deictic one used to hold the floor and underscore the question, but also a beat gesture projecting the instruction. To be specific, the gesture, pointing to the assistant, started before the assistant finished repeating the question (PRE1), reached its apex while the doctor proposed his own question (PO1), and recovered before the doctor finished uttering the instruction (REC1).

We have analyzed in Extract 2 how, while listening to the question, the doctor gradually balanced his epistemic gradient concerning the patient’s medical history, which was reported in the form of multiple TCUs. As an experienced expert, the doctor had made some assumptions before his verbal response. In line 5, the assistant initiated her reply with an elongated exclamation “we:ll”, indicating that she was searching for the answer. This hesitation also projected the following negative word “didn’t”, as the doctor interrupted the assistant’s reply in line 6. At the same time, the doctor used his right hand to prop up his cheek (PC&RH), and the gesture continued across the next five lines, spanning his two higher-level actions with both the assistant and the patient. The gesture therefore projects the initiation of the doctor’s verbal response to the patient, since the negative word marked the epistemic gradient about the patient’s personal condition that the doctor could reach for the time being. The shift of interactional target can also be verified by the change in address term for the patient from “him” in line 4 to “you” in line 9. After a silence of about 1 s, the doctor began to respond to the patient in line 8, preceded by a self-repair “I don’t know <”. According to the response pattern we found above, the complete sentence may have been “I don’t know your age”, which projected his low epistemic stance and the upcoming information maximization strategy. In line 10, however, based on his years of medical practice, the doctor made a specific guess about the patient’s age. It was at the exact moment when the number “35” was uttered that he finally shifted his gaze toward the camera, which represented the real patient. This judgement was further emphasized by a second pointing gesture, which started with the verb phrase “cái huì chūxiàn (should only occur)”.
5 Discussion and Conclusion
Institutional talk has traditionally been categorized into interactions occurring in formal and non-formal settings, depending on whether the talk takes place in private or public contexts (Drew and Heritage 1992). However, this binary distinction does not fully apply to FOMCs, where the discourse simultaneously involves private exchanges with online patients and public co-broadcasting with the assistants. According to the foreground-background continuum proposed by Norris (2004), these constitute two higher-level actions that coexist within the same communicative event.
This study argues that these two actions are inherently interwoven for two main reasons. First, the private consultation sequence serves as the fundamental interactional unit of the public broadcast. Second, the online patients are disembodied through digital text and subsequently partly re-embodied through the assistants’ mediated performance. Consequently, doctors’ responses in FOMCs---whether verbal or nonverbal---are situated, dynamic, and multimodally constructed. They represent complex multimodal Gestalts (Mondada 2014b) that integrate bodily, spatial, and technological resources in real time, thereby expanding our understanding of institutional interaction in digitally mediated medical contexts.
On the one hand, from the perspective of private consultations, the absence of a physically present patient can make it difficult for doctors to determine the addressee of their talk. For instance, in Extract 4, the doctor consistently employed third-person deixis (e.g., “her”, “herself”) to refer to the patient, whereas in Extract 7, the doctor oriented to the assistant when displaying uptake but directly addressed the patient when providing medical advice. These variations suggest that some doctors keep their responses to patients in the background. In such cases, the patient’s condition becomes a topic negotiated between the doctor and the assistant rather than a direct site of doctor-patient interaction. This finding challenges the notion of patient-centeredness emphasized in previous OMC studies and highlights the shifting interactional focus in assistant-mediated online consultations.
On the other hand, the interaction between the assistant and the doctor facilitates the doctor’s responses in the public stream. The assistant’s role as an embodiment of the patient provides the doctor with a real interlocutor. The doctor’s use of multimodal resources, especially HMs and GEs with anatomical models, elicits immediate feedback from the assistant, enabling the doctor to participate more naturally in the conversation. In some cases, the assistant also proposes follow-up questions after the doctor’s response, helping other live viewers gain more medical knowledge. In this way, the popular science function of FOMCs is also highlighted.
In terms of conversational structure, most FOMCs follow a “question-answer” turn-taking pattern. As the epistemic gradient gradually balances while the doctor listens to the question, the doctor responds to each TCU, with the nonverbal modality acting as the dominant mode during this process. During turn transitions, the doctor’s nonverbal modality may precede or coincide with the verbal modality. In the formal response phase, however, the verbal modality still occupies the dominant position. Nevertheless, in contexts with high modal density, nonverbal modalities sometimes supplement content that is difficult to articulate precisely through language. For example, in Extract 5, even though the doctor was not holding an anatomical model of the elbow joint, he was able to explain the diagnostic and treatment methods to the patient through the coordination of verbal expression and GEs.
In conclusion, this study has examined assistant-mediated free online medical consultations to uncover how doctors employ multimodal resources in constructing their responses. Drawing on approximately 9 h of video data annotated with the ELAN tool, it explored the types and functions of multimodal resources used by doctors, their coordination with verbal responses, and the communicative purposes they serve. First, adopting a multimodal interaction analysis approach, the study demonstrated how doctors strategically mobilize various nonverbal resources---including gaze, facial expressions, head movements, vocal particles, body postures, and gestures---to construct responses in close coordination with speech. Second, drawing on Deppermann’s (2013) notion of turn-construction tasks, we found that doctors typically employ multimodal resources to accomplish three major interactional goals: displaying uptake of patient information, managing the projections embedded in patients’ questions, and projecting upcoming actions such as providing additional instructions to assistants. Finally, the study identified a consistent “information maximization” strategy in doctors’ responses, whereby they frequently provide detailed explanations supported by listing gestures and embodied demonstrations. Such practices enhance the clarity, persuasiveness and epistemic authority of their talk, revealing how multimodal resources contribute to professional stance-taking and patient engagement in digitally mediated medical settings.
The findings elaborated in this study may have far-reaching implications, spanning both practical and theoretical spheres. Practically, the study provides insights for OMC platform design and doctor training. During corpus collection, it was observed that only the heads of some doctors appeared on camera, limiting the visibility of their gestures and postural cues. Given that multimodal resources are crucial for managing turn-taking and elaborating verbal responses, doctors in FOMCs should ensure that their upper bodies remain visible during live streams. Furthermore, since each consultation turn allows patients to submit only one question, patients are advised to adopt a detailed elaboration strategy and avoid omitting key information, such as age-related details, to enhance communicative efficiency. Theoretically, this study advances multimodal interaction research by explicating how doctors coordinate verbal and nonverbal modalities in assistant-mediated FOMCs. Future research could conduct comparative analyses between online and offline consultations to examine whether doctors employ similar or distinct multimodal strategies when interacting with assistants role-playing as patients versus real patients. Such studies might focus on how both groups respond to doctors’ multimodal cues and how these resources contribute to mutual understanding. Qualitative multimodal analyses involving the same doctors and comparable patient cases may further illuminate these questions, ultimately enabling online assistants to interact better with doctors and to effectively emulate real patient behaviors in live-streamed consultations.
Appendix: Transcription Conventions
Verbal response
- *___> : the action described continues across subsequent lines
- >>_ : the action described begins before the extract’s beginning
- PTC : particle
- C : classifier
- NOM : nominative (de)
Gaze (GZs)
- ? : the direction of gaze is unidentified
- C : camera
- B : blink
- A : assistant
Head movements (HMs)
- T : a slight turn of the head
- N : nodding
- Down : head turning down
- Up : head turning up
Vocal particles (VPs)
- H : hm
- I : inhaling
Body posture (BPs)
- LF : lean forward
- LB : lean backward
Gesture (GEs)
- PRE : preparation
- PVT : palms vertical tapping
- PH : post-stroke hold
- PDT : palms down tapping
- HTI : hands turning inwardly
- HTO : hands turning outwardly
- PUW : palms up waving
- PUT : palms up tapping
- PUH : palms up hold
- REC : recovery
- PO : pointing
- PC&RH : propping cheek up with right hand
Gesture with anatomical model (also GEs)
- MF : fetching the model
- ML : lifting the model
- MP : palpating the model
- MW : waving the model
- MD : demarcating the range on the model
- MR : returning the model
- LHD : left hand dangling in the air
- LHB : left hand beat
- EF : elbow flexion
- EE : elbow extension
- FR : forearm rotation
References
Bavelas, J. B., and J. Gerwing. 2011. “The Listener as Addressee in Face-to-Face Dialogue.” International Journal of Listening 25 (3): 178–98. https://doi.org/10.1080/10904018.2010.508675.
Clayman, S. E. 2013. “Turn-Constructional Units and the Transition-Relevance Place.” In The Handbook of Conversation Analysis, edited by J. Sidnell, and T. Stivers, 150–66. Oxford: Wiley-Blackwell. https://doi.org/10.1002/9781118325001.ch8.
Deng, H., Q. Dou, P. Guo, H. Liu, Y. Xiang, X. Geng, et al. 2025. “Analysis of the Effects of Three Ovulation Induction Protocols in Patients with Expected Low Ovarian Response.” Journal of Medical Research on Combat Trauma Care 38 (6): 625–31.
Deppermann, A. 2013. “Turn-Design at Turn-Beginnings: Multimodal Resources to Deal with Tasks of Turn-Construction in German.” Journal of Pragmatics 46 (1): 91–121. https://doi.org/10.1016/j.pragma.2012.07.010.
Drew, P., and J. Heritage. 1992. Analyzing Talk at Work: An Introduction. Cambridge: Cambridge University Press.
Groeber, S., and E. Pochon-Berger. 2014. “Turns and Turn-Taking in Sign Language Interaction: A Study of Turn-Final Holds.” Journal of Pragmatics 65: 121–36. https://doi.org/10.1016/j.pragma.2013.08.012.
Heath, C. 1986. Body Movement and Speech in Medical Interaction. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511628221.
Heath, C., J. Hindmarsh, and P. Luff. 2010. Video in Qualitative Research: Analysing Social Interaction in Everyday Life. London: SAGE Publications Ltd. https://doi.org/10.4135/9781526435385.
Heritage, J. 2013. “Epistemics in Conversation.” In The Handbook of Conversation Analysis, 370–94. Oxford: Wiley-Blackwell. https://doi.org/10.1002/9781118325001.ch18.
Herring, S. C. 2004. “Computer-Mediated Discourse Analysis: An Approach to Researching Online Behaviour.” In Designing for Virtual Communities in the Service of Learning, edited by S. A. Barab, R. Kling, and J. H. Gray, 338–76. New York, NY: Cambridge University Press. https://doi.org/10.1017/CBO9780511805080.016.
Jefferson, G. 2004. “Glossary of Transcript Symbols with an Introduction.” In Conversation Analysis: Studies from the First Generation, edited by G. H. Lerner, 13–31. Amsterdam: John Benjamins Publishing Company. https://doi.org/10.1075/pbns.125.02jef.
Kendon, A. 2004. Gesture: Visible Action as Utterance. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511807572.
Lee, S.-H. 2013. “Response Design in Conversation.” In The Handbook of Conversation Analysis, edited by J. Sidnell, and T. Stivers, 415–32. Oxford: Wiley-Blackwell.
Levinson, S. C. 2013. “Action Formation and Ascription.” In The Handbook of Conversation Analysis, edited by J. Sidnell, and T. Stivers, 103–30. Oxford: Wiley-Blackwell. https://doi.org/10.1002/9781118325001.ch6.
Ma, W., and Y. Gao. 2018. “A Study on the Same-Turn Self-Repair in Chinese Doctor-Patient Interaction.” Journal of Foreign Languages 41 (3): 42–55.
Mao, Y. S., and X. Zhao. 2019. “I Am a Doctor, and Here Is My Proof: Chinese Doctors’ Identity Constructed on the Online Medical Consultation Websites.” Health Communication 34 (13): 1645–52. https://doi.org/10.1080/10410236.2018.1517635.
Mao, Y. S., and X. Zhao. 2020. “By the Mitigation One Knows the Doctor: Mitigation Strategies by Chinese Doctors in Online Medical Consultation.” Health Communication 35 (6): 667–74. https://doi.org/10.1080/10410236.2019.1582312.
Mondada, L. 2007. “Multimodal Resources for Turn-Taking: Pointing and the Emergence of Possible Next Speakers.” Discourse Studies 9 (2): 194–225. https://doi.org/10.1177/1461445607075346.
Mondada, L. 2011. “Understanding as an Embodied, Situated and Sequential Achievement in Interaction.” Journal of Pragmatics 43 (2): 542–52. https://doi.org/10.1016/j.pragma.2010.08.019.
Mondada, L. 2014a. “Instructions in the Operating Room: How the Surgeon Directs Their Assistant’s Hands.” Discourse Studies 16: 131–61. https://doi.org/10.1177/1461445613515325.
Mondada, L. 2014b. “The Local Constitution of Multimodal Resources for Social Interaction.” Journal of Pragmatics 65: 137–56. https://doi.org/10.1016/j.pragma.2014.04.004.
Norris, S. 2004. Analyzing Multimodal Interaction: A Methodological Framework. New York, NY: Routledge. https://doi.org/10.4324/9780203379493.
Ren, Y. X., and X. Q. Li. 2023. “A Genre Analysis of Discourse Structure of Online Medical Consultations: Based on the Doctors’ Responses to Patients’ Consultations from Dingxiang Yisheng Website.” Foreign Language and Literature Research (Serial) (02): 2–17.
Ren, Y. X., and X. Q. Li. 2025. “Constructing Patients’ Identities in Online Medical Consultations.” Foreign Language and Literature Studies 42 (1): 36–51.
Sacks, H., and E. A. Schegloff. 2002. “Home Position.” Gesture 2 (2): 133–46. https://doi.org/10.1075/gest.2.2.02sac.
Sacks, H., E. A. Schegloff, and G. Jefferson. 1974. “A Simplest Systematics for the Organization of Turn-Taking for Conversation.” Language 50: 696–735. https://doi.org/10.2307/412243.
Wang, X. J., Y. S. Mao, and Y. Qian. 2021. “From Conditions to Strategies: Dominance Implemented by Chinese Doctors During Online Medical Consultations.” Journal of Pragmatics 182: 76–85. https://doi.org/10.1016/j.pragma.2021.06.011.
Wu, Y. 2021. “A Conversation Analytic Approach to Identity.” Journal of Foreign Languages 44 (3): 49–59.
Zhang, Y. 2021. “Dynamic Power Relations in Online Medical Consultation in China: Disrupting Traditional Roles Through Discursive Positioning.” Chinese Journal of Communication 14 (4): 369–85. https://doi.org/10.1080/17544750.2021.1891556.
Zhang, D. L., and Z. Wang. 2016. “Theoretical Framework for Multimodal Interaction Analysis.” Foreign Languages in China 13 (2): 54–61.
Zhao, Y., and Y. S. Mao. 2025. “Metapragmatic Identity Management in Supportive Interactions Among Patients in Online Health Communities.” Foreign Language and Literature 42 (1): 66–82, 134–135.
Zhao, Y., and F. Yuan. 2023. “Discursive Construction Strategies of Disease Anxiety by Patients in Online Context.” Journal of Zhejiang International Studies University (6): 19–26.
© 2025 the author(s), published by De Gruyter on behalf of Shanghai International Studies University
This work is licensed under the Creative Commons Attribution 4.0 International License.