Article Open Access

Biomedical event extraction using pre-trained SciBERT

  • Dimmas Mulya and Masayu Leylia Khodra
Published/Copyright: December 31, 2023

Abstract

Biomedical event extraction is applied to biomedical texts to obtain a list of events within the biomedical domain. The best GENIA biomedical event extraction research uses sequence labeling techniques with a joint approach, a softmax decoder for event trigger identification, and the BioBERT v1.1 encoder. However, this event extraction model has three drawbacks: tasks are carried out independently, it provides no special handling of multi-label event trigger labels, and it uses an encoder whose vocabulary comes from non-biomedical domains. We propose using the pipeline approach to pass information forward between tasks, a sigmoid decoder to address multi-label event trigger labels, and alternative BERT encoders with vocabularies from the biomedical domain. The experiments showed that the performance of the biomedical event extraction model increased after the encoder was replaced with one built on a biomedical-specific vocabulary. Changing the encoder to SciBERT while still using the joint approach and softmax decoder increased the precision by 4.22 points (reaching 69.88) and resulted in an F1-score of 58.48.

MSC 2010: 68T50

1 Introduction

1.1 Overview

Along with technological developments in both the computational and biological fields, biomedical literature has also increased, making it difficult to retrieve information manually. This poses a new challenge in natural language processing (NLP) for extracting information from biomedical literature, which contains vocabulary not commonly used elsewhere. Information extraction is a task to retrieve one or several facts in natural language, either in text or in sound, which are then represented in a structured form [1]. NLP is a branch of artificial intelligence that processes natural language, aiming to obtain the information or knowledge contained therein [2,3]. Biomedical NLP, also known as BioNLP, operates in the biomedical field and focuses on several tasks related to managing biomedical texts. One of these tasks is extracting information on biomedical events (biomedical event extraction) [4].

1.2 Biomedical event extraction and motivation

Event extraction is a form of information extraction from text data that identifies the entities involved in an event and the relationships between them [5,6]. The primary purpose of event extraction is to obtain 5W1H (“who,” “when,” “where,” “what,” “why,” and “how”) information from text data [5,6]. Event extraction in the general domain differs from that in the biomedical domain. In the biomedical domain, event extraction focuses on identifying interactions between biomolecules and other biomolecules [6]. Therefore, extraction often involves named-entity recognition (NER) and relation extraction (RE) [5–8]. Various approaches are used for event extraction, such as machine learning, but the key to each stage is the need for a predefined event type [5,6,8,9]. Event types can be constructed manually by experts or with the help of machine learning and then used for subsequent event extraction [5,6,8,9].

Biomedical event extraction research is growing rapidly in English, and one of the methods that can be used is sequence labeling [10]. Sequence labeling is a task used to label each token in a sentence. It is commonly applied in NER and part-of-speech tagging to classify the name or role of the entity in each token of a sentence [11]. In the biomedical event extraction research conducted by Ramponi et al. [10], the implementation of sequence labeling focused on NER with the aid of a BERT encoder, specifically BioBERT v1.1, which was trained on biomedical texts. However, research on BERT encoders indicates that BioBERT has lower performance than SciBERT and PubMedBERT in the case of NER [12,13]. This is because SciBERT is trained from the ground up with a biomedical-specific domain vocabulary, whereas BioBERT initially uses the vocabulary of the original BERT and then continues training with text sources from PubMed and PMC.

Ramponi et al. [10] developed a model for event extraction using a multi-task learning approach. The first task classified a multi-class event trigger from each token in a sentence, followed by a second task that performed RE and identified proteins or events related to the trigger event through multi-label classification. Multi-label classification is used when an item can be classified into one or more classes, and it involves assigning a value of 0 or 1 to determine the class’s participation [14]. This differs from multi-class classification, where an item can only belong to one class. In the biomedical domain, multi-label classification is widely used for various purposes, such as categorizing biomedical literature [15].
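The distinction between the two target encodings can be made concrete with a small sketch (our own illustration, using a hypothetical subset of trigger types, not the authors' code):

```python
# Illustrative sketch: multi-class vs multi-label target encoding for
# event trigger labels. TRIGGER_TYPES is a hypothetical subset.

TRIGGER_TYPES = ["Gene_expression", "Transcription", "Phosphorylation"]

def multi_class_target(label):
    """Multi-class: exactly one class index per token."""
    return TRIGGER_TYPES.index(label)

def multi_label_target(labels):
    """Multi-label: a 0/1 indicator per class; several may be 1 at once."""
    return [1 if t in labels else 0 for t in TRIGGER_TYPES]

# A token such as "expression" may trigger two event types at once,
# which only the multi-label encoding can represent directly:
print(multi_class_target("Transcription"))                       # 1
print(multi_label_target({"Gene_expression", "Transcription"}))  # [1, 1, 0]
```

Under multi-class classification, such a two-label token would instead have to become its own combined class, which is the source of the imbalance discussed below.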

However, there are several limitations in the model developed by Ramponi et al. [10], as follows:

  • In the context of multi-task learning, the applied approach is the joint approach, in which each task is executed independently, resulting in a lack of inter-task information exchange. This phenomenon can lead to overfitting due to recurring word token patterns in different event trigger types. Additionally, the relationship patterns between proteins and other events as entities within an event are also determined by the word token patterns used in the respective event trigger types [16].

  • In the task of event trigger classification, a single-word token may represent more than one type of event trigger. For example, the word “expression” in the sentence “PMA induced the expression of both CD14 and CD23 mRNA and protein.” carries two types of event triggers (Gene expression and Transcription). If this is approached using multi-class classification, where each multi-label combination is regarded as a single class, it can lead to overfitting on the classes that occur more frequently. Furthermore, it can impact the performance of both tasks.

  • BioBERT v1.1 encoder, which is constructed by continuous training on the original BERT model with biomedical text, exhibits sub-optimal performance compared to BERT models trained entirely from scratch using biomedical text. This disparity can significantly impact the overall model performance, as the model heavily relies on the encoder’s representations provided for the event extraction task.

1.3 Research contribution

In this research, modifications were applied to the previous biomedical event extraction model [10] to address the limitations mentioned in Section 1.2. This research article contributes a modification to the model architecture, enabling the model to handle multi-label classification and replacing the encoder with a newer one. The following are the contributions of this research:

  • The information between tasks is crucial in determining the output result. Instead of the joint method, this research adopts the pipeline method to share information between the NER and RE tasks. Theoretically, this change is expected to improve the performance of extracting biomedical events, as evaluated using the F1-score, precision, and recall metrics.

  • Previous research has shown that the performance of RE with BERT can be improved by applying token masking to sentences to avoid overfitting on word patterns [10,17]. In this study, we use token masking to avoid overfitting. The output of the first task (event trigger classification) first goes through the masking process and then serves as input for the second task.

  • The performance of event trigger identification tasks, which exhibit multi-label characteristics with a dominant single label, is enhanced by changing the output layer from softmax to sigmoid.

  • The substitution of the encoder trained on specific domains can improve the semantic representation of each word. Semantic representation plays a critical role in the learning process, leading to an improved performance of the event extraction model.

1.4 Organization of the research

This research article consists of several sections. Section 2 provides a brief report on the related and previous studies relevant to this research’s topic. Section 3 explains the details of tasks involved in biomedical event extraction. Section 4 presents the proposed method, which is the solution to the research problem. Section 5 describes the experimental results, analysis, and discussion. Finally, Section 6 presents the conclusion of this research.

2 Related works

The extraction of biomedical events has a long history. Prior research has heavily relied on feature design and domain knowledge. For instance, neural network methods and syntactic and lexical features were used to explore a dynamic extended tree containing two entities using long short-term memory (LSTM) [18]. The development of biomedical event extraction also involved the use of convolutional neural networks and dependency vectors as input features [19]. Additionally, hybrid deep neural networks have been used for trigger detection, RE, and event construction, showing that incorporating additional knowledge can enhance the performance of biomedical event extraction [19,20].

However, the event extraction models in these studies exhibited some limitations. The methods used focused heavily on the design of features used as input for the event extraction models. Additionally, the techniques applied in the biomedical event extraction models were overly centered on event trigger recognition and RE. Therefore, the results of the event extraction models in these studies may not effectively handle event types with dynamic arguments (e.g., Binding and Regulation).

Currently, biomedical event extraction approaches can be broadly categorized into two groups: the pipeline approach and the joint approach. The pipeline approach divides the task into several sub-tasks without considering their interactions. For example, some research used a pipeline approach based on support vector machines to extract biomedical events using the shortest path [21]. On the other hand, recent studies have adopted the joint approach to address this problem, using pairwise models, stacked models, and multi-task learning [10,22–24].

Joint multi-task learning has demonstrated superior model performance for extracting biomedical events by independently identifying event triggers and performing RE, then unifying the outputs of each task [10]. However, a drawback of this approach is that independently running tasks cannot provide information to each other during the event extraction process. To overcome these limitations, this research proposes a modification by applying a pipeline approach.

Nevertheless, there are certain drawbacks to a model that uses the pipeline approach. Models using the pipeline method tend not to respond to interactions between sub-tasks, even though they simplify the model. Conversely, models using the joint method focus on predicting triggers and arguments together but may not consider combinations of arguments. Additionally, some models improve performance by incorporating syntax information but not dependencies, even though dependency values contain valuable information for event extraction.

In comparison with these approaches (joint and pipeline), the model used in this study incorporates both the pipeline and joint methods by masking important entities to avoid overfitting during the event extraction process.

3 GENIA biomedical event representation

3.1 Event data structure

The provided data consist of biomedical texts, notations for each entity contained in the text, and notations for events and entities involved in the events. The event types used are limited to nine events (Table 1). The “Theme” of the event represents the entity that is the object of the event, while the “Cause” indicates the entity that influences the occurrence of the event. An example of the data format can be seen in Table 2: “T” represents the Trigger, which can be either the notation for a protein entity or the event trigger, and “E” represents the Event, which includes the biomedical event along with its theme, cause, and event type.

Table 1

Types of events used in GENIA event extraction [25]

Event type Argument
Gene expression Theme (Protein)
Transcription Theme (Protein)
Protein catabolism Theme (Protein)
Localization Theme (Protein)
Phosphorylation Theme (Protein)
Binding Theme (Protein)+
Regulation Theme (Protein/Event), Cause (Protein/Event)
Positive regulation Theme (Protein/Event), Cause (Protein/Event)
Negative regulation Theme (Protein/Event), Cause (Protein/Event)
Table 2

Example of GENIA event extraction data [20]

Biomedical text Annotation
This LPA-induced rapid phosphorylation of radixin was significantly suppressed in the presence of C3 toxin, a potent inhibitor of Rho. T1 (Protein, LPA)
T2 (Protein, radixin)
T3 (Protein, C3)
T4 (Protein, Rho)
T5 (Phosphorylation, phosphorylate)
T6 (Positive regulation, induce)
T7 (Negative regulation, suppress)
T8 (Negative regulation, inhibit)
E1 (Type: T5, Theme: T2)
E2 (Type: T6, Theme: E1, Cause: T1)
E3 (Type: T7, Theme: E1)
E4 (Type: T8, Theme: T4, Cause: T3)

Each event type described in Table 1 has different argument characteristics. The first five types of events (Gene expression, Transcription, Protein catabolism, Localization, and Phosphorylation) have only one argument, namely the theme. The event type “Binding” can have more than one theme argument. The last three event types (Regulation, Positive regulation, and Negative regulation) have a theme argument and can also include an optional cause argument. The event-type notations in the data are illustrated in Table 2: (1) “T1” to “T4” represent the protein notations that are entities in this data; (2) “T5” to “T8” represent the event trigger notations; and (3) “E1” to “E4” represent the relationship between event notations and protein notations or other events based on the argument pattern of each event-type. The use of theme and cause arguments in the last three event types allows them to be populated with protein entities or other events (nested events). An example of a nested event can be observed in “E2” and “E3,” where both events have another event (“E1”) as the value of the theme argument. The data usage, as described, involves identifying event types, theme arguments, and cause arguments if there are any values for the cause argument.
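The trigger/event structure of Table 2, including the nested-event reference from “E2” to “E1,” can be sketched as a small data structure (the class and field names are ours, not part of the GENIA format):

```python
# Sketch of the annotation structure in Table 2; names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trigger:
    tid: str    # "T1" ... "T8"
    etype: str  # "Protein" or an event type such as "Phosphorylation"
    text: str

@dataclass
class Event:
    eid: str                    # "E1" ...
    trigger: str                # trigger id, e.g. "T5"
    theme: str                  # a protein ("T2") or another event ("E1")
    cause: Optional[str] = None # optional; only the Regulation family uses it

# E1: Phosphorylation (T5) of radixin (T2).
e1 = Event("E1", trigger="T5", theme="T2")
# E2: Positive regulation (T6) whose theme is event E1 -> a nested event.
e2 = Event("E2", trigger="T6", theme="E1", cause="T1")

print(e2.theme)  # E1
```

The nesting is what makes the last three event types harder to extract: their theme and cause slots may point at other events rather than directly at proteins.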

3.2 Event structure labeling

The first step in the event representation process is the token labeling process. In this step, three types of labels are used: a label to indicate whether the token is a trigger event (annotated as “dependency” or “d”), another label to indicate the type of relationship between the event triggers (annotated as “relation” or “r”), and a label to indicate the position of the value (protein or another event) based on the argument pattern of the event type (annotated as “head” or “h”). An example of the resulting notation is shown in Figure 1.

Figure 1

Examples of biomedical event annotations on a biomedical sentence [10].

Each token in the sentence shown in Figure 1 has the three labels described above, which combine into a label of the form <d, r, h>. The “r” label is assigned a special code to simplify the notation process. For example, in Figure 1, the code “+Reg+1” indicates that “this protein entity or event has a relationship with the first +Regulation (Positive regulation) event located to its right (denoted by the + sign in “+1”).” If the event’s position is the first to the left, the position annotation is “−1.” Similarly, if the event’s position is the second to the right, the position annotation is “+2,” and so on. An example of an annotated token is <Expression, Theme, Reg-1>. In the case of multi-labels, the label annotations of each token are adjusted according to their order. For example, the word “activation” has two “h” labels, resulting in the labeling <+Regulation, [Theme, Cause], [Reg-1, Reg+1]> [10].
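A small helper (our own illustration, assuming the relative-position convention just described) shows how the per-token <d, r, h> strings are composed, including the bracketed multi-label case:

```python
# Sketch of composing the per-token label <d, r, h>; the helper is ours,
# not the authors' implementation.

def compose_label(d, r_labels, h_labels):
    """d: trigger type; r_labels/h_labels: parallel lists (multi-label aware)."""
    fmt = lambda xs: xs[0] if len(xs) == 1 else "[" + ", ".join(xs) + "]"
    return f"<{d}, {fmt(r_labels)}, {fmt(h_labels)}>"

# Single-label token:
print(compose_label("Expression", ["Theme"], ["Reg-1"]))
# -> <Expression, Theme, Reg-1>

# Multi-label token ("activation" with two "h" labels):
print(compose_label("+Regulation", ["Theme", "Cause"], ["Reg-1", "Reg+1"]))
# -> <+Regulation, [Theme, Cause], [Reg-1, Reg+1]>
```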

4 Proposed method

The proposed method builds upon the joint approach introduced in the study by Ramponi et al. [10], incorporating three key modifications. First, the architecture is transformed from the initial joint method to the pipeline method. This change allows for sequential training of two modules: the first module identifies event triggers, and the second module performs relationship extraction. By leveraging information from one task to the next, this pipeline approach enhances the overall performance of biomedical event extraction. The architectural design of the proposed model is illustrated in Figure 2.

Figure 2

Proposed model architecture.

Second, the event trigger identification task undergoes two adjustments. Instead of using the softmax decoder, the model now uses the sigmoid decoder to handle the multi-label characteristics of event trigger labels effectively. Additionally, an event-masking mechanism is introduced to prevent overfitting of event patterns based on specific words or word sets acting as event triggers. Detailed information about event masking is provided in Section 4.2.

Finally, the BERT encoder used in the original joint approach is replaced with BERT models that have been trained with biomedical text and domain-specific vocabulary. Specifically, SciBERT and PubMedBERT are selected due to their superior performance compared to the initial BioBERT model. These encoder substitutions lead to improved semantic word representations, further enhancing the performance of the event extraction model.

4.1 Pipeline method implementation

Applying the joint method to run event extraction provides a model that treats each task individually, without access to information from other tasks. However, effective biomedical event extraction requires using all available information to achieve excellent performance. Therefore, we propose a change in the decoder architecture, transitioning from the initial joint method to the pipeline method. In this approach, module one of the pipeline identifies event triggers, and module two performs relationship extraction. By adopting a sequential training approach and allowing information flow between tasks, the pipeline method ensures that relevant information from one task is used in subsequent tasks. Details of the architectural changes are illustrated in Figure 3.

Figure 3

Change of joint method to pipeline method.

Figure 3 illustrates the transition from the joint method used in the study by Ramponi et al. [10] to the pipeline method proposed in this study. In the joint method, each task is trained and executed independently, lacking access to information from other tasks. In contrast, for successful extraction of biomedical events, comprehensive information is essential. Therefore, we propose modifications to the event extraction model using the pipeline method, where each task is sequentially trained and used. The output of module one, the event trigger identification module, serves as a reference for event masking in module two, the relationship extraction module. Through gradual training and inter-task information exchange, each task can contribute additional information to enhance the overall event extraction performance.

4.2 Task decoder changes identify event triggers and event masking

Handling the event trigger identification task (label “d”) with multi-label characteristics is currently accomplished through multi-class classification using softmax decoders. However, this approach can lead to data imbalance, with single-label data dominating over multi-label data. To address this issue, we modify the decoder for the event trigger identification task by switching from softmax to sigmoid. This change enables the model to effectively handle the multi-label characteristics of event trigger labels. Additionally, we introduce an event-masking mechanism to prevent overfitting of event patterns based on specific words or word sets acting as event triggers. The event-masking system is detailed in Table 3.
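The difference between the two decoders can be sketched as follows (our own illustration with hypothetical logits; the actual model decodes per token over the full label set):

```python
# Illustrative sketch: softmax (multi-class) vs sigmoid (multi-label)
# decoding of one token's trigger logits. Scores are invented.
import math

def softmax_decode(logits, classes):
    """Multi-class: argmax of softmax -> exactly one trigger type."""
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return [classes[probs.index(max(probs))]]

def sigmoid_decode(logits, classes, threshold=0.5):
    """Multi-label: independent sigmoids -> every class above threshold."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return [c for c, p in zip(classes, probs) if p > threshold]

classes = ["Gene_expression", "Transcription", "Binding"]
logits = [2.1, 1.8, -3.0]  # hypothetical scores for one token
print(softmax_decode(logits, classes))   # ['Gene_expression']
print(sigmoid_decode(logits, classes))   # ['Gene_expression', 'Transcription']
```

The sigmoid decoder can emit both Gene_expression and Transcription for the same token, whereas the softmax decoder must pick one (or treat the pair as a rare combined class).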

Table 3

Event masking for module one output

Group of events Event label Event masking
Simple event (exactly one theme argument, whose value is a protein) Gene_expression $SVT$
Localization
Protein_catabolism
Phosphorylation
Transcription
Binding (one or more theme arguments, with protein values) Binding $BVT$
Complex event (a theme argument is required; a cause argument is optional; either can be a protein or another event) Regulation $EVT$
Positive_regulation
Negative_regulation
A combination of simple event and binding Example: Localization and binding $SBVT$
A combination of simple event and complex event Example: Localization and regulation $SEVT$
A combination of binding and complex event Example: Binding and regulation $BEVT$
A combination of simple event, binding, and complex event Example: Localization, binding, and regulation $SBEVT$

Table 3 explains the event-masking process based on the results from the first module of the pipeline, where all event triggers undergo event masking. The system is designed to handle events with similar argument patterns. For instance, events such as Gene expression, Transcription, Protein catabolism, Phosphorylation, and Localization, which share the same argument pattern (having only a theme argument), are event-masked as $SVT$. The same approach is applied to Binding events and to the Regulation, Positive regulation, and Negative regulation events. Events with multiple class values receive special masking based on the following rules: (1) if a token has more than one event label class value, all from the same event group, the masking corresponds to that event group; and (2) if a token has event label class values from different event groups, the masking corresponds to the join of the associated event groups. An example of event masking is provided in Table 4.

Table 4

Example of event masking

Text Recently, data suggest that Runx1 may also be involved in regulating Il17 transcription, functioning in complex with RORgammat to activate transcription (Zhang et al., 2008).
Annotation O O O O O Protein O O O O O Regulation Protein Transcription O O O O O Protein O O O O O O O O O O O
Masked text Recently, data suggest that $PROTEIN$ may also be involved in $EVT$ $PROTEIN$ $SVT$, functioning in complex with $PROTEIN$ to activate transcription (Zhang et al., 2008).

4.3 BERT encoder substitution

BioBERT, as an encoder, provides a suboptimal representation of word sequences because it was built from the original BERT and continued its training process with biomedical literature text data. Consequently, the vocabulary used in BioBERT is not specifically tailored to the biomedical domain. To address this limitation, we propose substituting the BERT encoder used in the initial joint approach with BERT models that have been trained with biomedical text and domain-specific vocabulary. Specifically, we explore SciBERT and PubMedBERT as alternatives to BioBERT, as they have demonstrated superior performance and vocabularies better aligned with the biomedical domain. Using these BERT models is expected to yield more effective semantic word representations, thereby enhancing the performance of the event extraction model. We conducted experiments to determine the BERT model that best provides performance values for our proposed method. A comparison of the vocabularies used in BioBERT, SciBERT, and PubMedBERT is shown in Table 5.

Table 5

Comparison of biomedical term vocabularies in BioBERT, SciBERT, and PubMedBERT [12]. BioBERT still uses BERT’s original vocabulary, whereas SciBERT and PubMedBERT built their vocabularies through training from scratch. The v symbol indicates that the biomedical term exists in the related BERT’s vocabulary

Biomedical term BioBERT SciBERT PubMedBERT
Diabetes v v v
Leukemia v v v
Lithium v v v
Insulin v v v
DNA v v v
Promoter v v v
Hypertension Hyper-tension v v
Nephropathy Ne-ph-rop-athy v v
Lymphoma L-ym-ph-oma v v
Lidocaine Lid-oca-ine v v
Oropharyngeal Oro-pha-ryn-ge-al Or-opharyngeal v
Cardiomyocyte Card-iom-yo-cy-te Cardiomy-ocyte v
Chloramphenicol Ch-lor-amp-hen-ico-l Chlor-amp-hen-icol v
RecA Rec-A Rec-A v
Acetyltransferase Ace-ty-lt-ran-sf-eras-e Acetyl-transferase v
Clonidine Cl-oni-dine Clon-idine v
Naloxone Na-lo-xon-e Nal-oxo-ne v

5 Experiments and analysis

A series of experiments in this study aimed to obtain a biomedical event extraction model with better performance than the study by Ramponi et al. [10] on the GENIA Event Extraction 2011 validation data. That research [10] reported a model performance of 65.04 on the F1-score evaluation metric. However, as no per-event performance details were provided, the event extraction model was replicated; the replicated model achieved an F1-score of 63.69. Therefore, this experiment’s main focus is to build a biomedical event extraction model that surpasses the replicated model’s F1-score of 63.69. At the end of the experiments, the test set was run on the best model, obtaining an F1-score of 58.48. Although this F1-score is relatively low, the precision surpassed that of previous work, reaching 69.88.
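For context, the recall implied by the reported precision and F1-score can be recovered from the definition F1 = 2PR/(P + R); this is purely arithmetic on the numbers above, not a figure reported by the study:

```python
# Derive the implied recall R from the reported precision P and F1-score,
# using F1 = 2PR / (P + R)  =>  R = F1 * P / (2P - F1).
P, F1 = 69.88, 58.48
R = F1 * P / (2 * P - F1)
print(round(R, 2))  # ~50.28
```

The high precision paired with a much lower implied recall is consistent with the discussion in Section 5.4, where the recall drop is identified as the cause of the lower test-set F1-score.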

The experimental scenario was carried out in several steps. The first step focused on using a pre-trained BERT model that performs better than BioBERT, with the aim of creating a superior encoder compared to previous studies. The second step involved modifying the decoder into a pipeline. The biomedical BERT models used were BioBERT v1.1 as the reference base, along with BioBERT v1.2, PubMedBERT, and SciBERT [12,13,26]. Based on the best modification results, hyperparameter tuning experiments were then conducted, with the hyperparameters determined from the analysis results. BioBERT was retrained because the modifications to the decoder architecture necessitated retraining to ensure fair and comparable experimental results.

5.1 Experiment results on module one of the pipeline

The substitution of the BERT encoder was carried out using the same configuration parameters (Figure 4) as the previous study, with 20 epochs in module one using a sigmoid decoder. The experimental results, compared with the performance obtained using a softmax decoder, are shown in Table 6.

Figure 4

Default parameters in research [10].

Table 6

Result of module one of the pipeline experiment (F1-score)

Encoder Sigmoid Softmax
BioBERT v1.1 52.62 58.45
BioBERT v1.2 45.37 62.60
SciBERT 47.02 70.28
PubMedBERT 50.28 62.72

Table 6 presents the experimental results, highlighting the decrease in performance when multi-label classification with sigmoid decoders is applied instead of multi-class classification with softmax decoders for predicting event triggers (label “d”) on tokens in module one. The average decrease, as measured by the F1-score evaluation metric, was 14.69 points across the BERT encoders. The shortcomings of treating the task as multi-class classification with a softmax decoder appear in the F1-score values for event trigger labels with multiple label values, which perform poorly. The limited number of tokens whose event triggers carry multiple label values contributes to this disparity: when the multi-label values are transformed into the multi-class perspective of a softmax decoder, each multi-label combination becomes a new event trigger class, resulting in imbalanced classes. However, the experimental results also reveal that the SciBERT encoder produced the best performance, achieving an F1-score of 70.28 with a softmax decoder. This superior performance can be attributed to SciBERT being trained from scratch with a biomedical-specific vocabulary, unlike BioBERT; both SciBERT and BioBERT are trained on biomedical literature texts, supplemented by text sources outside the biomedical domain. Consequently, using SciBERT with the default architecture and hyperparameters yields a performance 11.83 F1 points higher than BioBERT v1.1.
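The "average decrease of 14.69" can be reproduced directly from Table 6 (the values are copied from the table; the arithmetic is ours):

```python
# Reproduce the average sigmoid-vs-softmax F1 drop reported for Table 6.
softmax = {"BioBERT v1.1": 58.45, "BioBERT v1.2": 62.60,
           "SciBERT": 70.28, "PubMedBERT": 62.72}
sigmoid = {"BioBERT v1.1": 52.62, "BioBERT v1.2": 45.37,
           "SciBERT": 47.02, "PubMedBERT": 50.28}

drops = [softmax[k] - sigmoid[k] for k in softmax]
print(round(sum(drops) / len(drops), 2))  # 14.69
```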

5.2 Experimental results on module two of the pipeline

The second experiment focused on module two of the pipeline, which performs RE tasks. Similar to module one, the experimental scenario involved replacing the BERT encoder and using the same configuration parameters (Figure 4) as the previous study, with 20 epochs in module two using a sigmoid decoder. The experimental results were compared with the performance obtained using the joint method described in the study by Ramponi et al. [10].

Table 7 presents the experimental results, revealing a decrease in event extraction performance when the pipeline method is applied instead of the joint method. Using the F1-score evaluation metric, the average decrease across the BERT encoders was 15.18 points. Once again, the SciBERT encoder demonstrated the best performance, achieving an F1-score of 64.50 with the joint method. This superior performance can be attributed to SciBERT being trained from scratch with a biomedical-specific vocabulary, unlike BioBERT; both models are trained on biomedical literature texts, supplemented by text sources outside the biomedical domain. With the default architecture and hyperparameters, SciBERT yielded a performance 0.81 F1 points higher than BioBERT v1.1. Consequently, the biomedical event extraction model using the SciBERT encoder with the joint method was selected for further experiments involving hyperparameter tuning.

Table 7

Result of module two of the pipeline experiment (F1-score)

Encoder Joint Pipeline
BioBERT v1.1 63.69 51.07
BioBERT v1.2 63.12 48.78
SciBERT 64.50 49.24
PubMedBERT 59.19 40.74

5.3 Hyperparameter tuning experiment results

Based on the modifications made in the three previous steps (pipeline method implementation, decoder replacement for event trigger identification, and BERT encoder substitution), it was observed that modifying the BERT encoder led to improvements in model performance. This observation is supported by the experimental results presented in Tables 6 and 7. However, changing the decoder to a sigmoid activation did not improve performance, whereas using a softmax decoder with the SciBERT encoder increased the model performance values. Given these findings, the experiment focused on hyperparameter tuning specifically related to BERT.

To assist in tuning the hyperparameters, the model base used in the study by Ramponi et al. [10] was considered. The hyperparameters closely related to BERT are BERT dropout and mask probability [27]. BERT dropout is a dropout layer applied after embedding the BERT layer, while mask probability refers to the random masking mechanism (transforming tokens into “[MASK]”). Both hyperparameters are known to mitigate overfitting in the model during task execution. To determine the optimal hyperparameter values, a grid search strategy was employed using the hyperparameter space described in Table 8.

Table 8

Hyperparameter space for BERT dropout and mask probability

Hyperparameters Value space Number of experiments
BERT dropout 0.1, 0.2 2
Mask probability 0.1, 0.15, 0.2 3
Total experiments 6
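The grid search itself is mechanical: train and evaluate one model per point in the Cartesian product of the two value lists, then keep the best configuration by F1-score. A minimal sketch, where `train_and_eval` is a placeholder for the full training run:

```python
from itertools import product

# Hyperparameter space from Table 8.
BERT_DROPOUT = [0.1, 0.2]
MASK_PROBABILITY = [0.1, 0.15, 0.2]

def grid_search(train_and_eval):
    """Run one experiment per (dropout, mask probability) pair; return the best pair and all scores."""
    scores = {}
    for d, m in product(BERT_DROPOUT, MASK_PROBABILITY):
        scores[(d, m)] = train_and_eval(bert_dropout=d, mask_probability=m)
    best = max(scores, key=scores.get)
    return best, scores
```

Plugging the F1-scores later reported in Table 9 into `train_and_eval` recovers (0.1, 0.1) as the best configuration.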

Six scenarios were tested on top of the best model so far, i.e., the joint method, the softmax decoder for event trigger identification, and the SciBERT encoder, as specified in Table 6. The F1-score was used to evaluate the results, which are summarized in Table 9.

Table 9

Hyperparameter tuning experiment results

BERT dropout Mask probability F1-score
0.1 0.1 64.50
0.1 0.15 63.28
0.1 0.2 62.70
0.2 0.1 64.03
0.2 0.15 62.86
0.2 0.2 63.28

Table 9 shows the results of the grid search over the best model's hyperparameters. The best performance is obtained with the default values, namely a BERT dropout of 0.1 and a mask probability of 0.1, giving an F1-score of 64.50; hyperparameter tuning therefore produced no significant change. With a BERT dropout of 0.1, performance degrades as the mask probability increases. A likely explanation is that a higher mask probability, combined with the protein-masking and event-masking schemes already applied to the input, leaves too few informative tokens for the model to learn the sentence semantics. The same trend holds with a BERT dropout of 0.2, except that performance recovers slightly at a mask probability of 0.2, possibly because the input masking and the dropout on the encoder output then regularize the model in a more balanced way.

5.4 Best result model from across experiments

Based on the experiment scenarios that were run, the biomedical event extraction model was successfully built with a higher performance than the replicated model [10]: an F1-score of 64.50 (Table 10), an improvement of 0.81 points over the previous model. Table 11 compares the best model and the replicated model [10] on the validation split of the GENIA Event Extraction data, and Table 12 compares the best model with previous work on the GENIA 2011 test set. Our model achieves the highest precision among the previous works, but its lower recall results in a lower F1-score. The detailed test-set performance is shown in Table 13.

Table 10

Best event extraction model performance with evaluation set

Event Recall Precision F1-score
Transcription 63.92 81.45 71.63
Protein catabolism 86.96 90.91 88.89
Phosphorylation 75.68 95.45 84.42
Localization 86.57 89.23 87.88
SVT-TOTAL 77.62 88.92 82.89
Binding 49.06 65.12 55.96
Regulation 39.38 55.02 45.91
Positive regulation 47.75 64.46 54.86
Negative regulation 49.04 66.76 56.55
EVT-TOTAL 46.71 63.55 53.84
ALL-TOTAL 57.54 73.37 64.50
Table 11

Comparison of the best model with baseline research model performance

Event Baseline model [10] Best model
Gene expression 82.71 84.26
Transcription 66.90 71.63
Protein catabolism 80.95 88.89
Phosphorylation 89.32 84.42
Localization 82.17 87.88
SVT-TOTAL 81.10 82.89
Binding 57.19 55.96
Regulation 46.37 45.91
Positive regulation 53.52 54.86
Negative regulation 57.42 56.55
EVT-TOTAL 53.41 53.84
ALL-TOTAL 63.69 64.50
Table 12

Comparison of the best model and previous work with GENIA 2011 test set

Works Method Precision Recall F1-score
Björne and Salakoski [19] TEES-CNN 69.45 49.94 58.10
Li et al. [28] KB-driven Tree-LSTM pipeline 67.01 52.14 58.65
Wang et al. [29] QA with BERT 59.33 57.37 58.33
Huang et al. [30] GEANet with SciBERT 64.61 56.11 60.06
Ramponi et al. [10] Multi-task BEESL with BioBERT 69.72 53.00 60.22
Trieu et al. [31] DeepEventMine with SciBERT 71.71 56.20 63.02
Zhao et al. [32] HANK(K=2) with GloVe 71.73 53.21 61.10
Su et al. [16] GCN on Dependency Tree with SciBERT 65.66 60.85 63.15
Our best model 69.88 50.28 58.48
Table 13

Best event extraction model performance with test set

Event Recall Precision F1-score
Gene expression 73.55 88.48 80.33
Transcription 67.82 70.24 69.01
Protein catabolism 66.67 90.91 76.92
Phosphorylation 83.24 94.48 88.51
Localization 61.26 83.57 70.69
SVT-TOTAL 72.50 86.39 78.83
Binding 39.10 63.58 48.42
Regulation 32.21 52.77 40.00
Positive regulation 38.60 59.70 46.89
Negative regulation 40.63 54.98 46.73
EVT-TOTAL 38.06 57.42 45.78
ALL-TOTAL 50.28 69.88 58.48
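As a quick consistency check on the totals above, the F1-score is the harmonic mean of precision and recall, and the reported ALL-TOTAL rows satisfy it:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    return 2 * precision * recall / (precision + recall)

# ALL-TOTAL rows: test set (Table 13) and validation set (Table 10).
print(round(f1(69.88, 50.28), 2))  # prints 58.48
print(round(f1(73.37, 57.54), 2))  # prints 64.5
```

The harmonic mean is dominated by the smaller operand, which is why the high test-set precision (69.88) cannot compensate for the low recall (50.28).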

5.5 Best result model error and analysis

The successfully built biomedical event extraction model exhibits several error patterns in both the first and second modules of the pipeline. Some of these errors are consistent with those observed in the model from the study by Ramponi et al. [10], including: (1) under-prediction, where the event trigger that should have been identified is missed; (2) over-prediction, where a non-existing event trigger is falsely identified; and (3) wrong-prediction, where an incorrect event trigger is identified. However, the proposed method introduces additional errors. Notably, the first module of the pipeline tends to produce event trigger patterns that do not exist. This phenomenon is likely due to the sigmoid decoder’s broader output range, which allows for more variations than softmax. An illustrative example of this error is presented in Table 14.

Table 14

Examples of disadvantages of a sigmoid decoder in positive regulation and regulation event trigger

Text To investigate whether Runx-mediated induction of Foxp3 is dependent on the expression of CBFbeta, naive CD4+ CD8- T cells from CbfbF/F CD4-cre and control CbfbF/+ CD4-cre mice were stimulated with anti-CD3/28 mAbs, retinoic acid, and increasing concentrations of TGF-beta
Input (Protein Masked) To investigate whether $PROTEIN$ – mediated induction of $PROTEIN$ is dependent on the expression of $PROTEIN$, naive $PROTEIN$ + $PROTEIN$ – T cells from $PROTEIN$ F/F $PROTEIN$ and control $PROTEIN$ F/+ $PROTEIN$ mice were stimulated with anti- $PROTEIN$ / $PROTEIN$ mAbs, retinoic acid, and increasing concentrations of $PROTEIN$.
Ground Truth O O O Protein O Positive_regulation Gene_expression O Protein O Positive_regulation O O Gene_expression O Protein O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
Identification Result O O O Protein O Positive_regulation Gene_expression O Protein O Positive_regulation////Regulation O O Gene_expression O Protein O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O

In the training and validation data, no token is ever labeled with both the positive regulation and regulation event triggers. However, because the sigmoid decoder scores each label independently, such unseen event trigger combinations can still appear in its output.
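The mechanism behind this can be sketched directly: a softmax decoder commits to exactly one label per token, whereas a sigmoid decoder thresholds every label independently, so a token can receive two event trigger labels that never co-occur in the training data. A minimal sketch, in which the label set, the logits, and the 0.5 threshold are all illustrative:

```python
import math

# Illustrative label subset and decision threshold (not the full GENIA tag set).
LABELS = ["O", "Positive_regulation", "Regulation", "Gene_expression"]

def softmax_decode(logits):
    """Single-label decoding: commit to the one highest-scoring label."""
    return LABELS[max(range(len(logits)), key=logits.__getitem__)]

def sigmoid_decode(logits, threshold=0.5):
    """Multi-label decoding: keep every label whose sigmoid score clears the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    chosen = [lab for lab, p in zip(LABELS, probs) if p >= threshold]
    return chosen or ["O"]

# Logits where two event labels are both confident:
logits = [-2.0, 1.5, 1.2, -3.0]
softmax_decode(logits)   # → "Positive_regulation"
sigmoid_decode(logits)   # → ["Positive_regulation", "Regulation"]
```

Nothing in the sigmoid decoder constrains the combination of labels it emits, which is exactly the failure mode illustrated in Table 14.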

The deficiencies observed in the second module of the pipeline can be attributed partly to errors in the first module, as inaccuracies in its output propagate to the second module. Errors in the second module vary by event type, since each event type entails distinct argument patterns, but four recurring patterns emerge. First, relation extraction (RE) may fail when the relevant protein entity does not appear in the same sentence; because the model receives only a single sentence of the biomedical text as input, such cross-sentence arguments cannot be recovered. Second, the model may select the wrong protein entity or event as the value of an event argument; argument positions vary freely within the sentence, so resolving this requires a deeper semantic understanding. Third, the model may mispredict the number of themes of binding events, which can take a varying number of theme arguments (typically one or two); a more comprehensive investigation is warranted to predict binding event themes accurately. Finally, the model may overlook cause arguments in complex events: cause arguments are optional and can themselves take events as values, so many complex events whose ground truth involves an event-valued cause fail to be extracted.

6 Conclusion

Based on the conducted experiments, some modifications to the biomedical event extraction model yielded improvements while others did not. Replacing the joint method with the pipeline method and substituting a sigmoid decoder for softmax in the event trigger identification task did not improve on the performance reported in research [10]; the sigmoid decoder improved the event trigger identification subtask itself, but not the overall event extraction performance. Replacing the BERT encoder with a biomedical-specific one, however, did improve performance. The SciBERT encoder performed best at identifying event triggers and dependency labels, reaching an F1-score of 70.28. Compared with the replicated model from the previous study, the final model, using the joint method, a softmax decoder for event trigger identification, and the SciBERT encoder, performed better, obtaining an F1-score of 64.50 on the validation set and a higher precision of 69.88 on the test set.

Acknowledgments

We would like to express our heartfelt gratitude to the people in the School of Electrical Engineering and Informatics, Bandung Institute of Technology for their support of this research.

  1. Funding information: Communication of this research is made possible through monetary assistance from the School of Electrical Engineering and Informatics, Bandung Institute of Technology via Publication Fund.

  2. Author contributions: The authors declare that they contributed equally to the conception and design of the study.

  3. Conflict of interest: The authors have no conflicts of interest to declare. The authors certify that the submission is an original work and is not under review at any other publication. All authors have seen and agree with the manuscript’s contents.

  4. Data availability statement: The dataset used for this research is available online and has a proper citation within the article’s contents.

References

[1] Ji H. Encyclopedia of database systems. Boston: Springer US; 2009.

[2] Allen JF. Encyclopedia of computer science. GBR: John Wiley and Sons Ltd.; 2003.

[3] Cohen KB, Demner-Fushman D. Biomedical natural language processing. Philadelphia: John Benjamins Publishing Company; 2014.

[4] Erhardt RA-A, Schneider R, Blaschke C. Status of text-mining techniques applied to biomedical text. Drug Discovery Today. 2006;11:315–25. 10.1016/j.drudis.2006.02.011.

[5] Liu J, Min L, Huang X. An overview of event extraction and its applications. arXiv:2111.03212 [cs.CL]. 2021.

[6] Xiang W, Wang B. A survey of event extraction from text. IEEE Access. 2019;7:173111–37. 10.1109/ACCESS.2019.2956831.

[7] Björne J, Ginter F, Pyysalo S, Tsujii J, Salakoski T. Complex event extraction at PubMed scale. Bioinformatics. 2010;26(12):i382–90. 10.1093/bioinformatics/btq180.

[8] Li Q, Li J, Sheng J, Cui S, Wu J, Hei Y, et al. A survey on deep learning event extraction: approaches and applications. IEEE Trans Neural Networks Learn Syst. 2022;1–21. 10.1109/TNNLS.2022.3213168.

[9] Zhan L, Jiang X. Proceedings of the 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference, Mar. 15–17, 2019. Chengdu, China: IEEE; 2019. p. 2121. 10.1109/ITNEC.2019.8729158.

[10] Ramponi A, van der Goot R, Lombardo R, Plank B. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Nov. 16–20, 2020. Stroudsburg: Association for Computational Linguistics; 2020. p. 5357. 10.18653/v1/2020.emnlp-main.431.

[11] Jurafsky D, Martin JH. Speech and language processing. 2nd ed. New Jersey: Prentice Hall; 2014.

[12] Beltagy I, Lo K, Cohan A. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Nov. 3–7, 2019, Hong Kong, China. Stroudsburg: Association for Computational Linguistics; 2019. p. 3613.

[13] Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc. 2021;3(1):1–23. 10.1145/3458754.

[14] Read J, Pfahringer B, Holmes G, Frank E. Classifier chains for multi-label classification. Mach Learn. 2009;254–69. 10.1007/s10994-011-5256-5.

[15] Chen Q, Allot A, Leaman R, Islamaj R, Du J, Fang L, et al. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database. 2022;2022:baac089. 10.1093/database/baac089.

[16] Su F, Zhang Y, Li F, Ji D. Balancing precision and recall for neural biomedical event extraction. IEEE/ACM Trans Audio Speech Language Process. 2022;30:1637–49. 10.1109/TASLP.2022.3161146.

[17] Chen T, Wu M, Li H. A general approach for improving deep learning-based medical relation extraction using a pre-trained model and fine-tuning. Database. 2019;2019:baz116. 10.1093/database/baz116.

[18] Li L, Zheng J, Wan J, Huang D, Lin X. 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec. 15–18, 2016. Shenzhen, China: IEEE; 2016. p. 739.

[19] Björne J, Salakoski T. Proceedings of the BioNLP 2018 Workshop, Jul. 19, 2018, Melbourne, Australia. Stroudsburg: Association for Computational Linguistics; 2018. p. 98.

[20] Rao S, Marcu D, Knight K, Iii HD. Proceedings of the BioNLP 2017 Workshop, Aug. 4, 2017, Vancouver, Canada. Stroudsburg: Association for Computational Linguistics; 2017. p. 126. 10.18653/v1/W17-2315.

[21] Miwa M, Thompson P, Ananiadou S. Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics. 2012;28(13):1759–65. 10.1093/bioinformatics/bts237.

[22] Liu X, Bordes A, Grandvalet Y. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Apr. 26–30, 2014, Gothenburg, Sweden. Stroudsburg: Association for Computational Linguistics; 2014. p. 692.

[23] Riedel S, McClosky D, Surdeanu M, McCallum A, Manning CD. Proceedings of BioNLP Shared Task 2011 Workshop, Jun. 24, 2011, Portland, Oregon, USA. Stroudsburg: Association for Computational Linguistics; 2011. p. 51.

[24] Vlachos A, Craven M. Proceedings of BioNLP Shared Task 2011 Workshop, Jun. 24, 2011, Portland, Oregon, USA. Stroudsburg: Association for Computational Linguistics; 2011. p. 36.

[25] Kim J-D, Wang Y, Takagi T, Yonezawa A. Proceedings of BioNLP Shared Task 2011 Workshop, Jun. 24, 2011, Portland, Oregon, USA. Stroudsburg: Association for Computational Linguistics; 2011. p. 7.

[26] Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36:1234–40. 10.1093/bioinformatics/btz682.

[27] Kondratyuk D, Straka M. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Nov. 3–7, 2019, Hong Kong, China. Stroudsburg: Association for Computational Linguistics; 2019. p. 2779–95.

[28] Li D, Huang L, Ji H, Han J. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jun. 2–7, 2019, Minneapolis, Minnesota. Stroudsburg: Association for Computational Linguistics; 2019. p. 1421.

[29] Wang DX, Weber L, Ulf L. Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, Nov. 20, 2020, Online. Stroudsburg: Association for Computational Linguistics; 2020. p. 88.

[30] Huang KH, Yang M, Peng N. Findings of the Association for Computational Linguistics: EMNLP, Nov. 16–20, 2020, Online. Stroudsburg: Association for Computational Linguistics; 2020. p. 1277. 10.18653/v1/2020.findings-emnlp.114.

[31] Trieu H-L, Tran TT, Duong KNA, Nguyen A, Miwa M, Ananiadou S. DeepEventMine: end-to-end neural nested event extraction from biomedical texts. Bioinformatics. 2020;36(19):4910–7. 10.1093/bioinformatics/btaa540.

[32] Zhao W, Zhao Y, Jiang X, He T, Liu F, Li N. Efficient multiple biomedical events extraction via reinforcement learning. Bioinformatics. 2021;37(13):1891–9. 10.1093/bioinformatics/btab024.

Received: 2023-02-13
Revised: 2023-09-30
Accepted: 2023-11-04
Published Online: 2023-12-31

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
