Article Open Access

Application and research of English composition tangent model based on unsupervised semantic space

Rihong Tang
Published/Copyright: May 22, 2024

Abstract

Nowadays, major enterprises and schools vigorously promote the combination of information technology and subject teaching, and automatic grading technology is among the most widely used applications. To improve the efficiency of English composition correction, this study proposes an unsupervised semantic space model for English composition tangent (topic-relevance) analysis: a Hierarchical Topic Tree Hybrid Semantic Space is used to achieve topic representation and clustering in English compositions; a feature dimensionality reduction method selects a set of semantic features to optimize the feature semantic space; and a tangent analysis algorithm is combined with these components to achieve intelligent scoring of English compositions. The experimental data show that the accuracy and F-value of the semantic space-based English composition tangent analysis method are significantly improved, and the Pearson correlation coefficient between the unsupervised semantic space English composition tangent model and teachers' manual grading is 0.8936. The results show that the unsupervised semantic space English composition tangent model has a higher accuracy rate, is more applicable, and can efficiently complete the English composition review task.

1 Introduction

As the popularity of English has increased, the quality of English teaching has gradually improved and students' English proficiency has also risen significantly [1]. Writing occupies an important place in the English learning process, reflecting students' ability to use language and express themselves in writing. Students' English compositions reflect the extent to which they have mastered basic vocabulary and grammar, as well as their overall command of sentence and paragraph structure and the logic of the text [2,3]. English composition writing is common in subject examinations, the College English Test Band 4 and Band 6, and major English competitions, and is a comprehensive reflection of students' English writing ability. Many experts and scholars agree that writing is an important way of assessing students' ability to use language. However, the current traditional English teaching model faces many problems: the large number of students results in a heavy essay-marking workload, and overburdened teachers cannot provide timely feedback, so students' writing skills cannot be improved quickly and efficiently [4]. To address these problems, researchers have proposed automated English grading systems that combine technologies such as natural language processing and machine learning to reduce teachers' physical and mental workload [5,6]. However, this field still faces challenges, such as how to accurately evaluate students' writing abilities and how to handle complex language structures and semantics. Therefore, this study designed a semantic space English composition topic analysis model based on unsupervised methods. Its main innovation lies in using relational triples as the carrier for topic clustering and distributed representation, and in proposing a topic (tangent) analysis algorithm, a topic coherence algorithm, and a topic viewpoint algorithm for multi-dimensional analysis of English composition content. Experimental verification shows that the model achieves high accuracy and application value in scoring topic relevance. The main contribution of the research is a method for constructing a hierarchical topic tree (HTT) hybrid semantic space, which extends topic relevance analysis from shallow semantic analysis to latent topic semantic analysis and improves the accuracy and granularity of topic relevance semantic analysis.

2 Related works

2.1 Semantic space analysis

Semantic space is an important concept in the study of communication effectiveness, and a prerequisite for communication is a common semantic space between the transmitter and the receiver. Xiao et al. addressed the problem of low accuracy in scene migration by proposing a simulated realistic decision model based on a feature semantic space, combining environment representation, policy optimization, and intelligent decision modules to narrow the gap between real and virtual scenes; the method was verified to have good stability in practical applications [7]. To optimize the semantic space of text, Kherwa et al. adopted a three-level weight model to distribute the weights of terms, documents, and corpora, thereby achieving semantic similarity generation and context clustering [8]. Yu's research group addressed the neural style transfer problem of convolutional neural networks by proposing a multi-scale style transfer algorithm based on deep semantic matching, combining spatial segmentation and contextual illumination information to construct a deep semantic space and using a nearest-neighbour-search loss function to optimize deep style transfer; their experimental results show that the algorithm synthesizes images with a more reasonable spatial structure [9]. To optimize semantic space conversion methods, Yu proposed an automatic bibliographic classification algorithm based on semantic space transformation, combining text preprocessing and word vectors to achieve automatic classification of bibliographic semantic vectors; experimental data show that the accuracy of this algorithm is higher than that of traditional classification algorithms [10]. Orhan et al. proposed an embedding method that learns word vectors by weighting semantic relations, finding the best weight for each relation and adjusting the Euclidean distance to obtain word vectors for synonym sets; experimental results show that the method can find word-level semantic similarities and weights [11].

2.2 Automatic scoring technology

With the development of the times and the progress of technology, major enterprises and schools vigorously promote the combination of information technology and subject teaching, forming the concept of "Internet + education", in which automatic scoring technology is commonly applied. Zhao used an automatic English scoring algorithm and an English sentence feature scoring algorithm to achieve intelligent online scoring and elegant-sentence extraction for English compositions, and the experimental results show that the algorithm can reduce the workload of scoring English essays [12]. Wang et al. proposed an improved P-means-based automatic scoring algorithm for Chinese fill-in-the-blank questions, combining semantic lexicon matching and semantic similarity calculation to build an automatic scoring framework and using the improved P-means model to generate standard-answer and sentence vectors and calculate semantic similarity; the experimental data show that the highest accuracy rate of the algorithm is 94.3% [13]. Yuan's research group proposed an automatic essay scoring system based on a linear regression machine learning algorithm, combining linguistic features in a multiple-regression approach to complete essay evaluation and model performance analysis [14]. Xia et al. proposed an automatic essay scoring model based on a neural network structure comprising a long short-term memory layer and an attention mechanism layer, using pre-trained word vectors; the experimental data showed that the quadratic weighted kappa coefficient of this model outperformed a bidirectional long short-term memory model [15]. Saihanqiqige proposed a multi-model fusion algorithm based on word vectors to improve the accuracy of English proficiency assessment, combining a word-vector clustering text representation method and a word-vector space model to complete the automatic scoring of English compositions [16].

In summary, researchers have designed various algorithms and models for semantic space analysis and automatic scoring, but their accuracy can still be improved. Therefore, the study proposes an unsupervised semantic space-based English composition tangent model, which is expected to improve the accuracy of automatic English composition scoring and reduce teachers' workload.

3 Designing a model for English composition tangents in unsupervised semantic space

3.1 Unsupervised semantic space and tangent analysis algorithm design

The unsupervised semantic space, namely, the Hierarchical Topic Tree Hybrid Semantic Space (HTTHSS), is proposed to implement topic representation and clustering in English compositions. The HTTHSS is mainly composed of a relational-triple hierarchical topic (HTP) model, a group of distributed vectors for topic relations, and a knowledge-base-based topic semantic space [17]. The single-word units in the HTP model are replaced by relational triples so that both the semantics and the structural components of a sentence are represented at the topic level, and Figure 1 illustrates the relational-triple hierarchical tree topic model.

Figure 1: Relational triple hierarchical tree theme model.

In Figure 1, γ is the hyperparameter controlling the probability of opening a new path, T is the set of infinite paths in the nested restaurant process defined over phrase adjacency distances, wpc_i is the assignment of the i-th topic relational triple, θ_m is the probability distribution of topics for the relational triples of text m, α is the hyperparameter of the Dirichlet distribution that θ_m obeys, z_{m,n} is a latent topic, ω_{m,n} is a relational triple extracted from the text, M is the number of texts, N is the number of relational triples, the number of topics is not fixed in advance, and a relational triple is written in the form (S, R, O). The restaurant process nested under the improved word-group adjacency distance is defined as shown in the following equation:

(1) $p(wpc_i = j \mid D, \alpha) \propto \begin{cases} f(d_{ij}), & i \neq j \\ \alpha, & i = j \end{cases}$

where d_{ij} denotes the adjacency distance between the i-th and j-th topic relational triples, D is the set of all topic relational triples in the English composition, and f is the decay function. The current allocation of a topic relational triple is affected only by the adjacency distance between topic triples, and the decay function is introduced to regulate the relationship between distance and the random distribution [18]. The study uses the Gibbs sampling algorithm to implement topic sampling in the relational-triple hierarchical tree model, and the conditional distribution for each latent topic variable is shown in the following equation:

(2) $p(c_{m,l}^{\mathrm{new}} \mid c_{m,l}, w, \eta) \propto p(c_{m,l}^{\mathrm{new}} \mid D, \alpha)\, p(w \mid z(c_{m,l} \cup c_{m,l}^{\mathrm{new}})_{m,n}, H_0)$

where c_{m,l}^{new} is the newly sampled path, z(c)_{m,n} is the relational triple of a latent topic, η = {D, α, f, H_0} is the set of model hyperparameters, H_0 is the base distribution obeying the Dirichlet distribution, w is the observed topic relational triple, p(c_{m,l}^{new} | D, α) is the prior distribution, p(w | z(c_{m,l} ∪ c_{m,l}^{new})_{m,n}) is the probability of the observed topic values, and l is the number of topics. The relational-triple hierarchical tree topic model converts the text semantic space from high dimensional to low dimensional; each word in the relational triple (S, R, O) is represented as a vector, and the n-dimensional distributed vectors are calculated as shown in the following equation:

(3) $\begin{aligned} \mathrm{vec}(S) &= [s_1, s_2, \ldots, s_n] \\ \mathrm{vec}(R) &= [r_1, r_2, \ldots, r_n] \\ \mathrm{vec}(O) &= [o_1, o_2, \ldots, o_n] \\ \mathrm{vec}(R_{\mathrm{triad}}) &= \lambda_1\,\mathrm{vec}(S) + \lambda_2\,\mathrm{vec}(R) + \lambda_3\,\mathrm{vec}(O) \end{aligned}$

where λ_1, λ_2, and λ_3 are the subject, relation, and object hyperparameters, respectively, and their sum is 1. In this research, the set of relational triples is {T_1, T_2, …, T_L}, the corresponding candidate triples are K = {K_1, K_2, …, K_M}, and a new candidate topic set K^(new) = {K_1, K_2, …, K_N} is generated iteratively. The study embeds distributed vectors and trains an HTP model with three parameters: the topic smoothing distribution, the predicted number of clusters of topic relational triples, and the smoothing distribution of relational triples. The richness of semantics means that the same meaning can be expressed with different words and that the same word can correspond to different meanings. To solve the problems of lexical mismatch and linguistic ambiguity, the study uses feature dimensionality reduction to select a set of semantic features and optimize the feature semantic space. The feature selection algorithm is an important part of the dimensionality reduction method: it extracts the set of features that best reflects the model category and improves the efficiency of text classification. The study applies both local and global dimensionality reduction methods to the construction of the semantic space, and a mutual information algorithm is used in the tangent model for topic coherence and viewpoint feature selection; the pointwise mutual information value PMI is calculated as shown in the following equation:

(4) $\mathrm{PMI}(\omega_i) = \sum_{j=1}^{N-1} \log \frac{P(\omega_i, \omega_j)}{P(\omega_i)\,P(\omega_j)}$

where ω is the list of the top N words of a topic, ω_i and ω_j are the i-th and j-th words in that list, P(ω_i, ω_j) is the probability that the words ω_i and ω_j co-occur, and P(ω_i) and P(ω_j) are the probabilities that ω_i and ω_j occur, respectively. The study uses the distributed vectors of relational triples as the premise of the tangent analysis and analyses the semantic similarity between sentences and the composition topic, between sentences and the topic of the paragraph they belong to, between paragraphs and the composition topic, and between the full text and the composition topic; the flow of the tangent analysis algorithm is shown in Figure 2.

Figure 2: Process of relevance analysis algorithm.
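As an illustration of Eq. (4), the sketch below computes the mutual-information score for the top-N words of a topic from simple co-occurrence counts. It is a minimal sketch under the assumption that word and co-occurrence frequencies have already been counted over text windows; it is not the study's implementation.

```python
import math
from itertools import combinations

def topic_pmi(top_words, word_count, cooc_count, total_windows):
    """Sum of log P(wi, wj) / (P(wi) * P(wj)) over word pairs of one topic list.

    top_words     -- list of the top-N words of a topic
    word_count    -- dict: word -> number of windows containing the word
    cooc_count    -- dict: frozenset({wi, wj}) -> number of windows containing both
    total_windows -- total number of text windows used for counting
    """
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        p_i = word_count.get(wi, 0) / total_windows
        p_j = word_count.get(wj, 0) / total_windows
        p_ij = cooc_count.get(frozenset((wi, wj)), 0) / total_windows
        if p_i > 0 and p_j > 0 and p_ij > 0:  # skip pairs never observed together
            score += math.log(p_ij / (p_i * p_j))
    return score
```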

The HTTHSS represents an English composition topic as a distributed vector of topic relational triples. Suppose the topic contains i topic relational triples; the distributed vector of each topic relational triple is shown in the following equation:

(5) $T(S_i, R_i, O_i) = [\lambda_1 s_{1,i} + \lambda_2 r_{1,i} + \lambda_3 o_{1,i},\ \lambda_1 s_{2,i} + \lambda_2 r_{2,i} + \lambda_3 o_{2,i},\ \ldots,\ \lambda_1 s_{n,i} + \lambda_2 r_{n,i} + \lambda_3 o_{n,i}]$

where λ_1, λ_2, and λ_3 are the hyperparameters and λ_1 + λ_2 + λ_3 = 1. The distributed vector T_title of the English composition topic, the distributed vector S of a sentence's topic relational triples, the distributed vector P of a paragraph's topic relational triples, and the distributed vector C of the full text of the English composition are given by the following equation:

(6) $\begin{aligned} T_{\mathrm{title}} &= T(S_1,R_1,O_1) + T(S_2,R_2,O_2) + \cdots + T(S_i,R_i,O_i) \\ S &= T(S_1,R_1,O_1) + T(S_2,R_2,O_2) + \cdots + T(S_j,R_j,O_j) \\ P &= T(S_1,R_1,O_1) + T(S_2,R_2,O_2) + \cdots + T(S_k,R_k,O_k) \\ C &= T(S_1,R_1,O_1) + T(S_2,R_2,O_2) + \cdots + T(S_m,R_m,O_m) \end{aligned}$

where i , j , k , and m represent the dimensions of the four relational triadic distributed vectors, and then, the semantic similarity between English composition sentences and composition topics, sentences and composition paragraphs, paragraphs and topics, and full text and topics is calculated as shown in the following equation:

(7) $\begin{aligned} \cos\theta_{ST} &= \frac{\sum_{i=1}^{n}(s_i-\mu_s)(t_i-\mu_t)}{\sqrt{\sum_{i=1}^{n}(s_i-\mu_s)^2}\,\sqrt{\sum_{i=1}^{n}(t_i-\mu_t)^2}} \\ \cos\theta_{SP} &= \frac{\sum_{i=1}^{n}(s_i-\mu_s)(p_i-\mu_p)}{\sqrt{\sum_{i=1}^{n}(s_i-\mu_s)^2}\,\sqrt{\sum_{i=1}^{n}(p_i-\mu_p)^2}} \\ \cos\theta_{PT} &= \frac{\sum_{i=1}^{n}(p_i-\mu_p)(t_i-\mu_t)}{\sqrt{\sum_{i=1}^{n}(p_i-\mu_p)^2}\,\sqrt{\sum_{i=1}^{n}(t_i-\mu_t)^2}} \\ \cos\theta_{CT} &= \frac{\sum_{i=1}^{n}(c_i-\mu_c)(t_i-\mu_t)}{\sqrt{\sum_{i=1}^{n}(c_i-\mu_c)^2}\,\sqrt{\sum_{i=1}^{n}(t_i-\mu_t)^2}} \end{aligned}$

where n is the vector dimension, μ_s is the mean of the sentence distributed vector, μ_t is the mean of the topic distributed vector, μ_p is the mean of the paragraph distributed vector, and μ_c is the mean of the full-text distributed vector. δ_1 and δ_2 weight the semantic similarity between the sentence and the topic and between the sentence and its paragraph, respectively. The final sentence tangent similarity is calculated as shown in the following equation:

(8) $\cos\theta_{\text{in-topic}} = \delta_1 \cos\theta_{ST} + \delta_2 \cos\theta_{SP}$

where cos θ_in-topic is the sentence's on-topic (tangent) semantic similarity; the similarities are ranked, and a threshold is set for extracting on-topic sentences.
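To make the computation concrete, the minimal sketch below (assuming NumPy and pre-trained word vectors for the subject, relation, and object terms) composes a relational-triple vector as in Eqs. (3) and (5), computes the mean-centred cosine similarity of Eq. (7), and combines the sentence-topic and sentence-paragraph similarities as in Eq. (8). The λ and δ weight values are illustrative placeholders, not the values tuned in the study.

```python
import numpy as np

def triple_vector(vec_s, vec_r, vec_o, lambdas=(0.4, 0.3, 0.3)):
    """Eqs. (3)/(5): weighted sum of subject, relation and object vectors (weights sum to 1)."""
    l1, l2, l3 = lambdas
    return l1 * np.asarray(vec_s) + l2 * np.asarray(vec_r) + l3 * np.asarray(vec_o)

def centred_cosine(u, v):
    """Eq. (7): cosine similarity of mean-centred vectors."""
    u = np.asarray(u, dtype=float) - np.mean(u)
    v = np.asarray(v, dtype=float) - np.mean(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sentence_on_topic(sent_vec, title_vec, para_vec, delta1=0.6, delta2=0.4):
    """Eq. (8): weighted combination of sentence-title and sentence-paragraph similarity."""
    return delta1 * centred_cosine(sent_vec, title_vec) + delta2 * centred_cosine(sent_vec, para_vec)
```

Sentences whose combined score falls below the chosen extraction threshold would then be treated as off-topic candidates.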

3.2 Design of English composition tangent model in unsupervised semantic space

The English composition tangent model in unsupervised semantic space generates the tangent analysis results through English composition pre-processing, semantic space construction, tangent analysis, and tangent quality analysis; the processing flow of the model follows the procedure presented in the figure in ref. [19].

The basic step of text analysis is pre-processing, which plays an important role in the subsequent semantic analysis. Using natural language processing tools, special-character filtering, segmentation, sentence splitting, word splitting, and information extraction are completed [19]. Special characters and Chinese characters that appear during English writing can affect the segmentation and the sentence and word splitting of English compositions. The study establishes a special-character set for English composition writing and filters special characters with regular expressions; the filtered compositions are then sliced from the global level to the local level to obtain paragraphs, sentences, and words. A stop-word list is then used to delete words that do not affect sentence semantics, and the remaining words are converted to their base form for subsequent processing.
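A minimal pre-processing sketch along these lines is shown below. It uses a hand-written special-character pattern together with NLTK's sentence and word tokenizers, its English stop-word list, and the WordNet lemmatizer as stand-ins for the tools used in the study.

```python
import re
import nltk  # requires: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# assumed filter: keep letters, digits, whitespace and basic punctuation; drop everything else
SPECIAL = re.compile(r"[^A-Za-z0-9\s.,;:!?'-]")
STOP = set(stopwords.words("english"))
LEMMA = WordNetLemmatizer()

def preprocess(essay: str):
    """Filter special characters, split into sentences and words, remove stop words, lemmatize."""
    cleaned = SPECIAL.sub(" ", essay)
    sentences = nltk.sent_tokenize(cleaned)
    tokenized = []
    for sent in sentences:
        words = [w.lower() for w in nltk.word_tokenize(sent)]
        words = [LEMMA.lemmatize(w) for w in words if w.isalpha() and w not in STOP]
        tokenized.append(words)
    return tokenized
```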

Words with the same grammatical properties belong to the same part of speech. English has ten parts of speech, and two special lexical properties are transitivity and intransitivity. The part of speech of each word needs to be labelled to facilitate the subsequent construction of a dependency syntax tree [20]. The study uses a recurrent dependency neural network part-of-speech tagger, which obtains labels through symmetric inference by relying on the correlations between each node in the neural network and its neighbouring nodes, and maximizes the conditional similarity of the node training data through local model training, preparing for English text processing and analysis. A large number of rare-word features are injected into the tagger to achieve correct annotation of unknown words.
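The study's tagger is a recurrent dependency neural network; as a simple illustration of the tagging step itself, the off-the-shelf NLTK perceptron tagger below produces the word-level part-of-speech labels needed before the dependency tree is built. It is a stand-in, not the study's annotator.

```python
import nltk  # requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Writing reflects a student's ability to use language.")
print(nltk.pos_tag(tokens))
# e.g. [('Writing', 'NN'), ('reflects', 'VBZ'), ('a', 'DT'), ('student', 'NN'), ...]
```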

Information extraction is a text processing technique that extracts factual information such as entities, relations, and events of specified types from natural language text and forms structured data output. The relational triad form contains subjects, relations, and objects. The relationships between words in a sentence are realized through dependency syntactic analysis, and the dependency syntactic tree expression is shown in the following equation:

(9) $f = \{(s, r, o) : 0 \le s \le m,\ 1 \le o \le m,\ r \in C\}$

where (s, r, o) is a textual dependency arc, r is the dependency type between the core word and the modifier, and C is the set of dependency types. There is a wide variety of dependencies between words, and the results of the dependency syntactic analysis are shown in Figure 3.

Figure 3: Dependency parsing results.
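As a rough illustration of how (subject, relation, object) triples can be read off a dependency parse such as the one in Figure 3, the sketch below uses spaCy's small English pipeline as a stand-in parser; the dependency labels checked here are an assumption of this example, not the study's extraction rules.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English pipeline

def extract_triples(sentence: str):
    """Collect (subject, relation, object) triples around each verb of the dependency parse."""
    triples = []
    doc = nlp(sentence)
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_triples("Students improve their writing skills through practice."))
# e.g. [('Students', 'improve', 'skills')]
```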

The study comprehensively analyses the tangent quality of an English composition by calculating the semantic similarity between sentences and the composition topic, between sentences and their paragraphs, between paragraphs and the topic, and between the full text and the topic, and analyses the quality of topic coherence and opinion expression by combining the topic coherence algorithm and the topic opinion algorithm. The tangent score of an English composition is calculated as shown in the following equation:

(10) $G_{\mathrm{inTopic}} = \gamma_1 \frac{\sum_{i=1}^{N}\cos\theta_{ST}}{N} + \gamma_2 \frac{\sum_{i=1}^{N}\cos\theta_{SP}}{N} + \gamma_3 \frac{\sum_{j=1}^{M}\cos\theta_{PT}}{M} + \gamma_4 \cos\theta_{CT}$

where γ is the hyperparameter with γ_1 + γ_2 + γ_3 + γ_4 = 1, G_inTopic is the tangent analysis score of the English composition, Σ_{i=1}^{N} cos θ_ST and Σ_{i=1}^{N} cos θ_SP are the sums of the semantic similarities of the N sentences with the topic and with their paragraphs, respectively, Σ_{j=1}^{M} cos θ_PT is the sum of the semantic similarities of the M paragraphs with the topic, and cos θ_CT is the semantic similarity between the full English composition and the topic. The topic coherence score is calculated as shown in the following equation:

(11) $G_{\mathrm{TopicCoherence}} = \varepsilon_1 \frac{\sum_{i=1}^{N}\mathrm{TPMI}(s_{\mathrm{center}})}{N} + \varepsilon_2 \frac{\sum_{i=1}^{2M-2}\mathrm{TPMI}(p_{\mathrm{center}},\,p)}{2M-2} + \varepsilon_3 \frac{\sum_{i=1}^{M}\mathrm{TPMI}(p_{\mathrm{center}},\,c)}{M}$

where G_TopicCoherence is the topic coherence score of the English composition, ε is the hyperparameter with ε_1 + ε_2 + ε_3 = 1, Σ_{i=1}^{N} TPMI(s_center) is the sum of the topic coherence of the N sentences with the paragraphs they belong to, and Σ_{i=1}^{2M−2} TPMI(p_center, p) and Σ_{i=1}^{M} TPMI(p_center, c) are the sums of the topic coherence of the M paragraphs with their context and with the whole text, respectively. The topic opinion score of the English composition is calculated as shown in the following equation:

(12) $G_{\mathrm{TopicSen}} = \eta_1 \frac{\sum_{j=1}^{M^2-M}\cos\theta_{\mathrm{Senti}\,p_1 p_2}}{M^2 - M} + \eta_2 \frac{\sum_{j=1}^{M}\cos\theta_{\mathrm{Senti}\,pc}}{M}$

where G_TopicSen is the topic viewpoint score of the English composition, η is the hyperparameter with η_1 + η_2 = 1, and Σ_{j=1}^{M²−M} cos θ_Senti p1p2 and Σ_{j=1}^{M} cos θ_Senti pc are the sums of the semantic correlations of sentiment tendency between pairs of the M paragraphs and between each paragraph and the whole text, respectively. The English composition quality score is obtained by weighting the tangent, coherence, and topic viewpoint scores as shown in the following equation:

(13) $G = (\rho_1 G_{\mathrm{inTopic}} + \rho_2 G_{\mathrm{TopicCoherence}} + \rho_3 G_{\mathrm{TopicSen}}) \times 100$

where G is the overall tangent quality score of the English composition, ρ is the quality score hyperparameter, and ρ_1 + ρ_2 + ρ_3 = 1. Generating an English composition tangent model in an unsupervised semantic space is a complex process involving multiple steps: data pre-processing, feature extraction, model training, and prediction. In the data pre-processing stage, the original data are cleaned and standardized to ensure data quality and consistency. In the feature extraction stage, meaningful feature vectors that capture the important information in the text are extracted from the original data. During the model training phase, machine learning algorithms are used to classify the feature vectors. Finally, in the prediction phase, the trained model is applied to new text data to predict the category or label of the text.
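A compact sketch of the scoring step in Eqs. (10)-(13) is given below: each sub-score is an average of the corresponding similarity or coherence values, and the final quality score is their weighted sum scaled to a 0-100 range. The weight tuples are illustrative defaults; the study treats γ, ε, η, and ρ as hyperparameters whose components sum to 1.

```python
def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0

def in_topic_score(sims_st, sims_sp, sims_pt, sim_ct, g=(0.3, 0.2, 0.2, 0.3)):
    """Eq. (10): weighted average of sentence-title, sentence-paragraph,
    paragraph-title and full-text-title similarities."""
    return g[0] * mean(sims_st) + g[1] * mean(sims_sp) + g[2] * mean(sims_pt) + g[3] * sim_ct

def coherence_score(tpmi_sent, tpmi_para, tpmi_text, e=(0.4, 0.3, 0.3)):
    """Eq. (11): weighted average of sentence-, paragraph- and text-level coherence."""
    return e[0] * mean(tpmi_sent) + e[1] * mean(tpmi_para) + e[2] * mean(tpmi_text)

def viewpoint_score(senti_pp, senti_pc, h=(0.5, 0.5)):
    """Eq. (12): sentiment-tendency agreement between paragraph pairs and with the full text."""
    return h[0] * mean(senti_pp) + h[1] * mean(senti_pc)

def essay_quality(g_in, g_coh, g_sen, rho=(0.5, 0.3, 0.2)):
    """Eq. (13): final quality score on a 0-100 scale."""
    return (rho[0] * g_in + rho[1] * g_coh + rho[2] * g_sen) * 100
```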

4 Experimentation and analysis of English composition tangent model in unsupervised semantic space

4.1 Experimental test sets and assessment criteria

The parameters of the relational-triple HTP model were optimized on the Wikipedia corpus, the International Corpus Network of Asian Learners of English (ICNALE), and the Chinese Learner English Corpus: the topic smoothing distribution was set to 10, the predictive clustering parameter of the topic relational triples to 1, and the relational-triple smoothing distribution to 0.1. The test set data are shown in Figure 4.

Figure 4: Test set data.

A corresponding number of off-topic (run-on) essays was added to each topic in the test set. The sentence test set was a random selection of 16,010 sentences from 1,000 essays under two essay topics from ICNALE, including 6,930 on-topic sentences and 5,200 thematically incoherent sentences. A denotes the number of samples judged positive by both the model and the manual assessment, B the number judged positive by the model but negative by the manual assessment, C the number judged negative by the model but positive by the manual assessment, and D the number judged negative by both. P is the precision of the model, with higher values indicating a better measurement effect; R is the recall, with higher values indicating a higher detection rate; and F is the comprehensive evaluation index, with higher values indicating a better classification effect. The experiment referred to the scoring standard of the CET-4/6 essay to develop the evaluation criteria for English composition tangency, as shown in Table 1.

Table 1

Evaluation criteria for the degree of relevance in English compositions

Score  Evaluation criterion
90–100  Content to the point; good thematic coherence; opinions expressed clearly and fluently
80–90  Content to the point; good thematic coherence; opinions expressed clearly; minor language errors
70–80  Content basically relevant to the topic; theme generally coherent; inadequate expression of thematic viewpoints
60–70  Content basically relevant to the topic; poor thematic coherence and expression of thematic viewpoints
0–60  Content not well organized; poor thematic coherence; chaotic expression of thematic viewpoints

The experiment evaluates the degree of tangency of English compositions with the model and, combining the teachers' evaluation results, computes the Pearson correlation coefficient to reflect the correlation between the two. The Pearson correlation coefficient r_{x,z} ranges over [−1, 1], with higher values indicating a stronger correlation between the two samples. x_i is the i-th sample in the teacher manual assessment sample set x, z_i is the i-th sample in the model assessment sample set z, and n is the number of samples in the sets x and z.
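For reference, the evaluation quantities defined above (precision P, recall R, F-value, and the Pearson correlation coefficient between teacher and model scores) can be computed as in the following minimal sketch, where A, B, and C follow the definitions given in this section.

```python
import math

def prf(A, B, C):
    """Precision, recall and F-value from the confusion counts defined in Section 4.1."""
    precision = A / (A + B) if A + B else 0.0
    recall = A / (A + C) if A + C else 0.0
    f_value = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_value

def pearson(x, z):
    """Pearson correlation coefficient between manual scores x and model scores z."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    cov = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sz = math.sqrt(sum((zi - mz) ** 2 for zi in z))
    return cov / (sx * sz) if sx and sz else 0.0
```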

4.2 Analysis of model experimental data

The experimental test set consisted of 22,000 English essays under ten topics from the corpus, of which 15,000 were on-topic essays and 7,000 were off-topic (run-on) essays. Table 2 shows the experimental results of the proposed semantic space-based method for analysing the tangency of English essays.

Table 2

Experimental results of relevance analysis method

Composition title P (%) R (%) F (%)
My dream 94.01 87.21 90.48
Enjoy the fun of process 94.86 86.97 90.74
Never submit to difficulty 93.42 88.31 90.79
The importance of interest and willpower 94.56 89.01 91.70
To be the most special one 94.89 88.78 91.73
Family pets 92.18 88.12 90.10
The allocation of time 94.78 87.03 90.74
Life attitude 89.96 87.54 88.73
Growth that starts from thinking 90.36 89.82 90.09
My favourite sport 90.93 87.56 89.21

Analysis of the experimental data in Table 2 shows that the highest accuracy of the semantic space-based English composition tangent analysis algorithm is 94.86%, the average accuracy is 93.00%, the highest recall is 89.82%, the average recall is 88.04%, the highest F-value is 91.73%, and the average F-value is 90.43%. Because the study extended the topic set, the effectiveness of the tangent analysis method was stable across English composition topics of different lengths. To verify the effectiveness of the HTP hybrid semantic space (marked as D), experiments compared it with the Word2Vec + topic hierarchical tree semantic space (marked as A), the Word2Vec + improved topic hierarchical tree semantic space (marked as B), and the Word2Vec + topic hierarchical tree + knowledge base semantic space (marked as C); the experimental results are shown in Figure 5.

Figure 5: Experimental results of semantic space relevance analysis.

As can be seen in Figure 5, the recall rates of the four semantic spaces for tangent analysis do not differ much, all remaining around 88%, but the recall rate under the HTTHSS is slightly higher at 88.53%. The accuracy of English tangent analysis in the HTTHSS was 91.65% with an F-value of 90.08%, indicating that the experiment was effective. The experiments then compared the word embedding distributed vector representation method (WEDVRM, marked as A) and LDA + WEDVRM (marked as B) with the proposed semantic space-based English composition tangent analysis algorithm (marked as C) on 5,000, 10,000, 15,000, and 20,000 compositions from the test set, respectively, and the experimental results are shown in Figure 6.

Figure 6: Comparison of experimental results of different relevance analysis methods. (a) 5,000 essays, (b) 10,000 essays, (c) 15,000 essays, and (d) 20,000 essays.

As can be seen from Figure 6, the three indicators for all three algorithms increased as the number of compositions increased, indicating that the more English compositions are analysed, the more stable the performance of the tangent analysis algorithm becomes. The accuracy and F-value of the semantic space-based English composition tangent analysis algorithm were significantly higher than those of the WEDVRM and LDA + WEDVRM, reaching 91.65 and 90.12%, respectively, when 20,000 English compositions were analysed, improving the accuracy and F-value by 5 and 3%, respectively. The recall rates of the three algorithms did not differ significantly, all remaining at around 88%. The tangent analysis model then analysed 16,010 sentences (including 6,930 marked on-topic sentences) from 1,000 essays, with different tangent thresholds set in the extraction experiments; the experimental results are shown in Figure 7.

Figure 7: Experimental results of the topic analysis model under different extraction thresholds for topic-specific sentences. (a) Results of extracting relevant sentences and (b) extraction results at a threshold of 0.6.

Figure 7(a) shows the experimental results of the tangent analysis model under different on-topic sentence extraction thresholds. Analysis of the experimental data shows that the highest F-value of the tangent analysis algorithm is 86.67% when the threshold is 0.6, at which point on-topic sentence extraction is most satisfactory. When the extraction threshold was low, the recall of on-topic sentence extraction was higher but the accuracy was lower; when the extraction threshold was high, the accuracy was higher but the recall was lower; neither setting could complete the extraction task stably. Figure 7(b) shows the results of the model's on-topic sentence extraction over the 1,000 essays when the extraction threshold is set to 0.6. As the number of English essays increases, the precision of the tangent analysis algorithm also increases, with the highest accuracy being 89.12%, the recall remaining basically stable, and the mean F-value being 86.67%. The model also analysed the 16,010 sentences (including 5,200 marked thematically incoherent sentences) from the 1,000 essays, with different thresholds set in the experiments, and the results are shown in Figure 8.

Figure 8: Analysis results for topic-incoherent sentences. (a) Incoherent sentence extraction results and (b) extracting incoherent sentences with a threshold of 0.32.

Analysis of the experimental data in Figure 8 shows that the highest F-value of 87.68% was achieved when the extraction threshold for thematically incoherent sentences was 0.32. As the number of English compositions increases, the precision of the tangent analysis algorithm also increases, with a maximum accuracy of 90.16% and a decrease in recall. To verify the practical applicability of the English composition tangent model in unsupervised semantic space, the experiment used professional English teachers' ratings as a comparison, with both the model and the teachers scoring the same 1,000 English compositions for tangency; the experimental results are shown in Figure 9.

Figure 9: Comparison of model scoring and teacher scoring results.

In Figure 9, both the model scores and the teacher scores are mainly concentrated between 80 and 90 points. Analysis of the experimental data shows that the mean score given by the English composition tangent model in unsupervised semantic space to the 1,000 English compositions is 83.56 points and the mean teacher score is 82.36 points; the difference between the two is 1.2 points and the Pearson correlation coefficient is 0.8936, indicating a strong correlation. The experiment shows that the English composition tangent model in unsupervised semantic space has high credibility and practicality.

5 Conclusion

In the context of globalization, English occupies a major position as a common language; various English examinations and competitions usually involve the writing of English essays, and marking those essays is a substantial undertaking. To improve the efficiency of English composition correction, this study proposes an unsupervised semantic space-based English composition tangent model, combining a relational-triple hierarchical tree topic model with a tangent analysis algorithm to achieve intelligent scoring of English compositions. The experimental data show that the accuracy and F-value of the semantic space-based tangent analysis method are significantly higher than those of the WEDVRM and LDA + WEDVRM, reaching 91.65 and 90.12%, respectively, when 20,000 English compositions are analysed. The highest F-value for on-topic sentence extraction is 86.67% at an extraction threshold of 0.6, and the highest F-value for extracting thematically incoherent sentences is 87.68% at a threshold of 0.32. As the number of English compositions increases, the precision of the tangent analysis algorithm also increases, with a maximum accuracy of 90.16%. The mean score given by the model to 1,000 English compositions was 83.56, the mean teacher score was 82.36, and the difference between the two scores and the Pearson correlation coefficient were 1.2 points and 0.8936, respectively. The results show that the English composition tangent model in unsupervised semantic space is stable and applicable and can accurately and efficiently complete the English composition review task. The study performs tangent analysis on the topics, sentences, and paragraphs of English compositions; the analysis could be further improved by incorporating features such as discourse structure.

  1. Funding information: The research was supported by the national-level project "A Study on the Construction and Promotion of Jiaodong Culture into English Quality Courses in Higher Vocational Colleges" (No. WYW2022A03).

  2. Author contributions: Rihong Tang: methodology, conceptualization, Writing – Original Draft, Writing – Review & Editing.

  3. Conflict of interest: The author reports there are no competing interests to declare.

  4. Data availability statement: The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

References

[1] Cowen AS, Keltner D. Semantic space theory: A computational approach to emotion. Trends Cognit Sci. 2021;25(2):124–36. doi: 10.1016/j.tics.2020.11.004.

[2] Sato N, Matsumoto R, Shimotake A, Matsuhashi M, Otani M, Kikuchi T, et al. Frequency-dependent cortical interactions during semantic processing: an electrocorticogram cross-spectrum analysis using a semantic space model. Cereb Cortex. 2021;31(9):4329–39. doi: 10.1093/cercor/bhab089.

[3] Huang GM, Zhang XW. An analysis model of potential topics in English essays based on semantic space. J Comput. 2022;33(1):151–64. doi: 10.53106/199115992022023301014.

[4] Neumeyer L, Franco H, Digalakis V, Weintraub M. Automatic scoring of pronunciation quality. Speech Commun. 2000;30(2–3):83–93. doi: 10.1016/S0167-6393(99)00046-1.

[5] Nimrah S, Saifullah S. Context-free word importance scores for attacking neural networks. J Comput Cognit Eng. 2022;1(4):187–92. doi: 10.47852/bonviewJCCE2202406.

[6] Waziri TA, Yakasai BM. Assessment of some proposed replacement models involving moderate fix-up. J Comput Cognit Eng. 2023;2(1):28–37. doi: 10.47852/bonviewJCCE2202150.

[7] Xiao W, Luo X, Xie S. Feature semantic space-based sim2real decision model. Appl Intell. 2023;53(3):4890–906. doi: 10.1007/s10489-022-03566-5.

[8] Kherwa P, Bansal P. Three level weight for latent semantic analysis: an efficient approach to find enhanced semantic themes. Int J Knowl Learn. 2023;16(1):56–72. doi: 10.1504/IJKL.2023.127328.

[9] Yu J, Jin L, Chen J, Xiao Y, Tian Z, Lan X. Deep semantic space guided multi-scale neural style transfer. Multimed Tools Appl. 2022;81(3):3915–38. doi: 10.1007/s11042-021-11694-2.

[10] Yu HF. Bibliographic automatic classification algorithm based on semantic space transformation. Multimed Tools Appl. 2020;79(13):9283–97. doi: 10.1007/s11042-019-7400-3.

[11] Orhan U, Tulu CN. A novel embedding approach to learn word vectors by weighting semantic relations: semspace. Expert Syst Appl. 2021;180:115146–53. doi: 10.1016/j.eswa.2021.115146.

[12] Zhao Y. Research and design of automatic scoring algorithm for English composition based on machine learning. Sci Program. 2021;3429463–72. doi: 10.1155/2021/3429463.

[13] Wang H, Zhao Y, Lin H, Zuo X. Automatic scoring of Chinese fill-in-the-blank questions based on improved P-means. J Intell Fuzzy Syst. 2021;40(3):5473–82. doi: 10.3233/JIFS-202317.

[14] Yuan Z. Interactive intelligent teaching and automatic composition scoring system based on linear regression machine learning algorithm. J Intell Fuzzy Syst. 2021;40(2):2069–81. doi: 10.3233/JIFS-189208.

[15] Xia L, Luo D, Liu J, Guan M, Zhang Z, Gong A. Attention-based two-layer long short-term memory model for automatic essay scoring. J Shenzhen Univ Sci Eng. 2021;37(6):559–66. doi: 10.3724/SP.J.1249.2020.06559.

[16] Saihanqiqige HE. Application research of English scoring based on TF-IDF clustering algorithm. IOP Conf Ser: Mater Sci Eng. 2020;750(1):12215–301. doi: 10.1088/1757-899X/750/1/012215.

[17] Lewis M, Marsden D, Sadrzadeh M. Semantic spaces at the intersection of NLP, physics, and cognitive science. FLAP. 2020;7(5):677–82.

[18] Shi L, Du J, Liang M, Kuo F. Dynamic topic modeling via self-aggregation for short text streams. Peer-to-Peer Netw Appl. 2019;12(1):1403–17. doi: 10.1007/s12083-018-0692-7.

[19] Kou F, Du J, Lin Z, Liang M, Li H, Shi L, et al. A semantic modeling method for social network short text based on spatial and temporal characteristics. J Comput Sci. 2018;28(1):281–93. doi: 10.1016/j.jocs.2017.10.012.

[20] Shi L, Song G, Cheng G, Liu X. A user-based aggregation topic model for understanding user's preference and intention in social network. Neurocomputing. 2020;413(1):1–13. doi: 10.1016/j.neucom.2020.06.099.

Received: 2023-09-01
Accepted: 2024-02-20
Published Online: 2024-05-22

© 2024 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
