Cross-modal multi-label image classification modeling and recognition based on nonlinear

Shuping Yuan; Yang Chen; Chengqiong Ye; Mohammed Wasim Bhatt; Mhalasakant Saradeshmukh; Md Shamim Hossain

doi:10.1515/nleng-2022-0194

Article Open Access


Published/Copyright: January 24, 2023


From the journal Nonlinear Engineering Volume 12 Issue 1

Abstract

Recently, predicting the labels that co-occur in a picture has become a popular strategy in multi-label image recognition. Previous work has concentrated on capturing label correlation but has neglected to correctly fuse image features and label embeddings, which substantially affects the model’s convergence efficiency and restricts further improvement of multi-label image recognition accuracy. In order to better classify labeled training samples of corresponding categories in the field of image classification, a cross-modal multi-label image classification modeling and recognition method based on nonlinear is proposed. Multi-label classification models based on deep convolutional neural networks are constructed for the visual and text modalities, respectively. The visual classification model uses natural images and simple single-label biomedical images to achieve heterogeneous and homogeneous transfer learning, capturing the general features of the general field and the proprietary features of the biomedical field, while the text classification model uses the description text of simple biomedical images to achieve homogeneous transfer learning. The experimental results show that the multi-label classification model combining the two modes obtains a Hamming loss similar to the best performance on the evaluation task, and the macro average F1 value increases from 0.320 to 0.488, which is about 52.5% higher. The cross-modal multi-label image classification algorithm can better alleviate the problem of overfitting in most classes and has better cross-modal retrieval performance. In addition, the effectiveness and rationality of the two cross-modal mapping techniques are verified.

Keywords: multiple label points; cross-mode retrieval; deep learning

1 Introduction

In a traditional single-label image classification system, each image is annotated with a single category. To identify such images accurately, one must learn a classifier from a known training data set and then use this classifier to classify the test images; the category of each test image must have appeared in the training phase [1]. Image classification has a wide range of uses. It may be used to differentiate between locations according to their land use, and land-use information is frequently used in urban planning. High-resolution imagery is also used to examine the effects and damage caused by natural disasters such as floods, volcanic eruptions, and severe droughts [2,3]. In practice, training data and labeling information are often difficult to obtain. On the one hand, there are many kinds of things in the world, and their number continues to increase. On the other hand, a given category can be further subdivided into many subcategories; for example, dogs can be subdivided into Tibetan mastiffs, pugs, huskies, etc. [4]. Therefore, labeling all objects and then classifying them is inefficient and unrealistic. Because of the unequal information between image and label, traditional single-label classification scales poorly and struggles to meet practical requirements. The emergence of zero-shot learning (ZSL) solves the problem of missing labels to a certain extent. Its aim is to mimic the human ability to identify new categories without having seen actual visual examples. Humans have this ability because they can make semantic connections between categories they have never seen and ones they have seen. For example, a child who can recognize a horse but has never seen a zebra may be told that a zebra is similar to a horse but with black and white stripes [5]. Then, when the child sees a zebra, the child has a better chance of identifying it accurately. ZSL recognition in machines is likewise predicated on the availability of a labeled training set of seen classes, together with knowledge of how each unseen class is semantically linked to the seen classes [6,7]. A zero-shot image classification system establishes a mapping between the visual space and the semantic space through labeled training data of seen categories, and assigns category labels to test data of unseen categories according to the visual and semantic connections between the training data and the test data.

Early multi-label classification algorithms identify each label individually, incorrectly dividing the problem into several independent classification problems. Deep convolutional neural networks (CNNs) have been developed for image classification, and the precision of image recognition techniques has been improved using CNNs and their variants [8,9]. Deep neural networks (DNNs), specifically CNNs, have been widely used in image and text categorization since 2012, with impressive results. CNN-based medical image classification has yielded findings equivalent to those of a human expert [10,11]. As in general-domain multi-label classification, medical image multi-label classification models mainly face three challenges: unbalanced label distribution in the data set, the small scale of annotated data sets, and label dependence. First, when authors of medical literature illustrate a medical problem, they tend to combine semantically related images of various modes into composite images. Subgraphs of different sizes may be distributed in various positions, and the number of composite image instances corresponding to each label is not balanced [12]. Therefore, multi-label medical image classification is more complicated than single-label medical image classification. Second, a small data set paired with a large number of model parameters can easily lead to overfitting. Moreover, data set annotation is very costly: manually annotating medical image data sets with millions of images requires a great deal of manpower and material resources [13]. Multi-label classification is nonetheless quite useful in medical data analysis, covering topics such as diagnosis, surgery, biology, disease progression, assessment, and teaching. Many patients, including those with eye diseases, have several diseases that manifest at the same time [14,15]. Multi-label classification is, however, by definition a tough task, owing to the high dimensionality, flatness, and imbalance of the data. Label difficulties include label dependency, location, interlabel diversity, and familiarity [16,17]. Zhang et al. proposed an unsupervised domain-adaptive image-to-video person re-identification (IVPR) model based on a cross-modal feature generation and target information preservation transfer network (CMGTN). On the one hand, the generator in their model not only transforms the unlabeled sample features of the target domain into the feature space of the source domain but also retains the identity information of the target. On the other hand, the model closes the gap between pedestrian images and videos by embedding cross-modal loss terms. To evaluate the approach, they conducted extensive experiments on the PRID-2011, iLIDS-VID, and MARS data sets and compared it with existing state-of-the-art IVPR models, including four unsupervised approaches and three supervised approaches. The experimental results show the effectiveness of their method [18].

Predicting the labels that co-occur in a picture by modeling label dependencies has become a popular strategy in multi-label image recognition. Previous research has focused on capturing the link between labels but has failed to effectively integrate image features and label embeddings, which reduces the model’s convergence efficiency and limits further gains in multi-label image recognition accuracy. Based on this, this article proposes a cross-modal multi-label image classification method based on nonlinear. A convolutional neural network based on mixed transfer learning is constructed to carry out multi-label classification. That is to say, heterogeneous transfer learning is used to overcome the overfitting problem caused by too-small labeled data sets, and homogeneous transfer learning is used to weaken the negative impact caused by unbalanced data sets.

In the presented work, the visual classification model achieves heterogeneous and homogeneous transfer learning by using natural images and simple single-label biomedical images, capturing the general features of the general field and the proprietary features of the biomedical field, whereas the text classification model achieves homogeneous transfer learning by using the description text of simple biomedical images. In addition, assuming that the current image may be related to every label, all label association probabilities are predicted, and the prediction space is then narrowed by a label calibration step to determine the final set of associated labels [19]. Experimental results show that the proposed method is suitable for the task of extracting subgraph pattern information from composite images of biomedical literature, can better alleviate the problem of overfitting, and has good cross-modal retrieval performance. In addition, the effectiveness and rationality of the cross-modal mapping technique are verified. The rest of the article is organized as follows. Section 2 discusses the research methodology. Section 3 analyzes the results. Section 4 concludes the article and summarizes its main findings.

2 Research methods

2.1 Cross-modal multi-label classification algorithm

Predicting the labels that co-occur in a picture by modeling label dependencies has become a popular strategy in multi-label image recognition. Previous research has concentrated on capturing label correlation but has failed to adequately fuse image features and label embeddings, which substantially affects the model’s convergence efficiency and restricts further improvement of multi-label image recognition accuracy. A single semantic label is usually insufficient to characterize multimedia data (images, text, etc.). Cross-modal retrieval systems previously relied on single-label multimodal data sources; however, several multi-label data sets with multiple modalities have recently been introduced. Multi-label data sets introduce a natural many-to-many relationship across modalities; i.e., each data object from one modality is closely linked to several data objects in the other modality. Any multi-label cross-modal retrieval system that learns a common cross-modal subspace should be able to capture such correlations. Suppose X represents the instance space and L represents the label set containing q labels. A training set is given as shown in formula (1):

(1) $T = \{(x_1, Y_1), (x_2, Y_2), \cdots, (x_n, Y_n)\} \quad (x_i \in X,\ Y_i \subseteq L).$

The goal is to learn a multi-label classifier $h: X \to 2^L$. For convenience, however, a real-valued scoring function $f: X \times L \to \mathbb{R}$ is learned instead. Given an instance $x_i$ and its set of associated labels $Y_i$, $f(x_i, y)$ should output a greater posterior probability for labels $y \in Y_i$. In other words, $f(x_i, y_1) > f(x_i, y_2)$ should hold for any $y_1 \in Y_i$ and $y_2 \notin Y_i$. Using the scoring function $f(\cdot, \cdot)$, the multi-label classifier can be obtained as shown in formula (2):

(2) $h(x_i) = \{y \mid f(x_i, y) > t,\ y \in L\}.$

Here, t can be a constant (for example, 0.5) or a threshold function inferred from the training set; it divides the label space into a set of relevant labels and a set of irrelevant labels.
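The thresholding rule in formula (2) can be illustrated with a minimal sketch. The function name `predict_labels` and the example scores are hypothetical and chosen only for illustration; the label names are taken from the class codes in Table 1.

```python
def predict_labels(scores, threshold=0.5):
    """Return the set of labels whose score f(x, y) exceeds the threshold t (formula (2))."""
    return {label for label, score in scores.items() if score > threshold}


# Hypothetical scores f(x_i, y) for one image over four class codes.
scores = {"DRCT": 0.91, "DRXR": 0.62, "GFIG": 0.08, "DMLI": 0.47}

predicted = predict_labels(scores, threshold=0.5)
print(sorted(predicted))
```

With a constant threshold, every label is judged independently, so the predicted set may contain any number of labels, including none.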

2.2 Transfer learning model

The multi-label transfer learning model includes text and visual parts, as shown in Figure 1.

Figure 1

Architecture of cross-modal multi-label classification model.

2.2.1 Image model

At present, the deep convolutional neural network ResNet-50, a deep residual network with a depth of 50 layers, has achieved excellent results in the field of natural image recognition. The network was originally designed for multi-class classification tasks, so the ResNet structure must be fine-tuned to accommodate multi-label classification tasks: the loss function is replaced with the binary cross-entropy loss, and the softmax in the last layer is replaced with the sigmoid activation function.

(3) $\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}.$

The sigmoid is used to estimate the relevant posterior probability of each label. X is used to represent the n samples in the training set, the learning rate is controlled by the Adamax optimizer, the model is trained with mini-batches of 32 randomly selected images, and the weight w is updated iteratively to minimize the loss function:

(4) $L(w, X) = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i, w), y'_i),$

where $x_i$ is the $i$th sample in training set $X$; $f(x_i, w)$ is the prediction probability vector over the relevant categories output for that sample when the weight is $w$, denoted $y_i$; $y'_i$ is the true relevance category vector of the $i$th sample, represented in one-hot (0/1) form; and $l(\cdot, \cdot)$ is the penalty function, calculated element-wise over the predicted categories, as shown in Eq. (5):

(5) $l(y_i, y'_i) = -\sum_{j=1}^{q} \left( y'_{ij} \log y_{ij} + (1 - y'_{ij}) \log(1 - y_{ij}) \right),$

where $y_{ij}$ is the $j$th element of the vector $y_i$, representing the predicted relevance probability of the $j$th category, and $y'_{ij}$ is the $j$th element of the vector $y'_i$, indicating whether the $j$th category is irrelevant or relevant to sample $x_i$; it takes the value 0 or 1. When training the image model, mixed transfer learning is adopted. First, heterogeneous transfer is used to learn from massive natural image data, so as to alleviate the overfitting problem while keeping the model sensitive to the general features of images in the general domain, such as color, texture, and shape. Specifically, the ResNet-50 network is built with Keras, and the network weights published by the Keras author are loaded to obtain a network pre-trained on the natural image data set ImageNet. Second, homogeneous transfer is used to learn from simple single-label biomedical images. A biomedical composite image contains several different patterns, and although the training labels are given, they are not matched to specific subfigures. Learning from single-label images therefore provides the image content and information associated with each label and weakens the negative impact of the unbalanced label distribution in the composite image data set. Specifically, the weights of most layers of the pre-trained ResNet-50 network are fixed, and the single-mode medical image data sets of ImageCLEF2013 and ImageCLEF2016 are used to retrain the top-level fully connected layer. Finally, the network obtained from this two-step transfer learning is trained on the multi-label medical image data, and multi-label classification is carried out to predict the relevant posterior probability of each label.
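To make Eqs. (3)–(5) concrete, the following sketch evaluates the sigmoid, the per-sample binary cross-entropy, and the mean loss over a batch in plain Python. It is an illustration under stated assumptions, not the authors' Keras implementation: the function names, the example logits, and the small clipping constant `eps` (added to avoid taking the logarithm of 0) are all assumptions.

```python
import math


def sigmoid(x):
    """Eq. (3): map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(y_pred, y_true, eps=1e-12):
    """Eq. (5): binary cross-entropy summed over the q labels of one sample.

    eps is an assumed safeguard that keeps the logarithms finite.
    """
    total = 0.0
    for p, t in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total


def mean_loss(batch_pred, batch_true):
    """Eq. (4): average the per-sample loss over the n samples."""
    return sum(bce_loss(p, t) for p, t in zip(batch_pred, batch_true)) / len(batch_pred)


# Hypothetical raw network outputs for one sample with q = 3 labels.
logits = [2.0, -1.0, 0.0]
y_pred = [sigmoid(z) for z in logits]
y_true = [1, 0, 0]  # only the first label is relevant

print(round(bce_loss(y_pred, y_true), 4))
```

Because each label has its own sigmoid output, the probabilities need not sum to 1, which is what allows several labels to be relevant at once.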

2.3 Nonlinear cross-modal label calibration algorithm

For a test set instance $x_i$ ($1 \le i \le m$), the given classifier outputs the set of posterior probabilities $P = \{p_j \mid p_j \in \mathbb{R},\ 1 \le j \le q\}$. The threshold calibration function $t$ is then used to obtain the set of predicted labels $Y_i = \{y \mid y \in L\}$, where the indicator $y_j$ for the $j$th label is given by Eq. (6):

(6) $y_j = \begin{cases} 1, & p_j \ge t \\ 0, & p_j < t. \end{cases}$

When labeling the current sample, labels whose posterior probability output by the image model is not lower than the threshold are added to the set of relevant labels. Two methods are usually used to select the threshold value. One is the fixed threshold calibration method, which usually uses the popular constant t = 0.5. The other is the dynamic threshold, which is determined by minimizing the difference between the label cardinality of the training set and that of the predictions on the test set, as shown in formula (7):

(7) $t = \arg\min_{t} \left| \mathrm{LCard}(X) - \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{q} 1_{p_j > t} \right|.$

Here, $\mathrm{LCard}(X)$ is the label cardinality of the training set $X$, the most natural way to describe a multi-label data set, namely, the average number of labels per instance over its $n$ instances, as shown in formula (8):

(8) $\mathrm{LCard}(X) = \frac{1}{n} \sum_{i=1}^{n} |Y_i|.$
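The dynamic threshold of formulas (7) and (8) can be sketched as a simple search over candidate thresholds. The candidate grid, the function names, and the toy data are assumptions made for illustration; the source does not state how the minimization is carried out.

```python
def label_cardinality(label_sets):
    """Formula (8): average number of labels per instance."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)


def predicted_cardinality(probs, t):
    """Average number of labels per test instance whose probability exceeds t."""
    return sum(1 for row in probs for p in row if p > t) / len(probs)


def minimize_lcard_threshold(train_label_sets, test_probs, candidates):
    """Formula (7): pick the candidate t whose predicted cardinality is closest
    to the training-set label cardinality (ties go to the first candidate)."""
    target = label_cardinality(train_label_sets)
    return min(candidates, key=lambda t: abs(target - predicted_cardinality(test_probs, t)))


# Hypothetical training label sets (cardinality 1.5) and test posteriors.
train_label_sets = [{"a"}, {"a", "b"}, {"c"}, {"b", "c"}]
test_probs = [[0.9, 0.4, 0.1], [0.7, 0.6, 0.2]]
candidates = [i / 100 for i in range(5, 100, 5)]  # assumed grid 0.05 ... 0.95

t = minimize_lcard_threshold(train_label_sets, test_probs, candidates)
print(t, predicted_cardinality(test_probs, t))
```

This rule assumes that the training and test sets have similar label cardinality; when they differ, the chosen threshold is biased toward the training distribution.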

In this article, the cross-modal model combines the global preference method with the mean value method. Labels are first calibrated according to the threshold function (fixed threshold 0.5) and the posterior probabilities output by the image model. If the resulting relevant label set of a sample is empty, the posterior probabilities output by the image model and the text model are averaged for every label, and the K labels with the highest average probability are taken as the relevant labels (e.g., K = 1).
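The combined calibration rule described above can be sketched as follows. This is a minimal illustration, assuming that both models output one probability per label in the same label order; the function name `calibrate_labels` and the example probabilities are hypothetical.

```python
def calibrate_labels(image_probs, text_probs, labels, threshold=0.5, k=1):
    """Fixed-threshold calibration on the image model, with a cross-modal fallback.

    If no image-model probability reaches the threshold, the image and text
    probabilities are averaged per label and the k highest averages are kept.
    """
    selected = {lab for lab, p in zip(labels, image_probs) if p >= threshold}
    if selected:
        return selected
    means = [(ip + tp) / 2.0 for ip, tp in zip(image_probs, text_probs)]
    ranked = sorted(zip(labels, means), key=lambda pair: pair[1], reverse=True)
    return {lab for lab, _ in ranked[:k]}


labels = ["DRCT", "DRXR", "GFIG"]

# Case 1: the image model alone finds labels above the threshold.
print(sorted(calibrate_labels([0.8, 0.2, 0.6], [0.1, 0.9, 0.1], labels)))

# Case 2: the thresholded set is empty, so the averaged probabilities decide.
print(sorted(calibrate_labels([0.45, 0.30, 0.10], [0.35, 0.60, 0.05], labels)))
```

The fallback guarantees that every sample receives at least one label, which a fixed threshold alone cannot ensure.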

3 Result analyses

3.1 The data set

Multi-label classification models based on deep convolutional neural networks fuse image representations with label co-occurrence embeddings, which improves the model’s convergence efficiency, and image recognition performance improves compared with prior systems. In most classes, the cross-modal multi-label image classification approach improves cross-modal retrieval efficiency and can reduce overfitting. The efficiency and rationality of the two cross-modal mapping techniques are also verified. In this study, the ImageCLEF2016 multi-label classification task data set was used. The training set and test set contain 1,568 and 1,083 images, respectively, each with corresponding explanatory text. The labels of this data set adopt a category subset of the ImageCLEF2013 pattern recognition task, namely the 30 classes that remain after the composite image category (COMP) is removed. The class codes and class names are shown in Table 1.

Table 1

30 Class codes for multi-label classification

Class code	Class name
DRUS	Ultrasonic image
DRMR	Magnetic resonance imaging
DRCT	Computerized tomography
DRXR	Radiography
DRAN	Angiography
DRPE	Positron emission computed tomography
DRCO	Combined multi-mode image superposition
DVDM	Dermatologic image
DVEN	Endoscopic imaging
DVOR	Images of other organs
DSEE	Electroencephalogram (EEG)
DSEC	Electrocardiogram (ECG)
DSEM	Electromyography (EMG)
DMLI	Optical microscope imaging
DMEL	Electron microscope imaging
DMTR	Transmission microscope imaging
DMFL	Fluorescence microscope imaging
D3DR	Three-dimensional reconstruction
GTAB	Tables and forms
GPLI	Program listing
GFIG	Statistical charts
GSCR	Screen capture
GFLO	Flowchart
GSYS	System overview
GGEN	Gene sequence map
GGEL	Gel chromatography
GCHE	Chemical structure diagram
GMAT	Mathematical formula
GNCP	Non-clinical photograph
GHDR	Hand-drawn sketches

The single-label medical image data were derived from two other medical image processing tasks, namely the pattern classification task of ImageCLEF2013 and the subgraph pattern classification task of ImageCLEF2016. The former training set and test set contain 1,796 and 1,568 samples (excluding COMP mode), respectively, while the latter contain 6,676 and 4,166 samples.

3.2 Data preprocessing

When an image is loaded, it is resized to 224 × 224, and the Keras preprocessing tool is used to convert it into a four-dimensional tensor. The channel-first mode is adopted, that is, (N, 3, 224, 224), where N is the number of instances. In order to improve the interpretability of the word vectors, as many biomedical image captions as possible are collected [20]. In addition to the captions provided by the ImageCLEF2016 training set and test set, all figure captions were extracted from the 300,000 medical articles in ImageCLEF2013 [21]. After serialization, the Word2Vec tool is trained on this text to obtain a word vector dictionary, and a word vector lookup table is constructed according to the description text of ImageCLEF2016. The lookup table is used as the weights of the embedding layer of the convolutional neural network and is updated synchronously during network training.

3.3 Evaluation indicators

The evaluation indexes are divided into two types, instance-based and label-based, and two evaluation indexes are selected. Suppose the test set is given as shown in formula (9):

(9) $S = \{(x_1, Y_1), (x_2, Y_2), \cdots, (x_m, Y_m)\} \quad (x_i \in X,\ Y_i \subseteq L).$

Hamming loss (h-loss for short): This index measures how often instance-label pairs are misclassified. The score ranges from 0 to 1, and 0 represents the best result. The calculation formula is shown in Eq. (10):

(10) $\mathrm{hloss}(h) = \frac{1}{m} \sum_{i=1}^{m} \frac{|h(x_i) \,\Delta\, Y_i|}{|L|},$

where Δ represents the symmetric difference. Mathematically, the symmetric difference of two sets is the set of elements that belong to exactly one of the two sets [22]. Macro average F1 value: The F1 value is the harmonic mean of precision and recall. The larger the F1 value, the more effective the classification method. For example, when the precision is fixed, the larger the recall, the larger the F1 value, and vice versa [23]. The macro average can reveal whether an unbalanced label distribution is causing some labels to be overfitted. F1_Macro represents the arithmetic mean of the F1 values of all labels, and its calculation is shown in Eqs. (11)–(14):

(11) $p(h) = \frac{1}{m} \sum_{i=1}^{m} \frac{|Y_i \cap h(x_i)|}{|h(x_i)|},$

(12) $r(h) = \frac{1}{m} \sum_{i=1}^{m} \frac{|Y_i \cap h(x_i)|}{|Y_i|},$

(13) $F1(h) = \frac{2 \times p(h) \times r(h)}{p(h) + r(h)},$

(14) $F1_{\mathrm{Macro}} = \frac{1}{q} \sum_{i=1}^{q} F1_i, \quad y_i \in L.$
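The two evaluation indexes can be computed with a short sketch. Hamming loss follows Eq. (10) directly. For the macro average, the sketch assumes that each per-label F1 value is the harmonic mean of that label's own precision and recall, computed from true positives, false positives, and false negatives, and that a label with no positives at all scores 0; both conventions are assumptions, as are the function names and toy data.

```python
def hamming_loss(pred_sets, true_sets, num_labels):
    """Eq. (10): mean size of the symmetric difference, normalised by |L|."""
    total = sum(len(pred ^ true) for pred, true in zip(pred_sets, true_sets))
    return total / (len(true_sets) * num_labels)


def macro_f1(pred_sets, true_sets, labels):
    """Eq. (14): arithmetic mean of the per-label F1 values."""
    f1_values = []
    for lab in labels:
        pairs = list(zip(pred_sets, true_sets))
        tp = sum(1 for p, t in pairs if lab in p and lab in t)
        fp = sum(1 for p, t in pairs if lab in p and lab not in t)
        fn = sum(1 for p, t in pairs if lab not in p and lab in t)
        denom = 2 * tp + fp + fn
        # 2*tp / (2*tp + fp + fn) equals the harmonic mean of precision and recall.
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(labels)


# Hypothetical ground truth and predictions for three test instances.
labels = ["a", "b", "c"]
true_sets = [{"a"}, {"a", "b"}, {"c"}]
pred_sets = [{"a"}, {"b"}, {"b", "c"}]

print(round(hamming_loss(pred_sets, true_sets, len(labels)), 4))
print(round(macro_f1(pred_sets, true_sets, labels), 4))
```

Because every label contributes equally to the macro average, poor performance on rare labels lowers F1_Macro even when Hamming loss stays small.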

3.4 Experimental results and discussion

3.4.1 Performance comparison of multi-label classification algorithms

The two best results of the ImageCLEF2016 multi-label classification task were selected as the comparison methods. Both are multi-label classification models based on image content: AlexNet was pre-trained on ImageNet and transferred to the current data set, and the label with the maximum score or maximum posterior probability output by an SVM was calibrated as the uniquely relevant label [24]. As shown in Table 2, these benchmark algorithms achieve good performance, with the lowest Hamming loss of 0.0131 and the highest macro average F1 value of 0.320.

Table 2

Performance comparison of ImageCLEF2016 multi-label classification algorithms

Methods	H-Loss (10FCV)	F1_Macro (10FCV)	H-Loss (Test)	F1_Macro (Test)
BMET MLC1 [11]	—	—	0.0131	0.295
BMET MLC2 [11]	—	—	0.0135	0.320
Hetero_TL_V	0.0281	0.171	0.0242	0.237
Hybrid_TL_V	0.0224	0.316	0.0160	0.482
No_TL_T	0.0365	0.082	0.0364	0.024
Homo_TL_T	0.0329	0.117	0.0239	0.185
Hybrid_TL_Cross-Modal	0.0224	0.333	0.0157	0.488

Compared with the most advanced algorithms in this field, the Hybrid_TL_Cross-Modal algorithm based on hybrid transfer learning in this article has a similar Hamming loss, as low as 0.0157, and can accurately calibrate label information. The macro average F1 value increases by 52.5%, from 0.320 to 0.488. In this article, heterogeneous transfer learning is adopted to learn the general characteristics of the general domain from massive natural images, so as to alleviate the overfitting problem caused by the small data scale. Using a homogeneous transfer learning mechanism to learn more domain-specific features from single-label medical images better alleviates the overfitting of some majority labels caused by the unbalanced label distribution.

3.4.2 Cross-modal label calibration

The time complexity of cross-modal retrieval can be divided into two parts: the computation time of the similarity matrix and the sorting time of the retrieved data. Therefore, Figure 2 only compares the retrieval cost of Minimizing_LCard with that of TH_0.5 on the test set. Figure 2 shows that the proposed approach is well suited to extracting subgraph pattern information from composite images of the biomedical literature, that it better alleviates the problem of overfitting, and that it has high cross-modal retrieval performance.

Figure 2

Comparison of the retrieval cost of Minimizing_LCard and th_0.5 on the test set.

Traditional methods were considered, and a comparative analysis was carried out to justify the originality of the work. Many methods for exploiting label dependencies have been investigated in order to reduce and refine the label prediction space. Deep convolutional architectures have been utilized to learn an approximate top-k ranking objective function for multi-label image recognition. In certain cases, a CNN has been combined with an RNN to represent label dependencies sequentially by embedding semantic labels into vectors. The fixed threshold method selects labels with a fixed threshold value of 0.5. If labels are calibrated according to the mean of the prediction probabilities output by the two modal models, this calibration method obtains multi-label classification performance with certain potential, namely, a Hamming loss of 0.0161 and a macro average F1 value of 0.470 (as shown in Table 3). The label calibration method adopted in this article, TH_0.5, combines the globally preferred fixed threshold method with the highest mean probability method. As can be seen from Table 3, both its Hamming loss and its macro average F1 value are better than those of the fixed threshold method Threshold_0.5, and its macro average F1 value is higher than that of all other methods.

Table 3

Performance comparison of threshold calibration algorithms

Methods	H-Loss (10FCV)	F1_Macro (10FCV)	H-Loss (Test)	F1_Macro (Test)
Minimizing_LCard	0.0267	0.348	0.0206	0.477
Threshold_0.5	0.0226	0.326	0.0161	0.470
Highest_Probability	0.0226	0.287	0.0150	0.438
TH_0.5	0.0224	0.333	0.0157	0.488

Table 3 compares the proposed calibration with several alternative methods, and the proposed method performs best on the test set in terms of macro average F1. Relative to the best benchmark result, the macro average F1 value increases from 0.320 to 0.488, which is about 52.5% higher. The dynamic threshold method determines the threshold by minimizing the difference between the label cardinality of the predicted label set and that of the training label set. This method (Minimizing_LCard in Table 3) obtains a dynamic threshold of 0.296. Its Hamming loss and macro average F1 are 0.0206 and 0.477, respectively, which are not as good as the performance of TH_0.5. The macro average F1 value of 0.477 is high and shows certain potential, but the Hamming loss of 0.0206 is 31.2% higher than that of the method in this article. To find the reason, the label cardinalities of the training set and the test set were checked; they are 1.46 and 1.25, respectively, which differ noticeably. Because of this gap, selecting the threshold by minimizing the difference in label cardinality cannot give full play to the advantages of the dynamic threshold method on the current data set. The proposed label calibration method, TH_0.5, achieves a lower Hamming loss of 0.0157 and the highest macro average F1 value of 0.488.
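The two relative differences quoted above can be checked with a few lines of arithmetic. The sketch assumes that the 52.5% gain is measured against the best benchmark macro average F1 in Table 2 (0.320) and that the 31.2% gap compares the Hamming loss of Minimizing_LCard with that of TH_0.5 in Table 3; the helper name `relative_change` is hypothetical.

```python
def relative_change(baseline, value):
    """Relative difference of value with respect to baseline."""
    return (value - baseline) / baseline


# Macro average F1: best benchmark (Table 2) versus the proposed method.
f1_gain = relative_change(0.320, 0.488)

# Hamming loss: TH_0.5 versus Minimizing_LCard (Table 3).
hloss_gap = relative_change(0.0157, 0.0206)

print(f"F1 gain: {f1_gain:.1%}, Hamming loss gap: {hloss_gap:.1%}")
```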

4 Conclusions

This article studies multi-label image classification for complex scene images. Considering the scarcity of labels and the high cost of manual annotation, it introduces the ZSL mechanism, examines the zero-shot multi-label image classification problem in depth, and puts forward a unified learning framework to address it. The corresponding algorithm is designed and improved for each module of the framework. This article proposes a cross-modal multi-label image classification method based on nonlinear, which can capture pattern features from both the image content and the related description text. After the two modal multi-label classification models are fused, the method calibrates biomedical mode labels more effectively than existing methods. Experimental results show that the proposed method is suitable for the task of extracting subgraph pattern information from composite images of biomedical literature, can better alleviate the problem of overfitting, and has good cross-modal retrieval performance. In addition, the effectiveness and rationality of the cross-modal mapping technique are verified. Furthermore, image recognition performance is enhanced compared with state-of-the-art techniques. In the future, we will add an attention mechanism to our model to extract more accurate image attributes and increase image recognition performance.

Funding information: The authors state no funding involved.
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Conflict of interest: The authors state no conflict of interest.
Data availability statement: The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Reference

[1] Xiao, X, Yang J, Ning X. Research on multimodal emotion analysis algorithm based on deep learning. J Phys Conf Ser. 2021;1802(3):032054.10.1088/1742-6596/1802/3/032054Search in Google Scholar

[2] Chen Z, Cong B, Hua Z, Cengiz K, Shabaz M. Application of clustering algorithm in complex landscape farmland synthetic aperture radar image segmentation. J Intell Syst. 2021;30(1):1014–25. 10.1515/jisys-2021-0096.Search in Google Scholar

[3] Chaudhury S, Shelke N, Sau K, Prasanalakshmi B, Shabaz M. A novel approach to classifying breast cancer histopathology biopsy images using bilateral knowledge distillation and label smoothing regularization. Comput Math Methods Med. 2021;2021:4019358. 10.1155/2021/4019358.Search in Google Scholar PubMed PubMed Central

[4] Wang D, Mao K. Task-generic semantic convolutional neural network for web text-aided image classification. Neurocomputing. 2019;329(FEB.15):103–15.10.1016/j.neucom.2018.09.042Search in Google Scholar

[5] Liu Y, Xie Y, Yang J, Zuo X, Zhou B. Target classification and recognition for high-resolution remote sensing images: using the parallel cross-modal neural cognitive computing algorithm. IEEE Geosci Remote Sens Mag. 2020;8(3):50–62.10.1109/MGRS.2019.2949353Search in Google Scholar

[6] Jagota V, Luthra M, Bhola J, Sharma A, Shabaz M. A secure energy-aware game theory (SEGaT) mechanism for coordination in WSANs. Int J Swarm Intell Res. 2022;13(2):1–16. 10.4018/ijsir.287549.Search in Google Scholar

[7] Tang S, Shabaz M. A new face image recognition algorithm based on cerebellum-basal ganglia mechanism. J Healthc Eng. 2021:2021;3688881.10.1155/2021/3688881Search in Google Scholar PubMed PubMed Central

[8] Wang Y, Xie Y, Liu Y, Zhou K, Li X. Fast graph convolution network based multi-label image recognition via cross-modal fusion. Proceedings of the 29th ACM International Conference on Information & Knowledge Management; 2020 Oct 19–23; Online. ACM International, 2020. p. 1575–84.10.1145/3340531.3411880Search in Google Scholar

[9] Duan Y, Chen N, Zhang P, Kumar N, Chang L, Wen W. MS2GAH: Multi-label semantic supervised graph attention hashing for robust cross-modal retrieval. Pattern Recognit. 2022;128:108676. doi:10.1016/j.patcog.2022.108676.

[10] Sharma A, Ansari MD, Kumar R. A comparative study of edge detectors in digital image processing. 2017 4th International Conference on Signal Processing, Computing and Control (ISPCC); 2017 Sep 21–23; Solan, India. IEEE; 2018. p. 246–50. doi:10.1109/ISPCC.2017.8269683.

[11] Bhola J, Soni S. Information theory-based defense mechanism against DDOS attacks for WSAN. In: Harvey D, Kar H, Verma S, Bhadauria V, editors. Advances in VLSI, Communication, and Signal Processing. Lecture Notes in Electrical Engineering. Vol. 683. Singapore: Springer; 2021. doi:10.1007/978-981-15-6840-4_55.

[12] Gu J, Liu B, Li X, Wang P, Wang B. Cross-modal representations in early visual and auditory cortices revealed by multi-voxel pattern analysis. Brain Imaging Behav. 2020;14(5):1908–20. doi:10.1007/s11682-019-00135-2.

[13] Liu L, Zhang H, Zhou D. Clothing generation by multi-modal embedding: a compatibility matrix-regularized GAN model. Image Vis Comput. 2021;107(8):104097. doi:10.1016/j.imavis.2021.104097.

[14] Wang L, Sharma A. Analysis of sports video using image recognition of sportsmen. Int J Syst Assur Eng Manag. 2022;13:1–7. doi:10.1007/s13198-021-01539-4.

[15] Zhang S, Srividya K, Kakaravada I, Karras DA, Jagota V, Hasan I, et al. A Global Optimization Algorithm for Intelligent Electromechanical Control System with Improved Filling Function. Sci Program. 2022;2022:3361027. doi:10.1155/2022/3361027.

[16] Bhola J, Soni S, Cheema GK. Recent trends for security applications in wireless sensor networks – a technical review. 2019 6th International Conference on Computing for Sustainable Global Development (INDIACom); 2019 Mar 13–15; New Delhi, India. IEEE, 2020. p. 707–12.

[17] Chen J, Chen L, Shabaz M. Image Fusion Algorithm at Pixel Level Based on Edge Detection. In: Singh D, editor. Hindawi Limited; 2021. J Healthc Eng. 2021;2021:1–10. doi:10.1155/2021/5760660.

[18] Zhang X, Li S, Jing XY, Ma F, Zhu C. Unsupervised domain adaption for image-to-video person re-identification. Multimed Tools Appl. 2020;79(45):33793–810. doi:10.1007/s11042-019-08550-9.

[19] Huddar MG, Sannakki SS, Rajpurohit VS. Multi-level context extraction and attention-based contextual inter-modal fusion for multimodal sentiment analysis and emotion classification. Int J Multimed Inf Retr. 2020;9(2):103–12. doi:10.1007/s13735-019-00185-8.

[20] Xu X, Li L, Sharma A. Controlling messy errors in virtual reconstruction of random sports image capture points for complex systems. Int J Syst Assur Eng Manag. 2021;1–8. doi:10.1007/s13198-021-01094-y.

[21] Gala R, Budzillo A, Baftizadeh F, Miller J, Sümbül U. Consistent cross-modal identification of cortical neurons with coupled autoencoders. Nat Comput Sci. 2021;1(2):120–7. doi:10.1038/s43588-021-00030-1.

[22] Li D, Wei X, Hong X, Gong Y. Infrared-visible cross-modal person re-identification with an X modality. Proceedings of the AAAI Conference on Artificial Intelligence; 2020 Feb 7–12; New York (NY), USA. AAAI, 2020. p. 4610–7. doi:10.1609/aaai.v34i04.5891.

[23] Chuanxu C, Sharma A. Improved CNN license plate image recognition based on shark odor optimization algorithm. Int J Syst Assur Eng Manag. 2021;1–8. doi:10.1007/s13198-021-01309-2.

[24] Classen D, Siedt M, Nguyen KT, Ackermann J, Schaeffer A. Formation, classification and identification of non-extractable residues of 14C-labelled ionic compounds in soil. Chemosphere. 2019;232(OCT):164–70. doi:10.1016/j.chemosphere.2019.05.038.

Received: 2022-03-01

Revised: 2022-04-14

Accepted: 2022-04-26

Published Online: 2023-01-24

This work is licensed under the Creative Commons Attribution 4.0 International License.

https://doi.org/10.1515/nleng-2022-0194

Keywords for this article

multiple label points; cross-mode retrieval; deep learning
