Article, Open Access

Combining bag of visual words-based features with CNN in image classification

  • Marwa A. Marzouk and Mohamed Elkholy
Published/Copyright: March 7, 2024

Abstract

Although traditional image classification techniques are widely used, they have several drawbacks, such as unsatisfactory results, poor classification accuracy, and a lack of flexibility. In this study, we introduce a combination of a convolutional neural network (CNN) and a support vector machine (SVM), along with a modified bag of visual words (BoVW)-based image classification model. The BoVW model uses scale-invariant feature transform (SIFT) and Oriented FAST and Rotated BRIEF (ORB) descriptors; as a consequence, the resulting SIFT–ORB–BoVW model contains highly discriminative features, which enhance the performance of the classifier. To identify appropriate images and overcome the associated challenges, we have also explored a fuzzy BoVW approach. This study also discusses using the CNN/SVM to improve the proposed feature extractor's ability to learn a more relevant visual vocabulary from the image. The proposed technique was compared with the classic BoVW. The experimental results demonstrate a significant enhancement of the proposed technique, in terms of performance and accuracy, over state-of-the-art BoVW models.

1 Introduction

The systematic division of objects into groups and categories according to their properties is known as classification. Image classification, which teaches the computer using data, was developed to bridge the gap between computer vision and human vision. It involves categorizing images into appropriate categories based on their content. Difficulties in classifying images include changes in viewpoint, lighting, partial occlusion, clutter, and intra-class visual diversity [1]. Numerous studies have attempted to address these problems. Recently, two well-known classification frameworks, bag of visual words (BoVW) and deep learning techniques, have produced many promising results.

The BoVW method has recently been shown to perform exceptionally well in image classification and semantic image analysis across several well-known databases [2]. BoVW is an effective method for characterizing images: visual words encode local information extracted from a large number of images and quantized to construct a visual vocabulary (codebook) [3,4]. Finally, an image is characterized by a histogram, with each bin representing a visual word. Creating a BoVW signature therefore requires three steps: (1) extracting local descriptors, (2) creating a visual vocabulary, and (3) indexing images. In traditional BoVW, the mappings between the visual words and the features retrieved from the images are unclear and uncertain. The challenge of choosing the suitable word from two or more relevant candidates is referred to as word uncertainty [5]. Having to select a word when there is no good candidate in the codebook is referred to as word plausibility. To overcome such problems, the codebook is constructed using a fuzzy clustering approach, namely the fuzzy C-means algorithm. Furthermore, the BoVW system depends on a small number of training samples to identify the spatial properties of each pixel. Even though its benefits have been demonstrated on small datasets, its effectiveness on large datasets is unpredictable and less satisfactory. Deep learning methods based on huge amounts of data, on the other hand, perform better.

Deep learning mirrors the human brain, which processes data through several transformation and representation phases organized in a deep architecture [6]. Deep learning techniques, which largely rely on neural networks, have attracted researchers' curiosity in recent years. Each node (i.e., artificial neuron) in a network is associated with a feature, and neurons in succeeding layers generalize the important characteristics of the preceding layer. Krizhevsky et al. [7] demonstrated promising outcomes when using deep learning techniques. The convolutional neural network (CNN) is widely used for image feature extraction, especially on large databases [8]. CNNs are a specific kind of deep neural network used in deep learning to analyze visual data [9]. Marzouk and Elkholy [10] used the spatial structure of an image to extract information. Since 2012, using CNN methods to solve computer vision issues such as image classification and object recognition has been mainstream. The potential of CNN techniques to address issues in computer vision was substantially increased by the availability of high-performance computing resources, effective algorithms, and large-scale data [11]. Convolutional layers, pooling layers, non-linear activations, and fully connected layers make up a CNN. The convolutional layer detects the existence of diverse shapes, which are identified by different convolutional kernels. By downsampling the convolutional layer's output, the pooling layer keeps the maximal saliency within each local region. Fully connected layers, which represent a multi-layer perceptron, are usually appended at the end. Learning complex functions necessitates the use of non-linear activation functions.

In this study, we use a CNN/support vector machine (SVM) to enhance the proposed feature extractor, so that it can learn more suitable visual vocabularies from the images, by combining an adaptive CNN/SVM with a modified BoVW. Our technique achieved good classification accuracies on several datasets with various image classes. The rest of this article is organized as follows: related work is discussed in Section 2. The proposed model is introduced in Section 3. The experimental results are stated in Section 4, while Section 5 includes the conclusion.

2 Related work

Image classification research has sparked a lot of interest in recent years, owing to how difficult it is to satisfy both the accuracy and the speed requirements of image categorization. The BoVW has been proposed, effectively implemented, and thoroughly researched in the context of image classification. For instance, Csurka et al. [12] proposed a bag-of-visual-words approach for image scene classification. Song and Tao [13] and Quelhas et al. [14] found performance improvements from using local features in a scene grading algorithm. Spatial pyramid matching (SPM) was later developed by Lazebnik et al. [15] and was successful in scene categorization using scale-invariant feature transform (SIFT) features. Oliva and Torralba [16] presented global image structure tensor features intended to describe the scene's spatial structure without capturing fine texture information about the objects or backgrounds in the image. The enormous volume of data makes the conventional technique based on low-level features and high-level semantics impossible to handle when an image type contains thousands of categories and the database holds a million images [17]. Ergene and Durdu [18] proposed using the BoVW model for image classification on robotic hands with a linear SVM; such a model might have a significant impact on image recognition in the robotics sector as robotics advances. Researchers have attempted to offer effective strategies for overcoming the obstacles in object categorization. For object classification, traditional methods such as handcrafted features (HCFs) are used; however, complicated objects limit HCF implementation [19]. By combining HSV color statistical characteristics with a multichannel local binary pattern (LBP)-oriented color descriptor, Latha and Sheela [20] created an innovative color image descriptor for an enhanced hybrid CBIR system. Qi et al. [21] described a visual bag of semantic words image classification model that combines an SVM-based semantic annotation model with an automatic graph-cut segmentation method to find relevant semantic areas. To create the bag of features, Ghahremani et al. [22] proposed an image feature collaboration strategy that relied on the LBP descriptor, SIFT as a local invariant descriptor, and the histogram of oriented gradients (HOG) as a global feature; HOG captures large background-clutter features, whereas SIFT and LBP locate characteristic keypoints. Dave et al. [23] evaluated four feature descriptors, SIFT, Oriented FAST and Rotated BRIEF (ORB), Binary Robust Invariant Scalable Keypoints (BRISK), and Speeded-Up Robust Features (SURF), for an image classification model based on BoVW. The Caltech101 dataset was used, with 261 pictures labeled with three different labels: aircraft, helicopters, and motorcycles. Using the K-nearest neighbors (KNN) classifier, their average accuracies with ORB, SIFT, BRISK, and SURF were 57, 71, 62, and 77%, respectively. With the SURF descriptor, the most accurate classification result was achieved by SVM, with an average classification accuracy of 85%.

As a result, even the most complex and successful descriptors are giving way to deep learning [24]. In scene categorization tasks, deep CNNs in particular have made remarkable breakthroughs. Vakalopoulou et al. [25] provided a method for automated feature extraction based on deep learning; the projected deep features may also account for the difference between building-related and non-building-related objects if additional spectral information is included during the training process [26]. Zuo et al. [27] described a method for categorizing hyperspectral data that uses deep learning to encode both the spectral and spatial information in pixels. Hussain et al. [28] improved classification performance by merging classical features with deep CNN features generated using the latest version of Inception [29], following feature optimization using joint entropy. He et al. [30] observed that in-depth CNN training requires a uniform input size and that uniform image zooming produces distortion or cropping; hence, they built spatial pyramid pooling (SPP). Cimpoi et al. [31] extracted convolutional and fully connected features from a CNN pre-trained on the Places database and used a linear SVM to classify the MIT-67 database. Krizhevsky et al. [7] implemented a deep supervised convolutional network trained with backpropagation for image recognition. An unsafe-objects dataset was used by Kibria and Hasan [32] to illustrate image feature extraction and classification. Several models were put forward, including a CNN, BoVW with SURF as the feature descriptor, and an SVM as the classifier. A total of 2,000 images were used, of which 1,000 showed knives and the remaining 1,000 showed subjects with no knives. The BoVW and CNN models reached accuracies of 84 and 87%, respectively. Both global and local characteristics are used in the study by Raikar and Joshi [33]: a BoVW was created by combining LBP and ORB with global features into a single vector that was then used with several classifiers on the Flowers dataset, where their best model achieved an accuracy of 64.13%.

Generally speaking, the proposed method presents a new image classification system that overcomes most of the problems that image classification systems face. By offering a fuzzy BoVW model that provides helpful concepts and methods for coping with imprecise and uncertain image data, the improved BoVW reduces the disparity between low-level features and the human perception of image data. Additionally, approaches based on high-level semantics and low-level features frequently struggle with vast volumes of data, whereas those based on deep learning do better.

3 Proposed model

Images were first classified using raw pixel data; that is, computers would break images down into their individual pixels. The problem is that two images of the same object might look significantly different: they could have different origins, perspectives, and views. As a result, computers have trouble correctly "seeing" and categorizing images. In the proposed study, we present a model that combines an adaptive CNN with a modified BoVW. CNNs are used to improve the proposed feature extractor's ability to acquire more appropriate visual vocabularies from images. The recommended system's architecture is depicted in Figure 1 and is broken into a number of steps, as follows:

Figure 1: Proposed method.

3.1 Setting up the structure of image data

For an image classification problem to be solved, the data must be in a specific format: two folders, one for the training set and one for the test set. The training set includes a .csv file containing the names of all the training photos and their true labels, as well as an image folder containing all of the training images. The system is trained using images from the training set, and label predictions are made on images from the test set.
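As a concrete illustration, the following minimal Python sketch loads such a training set; the folder layout and the .csv column names (image_name, label) are assumptions for illustration, not prescribed by this study.

import os
import cv2
import pandas as pd

# Assumed layout (illustrative names): train/labels.csv with columns
# "image_name" and "label", and train/images/ holding the training images.
labels = pd.read_csv(os.path.join("train", "labels.csv"))

train_images, train_labels = [], []
for _, row in labels.iterrows():
    path = os.path.join("train", "images", row["image_name"])
    img = cv2.imread(path)  # returns None if the file cannot be read
    if img is not None:
        train_images.append(img)
        train_labels.append(row["label"])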

3.2 Smoothing and edge detection

To mitigate the impact of image noise, one can apply any of several smoothing filters, including the Gaussian filter, the median filter, and the bilateral filter. These filters replace each pixel value with a weighted average of its neighboring pixels, which effectively reduces the noise in the image while retaining the integrity of the edges. The Sobel edge detection approach is then applied.
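A minimal OpenCV sketch of this preprocessing step is given below; the kernel sizes and filter parameters are illustrative choices, and img is assumed to be an image loaded as in the previous sketch.

import cv2
import numpy as np

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Any of the smoothing filters mentioned above may be used here:
blurred = cv2.GaussianBlur(gray, (5, 5), 0)       # Gaussian filter
# blurred = cv2.medianBlur(gray, 5)               # median filter
# blurred = cv2.bilateralFilter(gray, 9, 75, 75)  # bilateral filter

# Sobel edge detection: horizontal and vertical gradients combined
# into a gradient-magnitude map
gx = cv2.Sobel(blurred, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(blurred, cv2.CV_64F, 0, 1, ksize=3)
edges = cv2.convertScaleAbs(np.sqrt(gx**2 + gy**2))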

3.3 Modified model of BoVW

The BoVW model is primarily intended for image descriptors that describe the regions surrounding image keypoints. In this study, image classification is carried out using a modified BoVW that incorporates two local feature descriptors, SIFT and ORB, as shown in Figure 2. However, if a large number of such local descriptors are extracted for each image, searching for the near neighbors of each local descriptor of the query image becomes highly time-consuming [34,35]. The set of visual-word feature vectors for each descriptor is created using the fuzzy C-means method. To overcome the problem of uncertainty and increase the descriptive performance, a modified BoVW model with multiple stages is proposed, as illustrated in Figure 2.

Figure 2: MBOVW.

3.3.1 Step 1: Extracting local descriptors

SIFT and ORB are used to provide local semantic descriptors in this study, which proposes an innovative method for classifying images using SIFT and ORB [36]. The ORB and SIFT features are extracted separately, and their combination is then used to boost the efficiency of the system under consideration. The integrated approach extracts the most robust features from the image. While the ORB descriptor uses only 32 elements per keypoint, the SIFT descriptor uses a 128-element feature dimension, necessitating a large amount of memory for storing features. To further simplify the integrated approach, fuzzy C-means clustering is used.

3.3.1.1 SIFT descriptor

The SIFT descriptors convert an image into a collection of local feature vectors, each of which can reliably match distinct perspectives of an object or scene. As a result, SIFT features have proven highly successful for general object identification and recognition, even in the presence of a large amount of occlusion. SIFT is divided into four basic stages [37]: scale-space extrema detection, keypoint localization and filtering, orientation assignment, and feature description; the resulting descriptors are then used for feature matching. The SIFT feature extraction is coded as follows.

import cv2

# read the images
img1 = cv2.imread('book.jpg')
img2 = cv2.imread('table.jpg')

# convert the images to grayscale
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

# create the SIFT object
# (cv2.xfeatures2d.SIFT_create() in OpenCV builds older than 4.4)
sift = cv2.SIFT_create()

# detect SIFT keypoints and compute their 128-element descriptors in both images
keypoints_1, descriptors_1 = sift.detectAndCompute(img1, None)
keypoints_2, descriptors_2 = sift.detectAndCompute(img2, None)

3.3.1.2 ORB

ORB is a technique for extracting the fewest but most valuable elements from an image. Compared to SIFT and SURF, its computational cost is also lower. ORB uses the FAST keypoint detector to identify a large number of keypoints and then applies the Harris corner measure to select the best features among them. The extracted features produce superior outcomes while being less susceptible to noise. Using the patch moments in ORB, the image centroid is computed with the following equation [23]:

(1) $m_{pq} = \sum_{x,y} x^{p} y^{q} I(x, y)$.

The intensity centroid of image patches is used to compute the corner orientation using the following equation:

(2) $C = \left( \dfrac{m_{10}}{m_{00}}, \dfrac{m_{01}}{m_{00}} \right)$.

The angle between the patch’s center and its centroid is given by the following equation:

(3) $\theta = \operatorname{atan2}\left( \dfrac{m_{01}}{m_{00}}, \dfrac{m_{10}}{m_{00}} \right) = \operatorname{atan2}(m_{01}, m_{10})$.

ORB uses oFAST (oriented FAST) for keypoint detection, which adds a dominant orientation to each keypoint, and rBRIEF (rotation-aware BRIEF) for the description, which incorporates the orientation (rotation) angle. The orientation is used to rotate the sampling pattern of the BRIEF descriptor. The ORB feature extraction is coded as follows.

import cv2 as cv
from matplotlib import pyplot as plt

# read the image directly in grayscale
img = cv.imread('simple.jpg', cv.IMREAD_GRAYSCALE)

# initiate the ORB detector
orb = cv.ORB_create()

# find the keypoints with ORB
kp = orb.detect(img, None)

# compute the 32-element binary descriptors with ORB
kp, des = orb.compute(img, kp)

# draw only the keypoint locations, not size and orientation
img2 = cv.drawKeypoints(img, kp, None, color=(0, 255, 0), flags=0)
plt.imshow(img2), plt.show()

3.3.2 Step 2: Building a visual vocabulary (Codebook)

In the traditional BoVW approach, each feature descriptor of an image is assigned to a single visual word. This hard assignment raises two concerns [38]: (1) "codeword uncertainty" relates to the difficulty of identifying the proper codeword from two or more relevant candidates, and (2) "codeword plausibility" refers to the difficulty of choosing a codeword when there is no good candidate in the lexicon. To address these difficulties, fuzziness is incorporated into the vocabulary creation and assignment processes. Both problems are illustrated in Figure 3.

Figure 3: An illustration of the issues with codeword ambiguity. The circles represent discovered codewords, while the tiny dots indicate image features. The yellow triangle represents a data sample that is appropriate for the codebook technique. The green square displays the issue of codeword uncertainty, whereas the diamond displays the issue of codeword plausibility [39].

3.3.2.1 Fuzzy C-means clustering

In the fuzzy vector quantization framework, each feature is assigned to several codewords rather than one, with a membership value indicating its relevance to each codeword. As described by Altintakan and Yazici [40], the fuzzy C-means (FCM) objective function is modified by the membership values $u_{ij}$. Given the collection of $N$ visual feature descriptors, FCM looks for an optimal set $S$ of $c$ cluster centers:

(4) $J_{\mathrm{fcm}}(S) = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^{m} \lVert x_{j} - c_{i} \rVert^{2}$,

subject to the condition:

(5) $\sum_{i=1}^{c} u_{ij} = 1, \quad \forall j$,

where $u_{ij}$ is the membership value of the $j$th feature to the $i$th codeword. Using iterative optimization, the following update equations are used to minimize Eq. (4):

(6) $u_{ij} = \dfrac{d_{ij}^{-2/(m-1)}}{\sum_{k=1}^{c} d_{kj}^{-2/(m-1)}}$,

(7) $c_{i} = \dfrac{\sum_{j=1}^{N} u_{ij}^{m} x_{j}}{\sum_{j=1}^{N} u_{ij}^{m}}$,

where $d_{ij}$ is the distance of the $j$th feature to the $i$th codeword, and $m$ ($m > 1$) is known as the "fuzzifier" or "weighting exponent"; its value influences how much fuzziness is added to the assignments [40].
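The following NumPy sketch implements the FCM updates of Eqs. (4)–(7) for codebook construction. It is an illustrative implementation under the stated equations, not the authors' exact code, and the parameter values are assumptions.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, eps=1e-9, seed=0):
    # X: (N, d) array of local descriptors; c: number of codewords.
    rng = np.random.default_rng(seed)
    u = rng.random((c, X.shape[0]))
    u /= u.sum(axis=0, keepdims=True)  # enforce Eq. (5): memberships sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = um @ X / um.sum(axis=1, keepdims=True)  # Eq. (7)
        # distances of every descriptor to every codeword center
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + eps
        w = d ** (-2.0 / (m - 1.0))
        u = w / w.sum(axis=0, keepdims=True)  # Eq. (6)
    return centers, u  # codebook centers and (c, N) memberships

Larger values of m spread each descriptor's membership more evenly across the codewords, which is the fuzziness effect discussed in the experiments.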

3.3.3 Step 3: SIFT–ORB features integration

Combine the ORB and SIFT feature vectors, and save the resulting feature vector in a database.

3.3.4 Step 4: Visual indexing

Once the visual vocabulary is formed, the images are indexed into a collection to generate their fuzzy BoVW signatures. Each image is represented as a histogram, with the bins corresponding to the vocabulary entries and the weights representing the appearance frequencies in the image. This histogram is our BoVW signature; we serialize (pickle) it and use it to train our SVM/CNN. Additionally, it is clear from references [41,42] that CNN and SVM classifiers are extremely effective in classifying images. Our study mainly attempts to improve classification accuracy for particular use cases and gives an accurate assessment of how both classifiers respond to different image features.
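A sketch of this indexing step is shown below, under the assumption that sift_memberships and orb_memberships hold each image's descriptor-to-codeword membership matrices (c × N, as returned by the FCM sketch above, computed against the trained centers) and train_labels the corresponding labels; all names are illustrative, and the histogram concatenation is one plausible reading of the Step 3 SIFT–ORB integration.

import pickle
import numpy as np

def fuzzy_bovw_signature(u_sift, u_orb):
    # Sum memberships per codeword to obtain the fuzzy appearance
    # frequencies, normalize, and concatenate the SIFT and ORB histograms.
    h_sift = u_sift.sum(axis=1)
    h_orb = u_orb.sum(axis=1)
    h_sift /= h_sift.sum() + 1e-9
    h_orb /= h_orb.sum() + 1e-9
    return np.concatenate([h_sift, h_orb])

signatures = [fuzzy_bovw_signature(u_s, u_o)
              for u_s, u_o in zip(sift_memberships, orb_memberships)]

# pickle the signatures for the SVM/CNN training stage
with open("bovw_signatures.pkl", "wb") as f:
    pickle.dump({"signatures": signatures, "labels": train_labels}, f)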

We examine the SVM's retrieved input features using t-distributed stochastic neighbor embedding (t-SNE) computations and parallel coordinate plots. Furthermore, the SVM classification score and the feature-vector distances inside the SVM are examined. The CNN's first convolutional-layer outputs are compared with raw input images of various classes using the structural similarity index measure (SSIM). The classification rates of the CNN are also examined, and an appropriate uncertainty confidence interval for the CNN is calculated based on its neural activations. Finally, the importance of individual pixels for the CNN decision is evaluated using smoothed integrated gradients and related to the manually derived image characteristics used by the SVM classifier.

3.4 Model training: SVMs

Owing to its dynamic nature, we use an SVM to train the model. The SVM has some advantages in this setting, including the capacity to avoid over-fitting. In addition, the model may accommodate any number of input dimensions. To cope with these high-dimensional input spaces, we assume that the majority of the features are ineffective; feature extraction/selection tries to identify the features that are not required. Even a classifier that uses the worst-ranked selected features outperforms one that uses random features [43,44].
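A minimal training sketch with scikit-learn is shown below; the pickled file name follows the earlier indexing sketch, and the SVM hyperparameters are illustrative assumptions rather than the values used in the experiments.

import pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

with open("bovw_signatures.pkl", "rb") as f:
    data = pickle.load(f)

X = np.array(data["signatures"])
y = np.array(data["labels"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

svm = SVC(kernel="rbf", C=10.0, gamma="scale")  # illustrative hyperparameters
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))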

3.5 CNN

The CNN used in this study consists of 16 hidden convolutional layers (as illustrated in the figures of [45,46]), max-pooling layers, and fully connected layers. The rectified linear unit (ReLU) activation function is applied after both the convolutional and the fully connected layers, while the softmax activation function is applied at the output layer to estimate the class probabilities.

Keras with a TensorFlow backend is used as the deep learning framework for classification and assessment. It reduces the amount of boilerplate code that must be written, which also contributes to faster research.
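An abbreviated Keras sketch of such a network is shown below; the full model described above stacks 16 convolutional layers, while the input size, filter counts, and class count here are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),  # ReLU conv block
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),                                    # max pooling
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),                     # fully connected
    layers.Dense(10, activation="softmax"),                   # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=300, validation_split=0.1)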

4 Experiments

4.1 Datasets

The proposed image classification system is tested and evaluated on two benchmark image databases. The first dataset is the COREL database, chosen for its large size and its variety of classes and categories, as shown in Figure 4; hence, it allows an accurate evaluation of the performance of the proposed work. The dataset contains 10,000 images, all in the same file format (.jpg) and of the same size, 384 × 256 pixels. To form a query set for performance evaluation, about 1,000 of the 10,000 COREL images are set aside as testing data. The second dataset is Caltech-101, a challenging object identification dataset with 9,144 images divided into one background class and 101 object classes, as seen in Figure 5. The number of photos in each class varies between 31 and 800.

Figure 4: COREL database [47].

Figure 5: Caltech-101 dataset [45].

4.2 Performance evaluation metrics

Performance evaluation was divided into two individual tests. The first test is the performance testing of a fuzzy model of a BOVW, and the second test is the performance testing of CNN.

The accuracy with which a system can classify objects is critical in determining its performance. If the results are promising and satisfactory, they can serve as a benchmark for future research. Precision and recall form the most extensively used evaluation approach for assessing accuracy in image classification, and this pair has been used in several recent works for that purpose [48]:

(8) $\mathrm{Recall} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$,

(9) $\mathrm{Precision} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$,

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively [49]:

(10) $\mathrm{Accuracy\ rate} = \dfrac{\mathrm{Recall} + \mathrm{Precision}}{2} \times 100$.

The F-score is the harmonic mean of precision and recall; a higher value indicates that the system has superior predictive capacity. Precision or recall alone is insufficient for evaluating system performance. The F-score can be written as:

(11) $F\text{-score} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.

The F-score is used to compare performance in cases where one strategy has better precision but a lower recall rate than its comparator.
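For concreteness, the metrics of Eqs. (8)–(11) can be computed from the raw counts as in the following sketch; the counts in the usage line are made-up values.

def classification_metrics(tp, fp, fn):
    # Eqs. (8)-(11): recall, precision, accuracy rate, and F-score
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy_rate = (recall + precision) / 2 * 100
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, accuracy_rate, f_score

# example with made-up counts: recall = precision = 0.97
print(classification_metrics(97, 3, 3))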

4.3 Experimental results

4.3.1 BOVW experiment results

For the COREL database evaluation, Table 1 shows the precision, recall, and F-score results for each class. Over the COREL images, the average precision, recall, and F-score values are 97, 96.9, and 97.05%, respectively. This demonstrates the good prediction performance of the proposed system.

Table 1

Precision, recall, and F-score results for each class for the Corel database

Class name Precision (%) Recall (%) F-score (%)
African 90 88.9 90
Beach 86.7 86.6 86.8
Building 100 100 100
Bus 100 100 100
Dinosaur 100 100 100
Elephant 100 100 100
Flower 100 97.6 98.6
Food 100 100 100
Horse 100 100 100
Mountain 93.2 96.55 95.1
Average 97 96.9 97.05

One aspect of the recommended system's evaluation is comparing the accuracy rate of the improved BoVW with other current models. By varying the feature descriptors and the codebook methodology, the performance of three alternative parameter combinations has been evaluated: (1) the SIFT descriptor with K-means clustering (SIFT-K-means), (2) the SIFT descriptor with LBP and K-means clustering (SIFT-LBP-K-means), and (3) the SIFT and ORB descriptors with K-means clustering (SIFT-ORB-K-means). The evaluation is done by varying the number of images from 500 to 10,000 from the COREL database. The proposed technique outperforms the existing algorithms at various database sizes, as shown in Figure 6. Hence, among all of the combinations, SIFT-ORB performs best over a varying number of images. Additionally, the average accuracy rate for the proposed algorithm is 87.2%, while the average accuracy rates for SIFT-K-means, SIFT-LBP-K-means, and SIFT-ORB-K-means are 73.9, 77.5, and 76.95%, respectively. The rationale for these results is that, in the case of FCM, larger values of the fuzziness degree m increase the sharing of points among all clusters, resulting in greater performance. In addition, combining the SIFT and ORB descriptors revealed that the ORB descriptor has a very robust textural feature that can enhance SIFT by filtering out noise. As a result, combining these two feature descriptors can better capture the features of an object in an image.

Figure 6: Comparative results.

The accuracy results obtained by the suggested SIFT–ORB BoVW descriptor and other related works on the Caltech101 dataset are presented in Table 2 [23,50]. The suggested model outperforms the listed models that use SIFT or ORB alone. However, the most accurate classification was achieved by Karim and Sameer [50]; that work used only three database classes, whereas this research used six database classes, resulting in a greater variety of keypoints.

Table 2

Comparison of average accuracy rates

Methods Dataset Local descriptors Clustering Accuracy %
Proposed method Caltech101 SIFT-ORB Fuzzy C-means 85.6
Karim and Sameer [50] Caltech101 SIFT K-means 86.6
Vinoharan and Ramanan [51] Caltech101 SIFT K-means 81.39
Dave et al. [23] Caltech101 SIFT K-means 72
Caltech101 ORB K-means 57
Chebbout and Merouani [52] Caltech101 SIFT K-means 69.15

4.3.2 Performance evaluation of CNN

This study aims to test the CNN's/SVM's performance against several classifiers over different image database classes; the classifiers include the SVM, the ANN/multi-layer perceptron, the decision tree, and the KNN, as shown in Table 3.

Table 3

Evaluation of performance

Classifier | Class 1 accuracy (%) | Class 1 run time (s) | Class 2 accuracy (%) | Class 2 run time (s) | Class 3 accuracy (%) | Class 3 run time (s)
SVM | 85.68 | 44.4 | 86 | 44 | 87 | 43.56
Decision trees | 86.96 | 39.4 | 85 | 45.6 | 84 | 42.5
KNN | 86.32 | 56.6 | 86 | 50.6 | 86 | 39.8
ANN (for 300 epochs) | 83.10 | 38.2 | 88 | 47.8 | 87 | 45.2
CNN/SVM (for 300 epochs) | 87.2 | 37.6 | 93 | 39.5 | 89 | 38.8

Because this model produced the best results of all, it was trained for a longer period, attaining 87.2% accuracy after 300 epochs with less computing time. The performance table shows that CNN/SVM produces the best results in image classification tasks.

5 Conclusion

For image classification, we proposed an adaptive convolutional neural network combined with BoVW. In BoVW, an image is represented as a collection of features comprising keypoints and descriptors. Prediction and classification can be done rapidly and accurately with this model. In the experiments, the proposed system was tested and evaluated on two benchmark image databases: the COREL database and the Caltech-101 dataset. One aspect of the evaluation was comparing the accuracy rate of the improved BoVW with other current models. Among all of the combination systems, SIFT–ORB performs best over a varying number of images, with an average accuracy rate of 87.2% for the proposed algorithm. This study also tested the CNN's/SVM's performance against several classifiers over different image database classes; this model produced the best results of all and, trained for a longer period, attained 87.2% accuracy after 300 epochs with less computing time. According to the results of the trials, integrating MBOVW with deep learning in a hybrid methodology makes it possible to develop an efficient and trustworthy system for classification and prediction. Generally, the prospects for applying the feature extraction algorithms and MBoVW in medicine and other research disciplines are considerable; applied to the medical realm, these descriptors are powerful tools.

  1. Funding statement: The authors did not receive support from any organization for the submitted work. No funding was received to assist with the preparation of this manuscript or for conducting this study.

  2. Author contributions: All authors have participated in (a) conception and design or analysis and interpretation of the data, (b) drafting the article or revising it critically for important intellectual content, and (c) approval of the final version.

  3. Conflict of interest: All authors declare that they have no conflicts of interest.

  4. Data availability statement: Publicly available datasets were analyzed in this study. This data can be found at: https://www.kaggle.com/datasets/elkamel/corel-images and https://www.kaggle.com/datasets/imbikramsaha/caltech-101.

References

[1] Kumar S, Ansari MD, Gunjan VK, Solanki VK. On classification of BMD images using machine learning (ANN) algorithm. In: ICDSMLA 2019, Proceedings of the 1st International Conference on Data Science, Machine Learning and Applications. Singapore: Springer; 2020. p. 1590–9. doi:10.1007/978-981-15-1420-3_165.

[2] Dias G, Moreno JG, Jatowt A, Campos R. Temporal web image retrieval. In: International Symposium on String Processing and Information Retrieval. Berlin, Heidelberg: Springer; 2012. p. 199–204. doi:10.1007/978-3-642-34109-0_21.

[3] Jiang X, Ma J, Xiao G, Shao Z, Guo X. A review of multimodal image matching: Methods and applications. Inf Fusion. 2021;73:22–71. doi:10.1016/j.inffus.2021.02.012.

[4] Rashid E, Ansari MD, Gunjan VK, Ahmed M. Improvement in extended object tracking with the vision-based algorithm. In: Modern Approaches in Machine Learning and Cognitive Science: A Walkthrough: Latest Trends in AI; 2020. p. 237–45. doi:10.1007/978-3-030-38445-6_18.

[5] Jégou H, Douze M, Schmid C. Improving bag-of-features for large scale image search. Int J Comput Vis. 2010;87(3):316–36. doi:10.1007/s11263-009-0285-2.

[6] Madbouly M, Elkholy M, Gharib YM, Darwish SM. Predicting stock market trends for Japanese candlestick using cloud model. In: Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV 2020), Advances in Intelligent Systems and Computing. Vol. 1153. Cham: Springer; 2020. doi:10.1007/978-3-030-44289-7_59.

[7] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1097–105.

[8] Elkholy M, ElFatatry A. Framework for interaction between databases and microservice architecture. IT Prof. 2019;21(5):57–63. doi:10.1109/MITP.2018.2889268.

[9] Valueva MV, Nagornov NN, Lyakhov PA, Valuev GV, Chervyakov NI. Application of the residue number system to reduce hardware costs of the convolutional neural network implementation. Math Comput Simul. 2020;177:232–43. doi:10.1016/j.matcom.2020.04.031.

[10] Marzouk MA, Elkholy M. Deep image: An efficient image-based deep conventional neural network method for android malware detection. J Adv Inf Technol. 2023;14(4):838–45. doi:10.12720/jait.14.4.838-845.

[11] Elkholy M, Baghdadi Y, Marzouk M. Snowball framework for web service composition in SOA applications. Int J Adv Comput Sci Appl. 2022;13(1):343–50. doi:10.14569/IJACSA.2022.0130143.

[12] Csurka G, Dance C, Fan L, Willamowski J, Bray C. Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision. Vol. 1. Prague: ECCV; 2004.

[13] Song D, Tao D. Biologically inspired feature manifold for scene classification. IEEE Trans Image Process. 2009;19(1):174–84. doi:10.1109/TIP.2009.2032939.

[14] Quelhas P, Monay F, Odobez JM, Gatica-Perez D, Tuytelaars T, Van Gool L. Modeling scenes with local descriptors and latent aspects. In: Tenth IEEE International Conference on Computer Vision (ICCV'05). Vol. 1. IEEE; 2005. doi:10.1109/ICCV.2005.152.

[15] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). IEEE; 2006.

[16] Oliva A, Torralba A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int J Comput Vis. 2001;42(3):145–75. doi:10.1023/A:1011139631724.

[17] Elkholy MI, Marzok MA. Trusted microservices: A security framework for users' interaction with microservices applications. JISCR. 2022;5(2):135–43. doi:10.26735/QOPM9166.

[18] Ergene MC, Durdu A. Robotic hand grasping of objects classified by using support vector machine and bag of visual words. In: 2017 International Artificial Intelligence and Data Processing Symposium (IDAP). IEEE; 2017. doi:10.1109/IDAP.2017.8090228.

[19] Rashid M, Khan MA, Sharif M, Raza M, Sarfraz MM, Afza F. Object detection and classification: A joint selection and fusion strategy of deep convolutional neural network and SIFT point features. Multimed Tools Appl. 2019;78:15751–77. doi:10.1007/s11042-018-7031-0.

[20] Latha D, Sheela CJJ. Enhanced hybrid CBIR based on multichannel LBP oriented color descriptor and HSV color statistical feature. Multimed Tools Appl. 2022;81(17):23801–18. doi:10.1007/s11042-022-12568-x.

[21] Qi Y, Zhang G, Li Y. Image classification model using visual bag of semantic words. Pattern Recognit Image Anal. 2019;29(3):404–14. doi:10.1134/S1054661819030222.

[22] Ghahremani M, Ghadiri H, Hamghalam M. Local features integration for content-based image retrieval based on color, texture, and shape. Multimed Tools Appl. 2021;80(18):28245–63. doi:10.1007/s11042-021-10895-z.

[23] Dave M, Ganatra A, Israni D. Evaluating classifiers and feature detectors for image classification BoVW model: A survey. Int J Comput Eng Appl. 2017;12:1–7.

[24] Marzouk MA, Abd El Azeem A. Vehicles detection and counting based on internet of things technology and video processing techniques. IAES Int J Artif Intell. 2022;11(2):405. doi:10.11591/ijai.v11.i2.pp405-413.

[25] Vakalopoulou M, Karantzalos K, Komodakis N, Paragios N. Building detection in very high resolution multispectral data with deep learning features. In: 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE; 2015. p. 1873–6. doi:10.1109/IGARSS.2015.7326158.

[26] Merentitis A, Debes C. Automatic fusion and classification using random forests and features extracted with deep learning. In: 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE; 2015. doi:10.1109/IGARSS.2015.7326432.

[27] Zuo Z, Wang G, Shuai B, Zhao L, Yang Q, Jiang X. Learning discriminative and shareable features for scene classification. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I. Springer International Publishing; 2014. p. 552–68. doi:10.1007/978-3-319-10590-1_36.

[28] Hussain N, Khan MA, Sharif M, Khan SA, Albesher AA, Saba T, et al. A deep neural network and classical features based scheme for objects recognition: An application for machine inspection. Multimed Tools Appl. 2020;1–23. doi:10.1007/s11042-020-08852-3.

[29] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2818–26. doi:10.1109/CVPR.2016.308.

[30] He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16. doi:10.1109/TPAMI.2015.2389824.

[31] Cimpoi M, Maji S, Vedaldi A. Deep filter banks for texture recognition and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. doi:10.1109/CVPR.2015.7299007.

[32] Kibria SB, Hasan MS. An analysis of feature extraction and classification algorithms for dangerous object detection. In: 2017 2nd International Conference on Electrical & Electronic Engineering (ICEEE). IEEE; 2017. doi:10.1109/CEEE.2017.8412846.

[33] Raikar P, Joshi S. Efficiency comparison of supervised and unsupervised classifier on content based classification using shape, color, texture. In: 2020 International Conference for Emerging Technology (INCET). IEEE; 2020. doi:10.1109/INCET49848.2020.9154016.

[34] Elkholy M, Marzok MA. Light weight serverless computing at fog nodes for internet of things systems. Indones J Electr Eng Comput Sci. 2022;26(1):394–403. doi:10.11591/ijeecs.v26.i1.pp394-403.

[35] Heczko M, Hinneburg A, Keim D, Wawryniuk M. Multiresolution similarity search in image databases. Multimed Syst. 2004;10(1):28–40. doi:10.1007/s00530-004-0135-6.

[36] Madduri A. Content based image retrieval system using local feature extraction techniques. Int J Comput Appl. 2021;183(20):16–20. doi:10.5120/ijca2021921549.

[37] Chen J, Shan S, He C, Zhao G, Pietikäinen M, Chen X. WLD: A robust local image descriptor. IEEE Trans Pattern Anal Mach Intell. 2009;32(9):1705–20. doi:10.1109/TPAMI.2009.155.

[38] Šinjur S, Zazula D. Image similarity search in large databases using a fast machine learning approach. In: New Directions in Intelligent Interactive Multimedia. Berlin: Springer; 2008. p. 85–93. doi:10.1007/978-3-540-68127-4_9.

[39] Altintakan UL, Yazici A. Towards effective image classification using class-specific codebooks and distinctive local features. IEEE Trans Multimed. 2015;17(3):323–32. doi:10.1109/TMM.2014.2388312.

[40] Altintakan UL, Yazici A. An improved BOW approach using fuzzy feature encoding and visual-word weighting. In: 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE; 2015. doi:10.1109/FUZZ-IEEE.2015.7338108.

[41] Elkholy M, baes Mohamed A. Efficient security model for RDF files used in IoT applications. Int J Adv Comput Sci Appl. 2021;12(4):233–9. doi:10.14569/IJACSA.2021.0120431.

[42] Zhao X, Shi X, Liu K, Deng Y. An intelligent detection and assessment method based on textile fabric image feature. Int J Cloth Sci Technol. 2019;31(3):390–402. doi:10.1108/IJCST-01-2018-0005.

[43] Paoletti ME, Haut JM, Tao X, Miguel JP, Plaza A. A new GPU implementation of support vector machines for fast hyperspectral image classification. Remote Sens. 2020;12(8):1257. doi:10.3390/rs12081257.

[44] Bay H, Tuytelaars T, Van Gool L. SURF: Speeded up robust features. In: European Conference on Computer Vision. Springer; 2006. doi:10.1007/11744023_32.

[45] Hemanth DJ, Anitha J, Mittal M. Diabetic retinopathy diagnosis from retinal images using modified Hopfield neural network. J Med Syst. 2018;42(12):1–6. doi:10.1007/s10916-018-1111-6.

[46] Ngoc VTN, Agwu AC, Son LH, Tuan TM, Nguyen Giap C, Thanh MTG, et al. The combination of adaptive convolutional neural network and bag of visual words in automatic diagnosis of third molar complications on dental x-ray images. Diagnostics. 2020;10(4):209. doi:10.3390/diagnostics10040209.

[47] Jiang D, Kim J. Image retrieval method based on image feature fusion and discrete cosine transform. Appl Sci. 2021;11(12):5701. doi:10.3390/app11125701.

[48] Zhang R, Zhang Z. A clustering based approach to efficient image retrieval. In: Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2002). IEEE; 2002.

[49] More AS, Rana DP. An experimental assessment of random forest classification performance improvisation with sampling and stage wise success rate calculation. Procedia Comput Sci. 2020;167:1711–21. doi:10.1016/j.procs.2020.03.381.

[50] Karim AAA, Sameer RA. Image classification using bag of visual words (BoVW). Al-Nahrain J Sci. 2018;21(4):76–82. doi:10.22401/ANJS.21.4.11.

[51] Vinoharan V, Ramanan A. An efficient BoF representation for object classification. ELCVIA Electron Lett Comput Vis Image Anal. 2021;20(2):51–68. doi:10.5565/rev/elcvia.1403.

[52] Chebbout S, Merouani HF. A hybrid codebook model for object categorization using two-way clustering based codebook generation method. Int J Comput Appl. 2022;44(2):178–86. doi:10.1080/1206212X.2020.1712775.

Received: 2023-04-23
Revised: 2023-11-02
Accepted: 2023-11-07
Published Online: 2024-03-07

© 2024 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
