Article Open Access

Towards a better similarity algorithm for host-based intrusion detection system

Published/Copyright: April 10, 2023

Abstract

An intrusion detection system plays an essential role in system security by discovering and preventing malicious activities. Over the past few years, several research projects on host-based intrusion detection systems (HIDSs) have been carried out utilizing the Australian Defense Force Academy Linux Dataset (ADFA-LD). These HIDSs have also been subjected to various algorithm analyses to enhance their detection capability, aiming for high accuracy and low false alarms. However, less attention has been paid to the actual implementation of real-time HIDS. Our principal objective in this study is to create a performant real-time HIDS. We propose a new model, "Better Similarity Algorithm for Host-based Intrusion Detection System" (BSA-HIDS), using the same dataset, ADFA-LD. The proposed model uses three classifications to represent the attack folder according to certain criteria, and the entire system call sequence is used. Furthermore, this work uses textual distance and compares five algorithms (Levenshtein, Jaro–Winkler, Jaccard, Hamming, and Dice coefficient) to classify a system call trace as attack or non-attack, based on the notions of inter-class decoupling and intra-class coupling. The model can detect zero-day attacks because of the way the threshold is defined. The experimental results show good real-time detection performance for the Levenshtein/Jaro–Winkler algorithms: 99–94% in detection rate, 2–5% in false alarm rate, and 3,300–720 s in running time, respectively.

MSC 2010: 68M25

1 Introduction

Concern over the frequency of cyberattacks has grown recently due to the Internet’s rapid development. Around 32% of businesses and 22% of charities in the United Kingdom alone reported experiencing a cyberattack in 2019 [1]. Such attacks can be found using an intrusion detection system (IDS). Despite the tremendous success of the established intrusion detection methods, there has been an increase in interest in either enhancing the current methods or developing new ones [2].

It is much more difficult to identify zero-day attacks (attacks that have never been seen before), as no pattern or signature can be utilized to identify them. In addition, the rate at which system call sequence data required to access, control, or manage connected devices is produced increases rapidly. Therefore, there is a growing demand for effective intrusion detection algorithms that recognize, isolate, and handle suspicious patterns in system call sequences.

Effectively, a system call is a programmatic method through which a computer application asks the kernel of the operating system it runs on for a service. It offers a point of contact between processes and the operating system so that user-level processes can request its services. A trace representing the monitored process's behavior is thus created by a series of system calls, which correspond to the sequential list of service requests issued by a process to the kernel. A well-known example of a dataset containing system call traces is the Australian Defense Force Academy Linux Dataset (ADFA-LD), provided by the Australian Defense Force Academy. The ADFA-LD dataset has been used in numerous papers on intrusion detection. Specifically, it was created to assess anomaly detection and system call-based host-based intrusion detection systems (HIDSs).

In contrast to many other datasets used to evaluate HIDSs, the ADFA-LD is based on Linux local servers and is composed of thousands of system call traces for the most recent attacks and vulnerabilities in various applications. It also reflects the features of current Linux-based operating systems. Given this, it is anticipated that the ADFA-LD would establish a new benchmark for evaluating HIDS. According to Marteau [3], the following factors make it difficult to identify anomalous system call sequences in ADFA-LD:

  • The anomalies are context-dependent. The context of a system call’s existence, or more specifically, the system calls that come before and after it, can be used to determine whether it is abnormal. A sequence taken as a whole may be judged as abnormal.

  • Because system call sequences can vary in length and the alphabet used is relatively large (more than 300 system calls for the Linux system), system call sequence variability is very high.

To compare two sequences side by side or to offer several sequence alignments for a set of sequences, many different similarity measures have been proposed by bioinformatics over the years. Indeed, the similarity is a vague concept that can only be treated quantitatively using an appropriate mathematical representation of the objects to be compared and a comparison metric.

Data similarity analysis today gathers numerous methods and tools to discover the "essential" information in a text, to identify its elements, and to find their similarities and differences. The way to check the similarity between data points or data groups is to calculate the distance between them. For textual data, we likewise check the similarity between strings by calculating the distance between one text and another. These methods have successfully been customized to handle sequences of system calls.

To address the limitations cited in the paragraphs above, we describe in this study a new prototype, BSA-HIDS, which stands for Better Similarity Algorithm for Host-based Intrusion Detection System. We use the system call sequences of the ADFA-LD benchmark; each sequence, regardless of its length, is taken into account as a whole. As far as we know, this technique has not yet been proposed for system call sequence comparison, particularly in intrusion detection. It measures how close sequences are to one another in order to conclude whether a sequence is an attack or normal. The following are this study's contributions:

  • The whole ADFA-LD dataset is used; that is to say, whole system call sequences are used, and the model does not depend on a window size. Indeed, if a system call sequence contains 242 system calls, this model takes the entire sequence, unlike other models, which take just the first 100 system calls or a specified window size.

  • The performances of the three classifications in terms of detection rate, false positive rate, and false negative rate using a comparison of five similarity measures are almost identical. This means that whatever the classification of attacks is, the model performs very well due to how the threshold is defined (the threshold is variable; it is calculated for each sequence).

  • The preprocessing time to perform the three classifications is negligible, about 3 s (training time).

  • There are no parameters to learn.

  • The proposed model can detect zero-day attacks.

  • The performance of the anomaly IDS is improved, with a lower false negative rate of 0.04, a false positive rate of 0.0, and a higher accuracy of 0.99, in a short running time of 3,300 s, obtained by the BSA algorithm running on ten processor cores with the Levenshtein similarity. These results improve on recent works [3,4].

Section 2 of this article briefly reports the related key works. We detail the edit distance-based algorithms in Section 3. We describe the methodology in Section 4, discussing the experimental dataset, which is the system call sequence data released by Creech and Hu [5], the data preprocessing using three attack classifications, and finally, the detection principle used in this study. In Section 5, we discuss the results, and in Section 6, we provide the conclusion and directions for future research.

2 Related work

Computer security has become essential in protecting the integrity of information technology, such as computer systems, networks, and data from attack, damage, or unauthorized access. Several types of research have been conducted in this broad and multifaceted paradigm. Pavithran et al. [6] proposed a novel encryption process to protect a system from attacks. It is based on Deoxyribonucleic acid (DNA) cryptography, a hyperchaotic system, and a Moore machine. Namasudra [7] proposed a novel cryptosystem using DNA cryptography and DNA steganography for the cloud-based IoT infrastructure. Das and Namasudra [8] proposed a novel ciphertext policy attribute-based encryption (CP-ABE)-based fine-grained access control scheme to solve the attribute revocation problem in the CP-ABE technique utilized very often in an IoT-based healthcare system for encrypting patients’ healthcare data. Also, many other research works [9,10,11] address the security of systems.

In this work, we are interested in intrusions. A device that monitors a system for potential intrusions is called an IDS; it is a crucial tool for detecting security violations in real time. If the detection occurs on a network, the IDS is referred to as a NIDS, and if it occurs on a host, as a HIDS. Furthermore, we differentiate between two approaches. The signature-based intrusion detection approach searches for predefined patterns, such as known malicious instruction sequences used by malware or byte sequences in network packets, while the anomaly-based intrusion detection approach was primarily developed to detect unknown (zero-day) attacks. Both approaches have weaknesses: anomaly-based IDS is criticized for producing many false alarms, whereas signature-based IDS cannot identify zero-day attacks.

Many system calls-based anomaly detection models have been developed to increase detection rates and reduce false alarm rates in HIDS [2,3,12,13,14,15,16]. If we restrict the focus to anomaly detection in sequential data, we find that four basic approaches, according to Marteau [3], are taken to treat symbolic sequences: Window-based approaches, Global kernel-based approaches, Generative approaches, and Language-based approaches.

The first approach uses a fixed window size defined in advance, and the window slides along the sequence progressively. This is the most widely used of the four methods mentioned above; its popularity is due to the many machine learning and statistical knowledge-based techniques that can be applied [14,15]. The second approach, in contrast, uses the whole sequence: a similarity measure is applied to each pair of sequences to obtain the distance between them. These methods have their origins in bioinformatics and text processing.

The third approach generally includes the recurrent neural network (RNN), long short-term memory (LSTM), and hidden Markov model (HMM), which have all been employed successfully on various intrusion detection tasks [17]. However, HMM algorithms have been criticized for being computationally expensive and for the poor performance resulting from their brief dependence on initial system calls. The last method was initially proposed to improve a vector space model by separating essential n-gram features. Recently, a much more ambitious model suggested creating phrases, sentences, and ultimately a language out of sequences of system calls [5]. The proposed model fits into the "Global kernel-based approaches."

This section reviews a few useful strategies researchers have suggested during the last 10 years, especially those enabling system call analysis. All these works are used in one of the four approaches cited above. Moreover, Table 1 highlights the successes and shortcomings of each of those works.

Table 1

Comparison of HIDS in the related work

[18]. Dataset: ADFA-LD. Technique: RNN. Strength: can predict a sequence of system calls that will be executed in the future. Weakness: the false alarm rate is not discussed. Results: accuracy of 96%. Detection time: training time of 48 h, but detection time is not discussed.

[3]. Dataset: ADFA-LD, UNM. Technique: similarity-based; introduces a new similarity measure, SC4ID (Sequence Covering for Intrusion Detection), to compare a symbolic system call sequence with a set of symbolic system call sequences. Weakness: the false alarm rate is not discussed. Results: accuracy of 90%. Detection time: 900 s for the ADFA-LD dataset and 454 s for the UNM dataset.

[4]. Dataset: ADFA-LD. Technique: anomaly detection algorithm using distinct short-sequence extraction from system call traces. Strengths: detection of zero-day attacks; since it can learn quickly and gradually, it is adaptable to environmental changes without having to completely rebuild the classifier. Weaknesses: the false alarm rate needs to be decreased by improving the extraction and classification algorithms; the abnormality threshold value is determined empirically and has to be determined automatically. Results: 90.48% detection rate, 22.5% false alarm rate. Detection time: learning time of about 30 s; no detection time is discussed.

[13]. Dataset: ADFA-LD. Technique: construct embedding vectors for all system calls; model the sequences with system call embedding and weighting. Strength: the sequence embedding model presented is the first to convert system call sequences into embedding vectors to improve detection performance, and it shows good performance. Weakness: to make this model practical, the running or at least the detection time must be discussed. Results: false positive rate of 5.3%; true positive rate of 91.7%. Detection time: not discussed.

[14]. Dataset: ADFA-LD. Technique: convolutional neural network (CNN) with LSTM; a fixed-size window defines the system call sequence. Strength: a high detection rate. Weakness: the false alarm rate is not discussed. Results: accuracy of 96%. Detection time: not discussed.

A good and performant IDS, whether a HIDS or a NIDS, must be able to provide results for at least these metrics: accuracy, false alarm rate, and detection time. As seen in Table 1, most of the cited papers give results for just one or two of these metrics. In the following sections, we present the new BSA-HIDS model, which addresses all of these metrics, and we show its effectiveness compared to the recent works cited in this table.

3 Edit distance-based algorithms

There are several algorithms for calculating the distance between texts, and the computational strategy of these algorithms differs according to their views of the string. Thus, they are sorted into four categories: the first includes those based on calculating the editing distance (character by character), and the second includes using words (tokens). The third category includes those based on word sequences, and the fourth includes those based on phonetic meaning [19]. We are interested in the first two categories.

The first category of algorithms determines how many steps must be taken to transform one string into another; as the number of operations rises, the similarity between the two strings declines. The second category requires a set of tokens (words) as input rather than full strings: the more tokens the two sets share, the more similar they are to one another. We examine the following algorithms.

3.1 Hamming distance

The Hamming distance is the minimum number of substitutions required to transform the representation of string 1 into that of string 2, where a substitution replaces an element of string 1 with a new element to move closer to string 2 [20].

Let E be an alphabet of symbols and C a subset of E^n, the set of words of length n over E. Let A = (a_1, ..., a_n) and B = (b_1, ..., b_n) be words in C. The Hamming distance d(A, B) is defined as the number of places in which A and B differ, that is,

d(A, B) = |{ i : a_i ≠ b_i, i = 1, ..., n }|.

For any word w in C, the Hamming distance satisfies

d(A, B) ≥ 0, and d(A, B) = 0 if and only if A = B,

d(A, B) = d(B, A),

d(A, B) ≤ d(A, w) + d(w, B).
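As an illustration, the definition above can be sketched in a few lines of Python (the function name and the toy traces are ours, not from the paper):

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

# System call numbers treated as symbols: two traces of length 5
print(hamming_distance([6, 174, 11, 45, 33], [6, 174, 3, 45, 120]))  # 2
```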

3.2 Levenshtein distance

The number of adjustments needed to convert one string into another is counted to determine this distance. The algorithm modifies the first string to match the second using insertion, deletion, and replacement. The Levenshtein distance between two strings A and B is given by Lev_{A,B}(|A|, |B|) [21], where:

Lev_{A,B}(i, j) =
  max(i, j)                                  if min(i, j) = 0,
  min( Lev_{A,B}(i − 1, j) + 1,
       Lev_{A,B}(i, j − 1) + 1,
       Lev_{A,B}(i − 1, j − 1) + 1_(A_i ≠ B_j) )   otherwise.
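The recurrence above is usually evaluated with dynamic programming rather than recursion; a minimal Python sketch (function name ours):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```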

3.3 Jaro–Winkler distance

Two strings receive high scores from this method, if (1) they have matching characters close to one another and (2) the matching characters are in the same order. The Winkler algorithm, therefore, increases the Jaro similarity measure for equivalent initial characters.

sim_JW(A, B) = sim_J(A, B) + l · p · (1 − sim_J(A, B)),

where sim_JW(A, B) is the Jaro–Winkler similarity, sim_J(A, B) is the Jaro similarity, l is the length of the common prefix at the start of both strings (up to a maximum of 4), and p is a scaling factor. The scaling factor must not exceed 0.25; otherwise, because the prefix considered can be at most four characters long, the similarity could exceed 1. Winkler's original work used p = 0.1.

sim_J(A, B) = 0 if m = 0, and otherwise

sim_J(A, B) = (1/3) · ( m/|A| + m/|B| + (m − t)/m ),

where m is the number of matching characters and t is half the number of transpositions. Two characters from A and B are considered matching if they are identical and no farther than ⌊max(|A|, |B|)/2⌋ − 1 positions apart [19].
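For illustration only, the Jaro and Jaro–Winkler formulas can be implemented from scratch as below; this is our own sketch, not the rapidfuzz implementation used later in the experiments:

```python
def jaro(a, b):
    """Jaro similarity: 1.0 means identical, 0.0 means no matching characters."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1  # maximum distance for a match
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    m = 0  # number of matching characters
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(i + window + 1, len(b))):
            if not b_hit[j] and ca == b[j]:
                a_hit[i] = b_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # count out-of-order pairs among the matched characters
    k = 0
    trans = 0
    for i, ca in enumerate(a):
        if a_hit[i]:
            while not b_hit[k]:
                k += 1
            if ca != b[k]:
                trans += 1
            k += 1
    t = trans / 2  # t is half the number of transpositions
    return (m / len(a) + m / len(b) + (m - t) / m) / 3

def jaro_winkler(a, b, p=0.1):
    """Boost the Jaro score for a common prefix of up to 4 characters."""
    sim_j = jaro(a, b)
    l = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        l += 1
    return sim_j + l * p * (1 - sim_j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```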

3.4 Jaccard index

For this case of set similarity, the approach is to find the number of common tokens between two sets and divide it by the total number of unique tokens. It is described mathematically as follows [22]:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|),

where A and B are the two token sets produced by the user. In our prototype, we tokenize the system call sequences contained in the attack folder using a space delimiter, converting the system call numbers into tokens.
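A minimal sketch of this token-based computation, assuming whitespace-delimited system call numbers as in the prototype (function name and toy traces ours):

```python
def jaccard_similarity(s1, s2):
    """Jaccard index over whitespace-delimited tokens (system call numbers)."""
    A, B = set(s1.split()), set(s2.split())
    if not A and not B:
        return 1.0  # two empty traces are considered identical
    return len(A & B) / len(A | B)

# Two short traces written as space-separated system call numbers
print(jaccard_similarity("6 174 11 45", "6 174 3 45"))  # 0.6
```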

3.5 Dice coefficient

For this case of set similarity, the approach is to combine the two sets and look for the common tokens, then divide those by the total number of tokens.

DSC(A, B) = 2 · |A ∩ B| / (|A| + |B|).

The numerator doubles the intersection of the two token sets (which removes duplicates), based on the idea that if a token appears in both strings, its total count is twice the intersection. The denominator simply combines the tokens of both strings. Recall that the denominator of the Jaccard index is the union of the two sets; like the intersection, the union eliminates duplicates, whereas the Dice denominator does not. Dice therefore tends to overstate how similar two strings are [23].
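Correspondingly, a Dice coefficient sketch over the same kind of whitespace-delimited tokens (function name ours):

```python
def dice_coefficient(s1, s2):
    """Dice coefficient over whitespace-delimited tokens."""
    A, B = set(s1.split()), set(s2.split())
    if not A and not B:
        return 1.0  # two empty traces are considered identical
    return 2 * len(A & B) / (len(A) + len(B))

print(dice_coefficient("6 174 11 45", "6 174 3 45"))  # 0.75
```

On the same pair of token sets, Dice never falls below Jaccard, since DSC = 2J/(1 + J), which illustrates the overstatement noted above.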

4 Methodology

This section outlines the numerous steps to implement the proposed HIDS. The experimental dataset used in this work and its preprocessing is described utilizing three attack folder classifications described in Section 4.1. The BSA used in this study is described in Section 4.2.

Based on five metrics or similarity algorithms, BSA categorizes system call traces as either normal or attack data. A simplified systematic description of the method employed in the suggested study is shown in Figure 1.

Figure 1: Representation of the proposed anomaly detection system BSA-HIDS.

4.1 ADFA-LD dataset and its preprocessing

Linux dataset ADFA-LD was created by Creech and Hu [5] using an auditing tool named auditd.

It was compiled using the fully patched Ubuntu 11.04 operating system and kernel 2.6.38. Numerous services, including a web server, database server, SSH server, FTP server, etc., are being run by the operating system to capture sequences that represent attack and normal sequences.

As shown in Table 2, the ADFA-LD is divided into three distinct data folders; each folder has its system call trace files. The Training data master (TDM) folder and Validation data master (VDM) folder represent the normal data. On the other hand, Attack data master (ADM) represents attack data. ADM folder includes six other attack data types: “Adduser,” “Hydra-FTP,” “Hydra-SSH,” “JavaMeterpreter,” “Meterpreter,” and “Web shell.”

Table 2

Number of system calls traces in different categories of ADFA-LD [5]

Traces System calls
Training data 833 308,077
Validation data 4,372 2,122,085
Attack data 746 317,388
Total 5,951 2,747,550

The preprocessing in this approach consists of dividing the ADM folder into a set of groups. The other two folders, TDM and VDM, are not divided. After thoroughly studying the ADM files, we arrived at the classifications shown in Figure 2 and Tables 3 and 4.

Figure 2: Folders in classification 1 for ADM.

Table 3

Folders in classification 2 for ADM

ADM
Adduser HydraFTP Hydra-SSH Java-Meterpreter Meterpreter Web shell
Adduser-Different FTP-Different SSH-Different Java-Different Meterpreter-Different WSDifferent
Adduser961 HydraFTP961 Hydra-SSH961 Java-Meterpreter961 Meterpreter961 WS961
Adduser1371 HydraFTP1371 Hydra-SSH1371 Java-Meterpreter1371 Meterpreter1371 WS1371
Adduser1613 HydraFTP1442 Hydra-SSH1442 Java-Meterpreter1613 Meterpreter1533 WS1613
Adduser2311 HydraFTP1613 Hydra-SSH1613 Java-Meterpreter2311 Meterpreter1613 WS2311
Adduser2377 HydraFTP2311 Hydra-SSH2311 Java-Meterpreter2377 Meterpreter2293 WS2462
Adduser2462 HydraFTP2462 Hydra-SSH2424 Java-Meterpreter2462 Meterpreter2311 WS2783
Adduser2783 HydraFTP2783 Hydra-SSH2462 Java-Meterpreter2783 Meterpreter2462 WS4548
Adduser = 1 HydraFTP3523 Hydra-SSH2783 Java-Meterpreter = 1 Meterpreter2783 WS4568
HydraFTP3524 Hydra-SSH = 5738 WS4569
HydraFTP8978 WS4605
HydraFTP9300 WS4609
HydraFTP11504 WS961
HydraFTP13541 WS1371
Table 4

Folders in classification 3 for ADM

  • Different

  • 2462

  • 5738

  • 13541

  • 11504

  • 8978

  • =1

  • 2783

  • 1533

  • 2424

  • 4605

  • 9300

  • 961

  • 3523

  • 2293

  • 2377

  • 4568

  • 1442

  • 1371

  • 3524

  • 4548

  • 4609

  • 4569

  • 1613

  • 2311

Note that folder “Different” regroups all the traces that are not contained in other folders.

It was noticed that all the system call traces with the same number at the end of the file name have the same system calls but with different occurrences. Therefore, classifications 2 and 3 were proposed based on this observation.

In classification 2, we partitioned the files by each type of attack. Moreover, for each type of attack, we then partitioned it into a set of groups. Each group contains files that are similar in name, as shown in Table 3. The same principle was used to define the set of groups in classification 3. However, they were defined concerning the whole set of attacks this time. Table 4 shows this distribution.
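The grouping by trailing file-name number that underlies classifications 2 and 3 can be sketched as follows (the function name and sample file names are ours; "WS_xyz" is a hypothetical file without a trailing number, which would fall into the "Different" group):

```python
import re
from collections import defaultdict

def group_by_suffix(filenames):
    """Group trace file names by their trailing number; others go to 'Different'."""
    groups = defaultdict(list)
    for name in filenames:
        m = re.search(r"(\d+)$", name)
        groups[m.group(1) if m else "Different"].append(name)
    return dict(groups)

files = ["Adduser961", "HydraFTP961", "Adduser1371", "WS_xyz"]
print(group_by_suffix(files))
```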

4.2 Principle of detection

The BSA normal and abnormal behavior detection algorithm is based on the following principles:

  • The similarity measures are normalized in such a way that a similarity between two strings approaching 0 means the two strings are similar, while a similarity approaching 1 means they are different. In our case, this translates into the coupling and decoupling factors.

  • A trace attack that we want to test must be close to all the other trace attacks (the class to which it belongs depends on the classification chosen), which translates into a very low similarity. On the other hand, its similarity to the set of valid and training data must be very high.

  • A valid trace we want to test must be close to all the other valid and training traces, resulting in a very low similarity. But, its similarity to all attack traces must be very high.

  • The threshold by which the test is carried out is variable, relative to each trace.

Let

S be the set of all traces,

S_A be the set of all attack traces, S_A ⊆ S,

S_T be the set of all training traces, S_T ⊆ S,

S_V be the set of all validation traces, S_V ⊆ S.

We call M the set of all possible multisets of traces from S_A that can be created using Classification 1, Classification 2, and Classification 3.

m_{i,n} ∈ M is called the set of attacks number i, containing n traces.

The One-to-All Trace Similarity Algorithm (OATSA) calculates the average similarity of one trace to a set of traces.

Algorithm 1: OATSA

1. Input: x, a single trace
2. Input: X, a set of traces
3. Output: float
4. s ← 0
5. for i in range(0, |X|) do
6.   s ← s + sim(x, X[i])
7. end for
8. return s / |X|

Note that in this algorithm, sim can be any one of the similarity measures listed above (Levenshtein, Jaro–Winkler, Jaccard, Hamming, or Dice coefficient).
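In Python, OATSA is essentially an average over pairwise similarities; a sketch with an injectable similarity function (names and toy data ours):

```python
def oatsa(x, X, sim):
    """Average similarity of a single trace x to every trace in the set X."""
    return sum(sim(x, t) for t in X) / len(X)

# Toy normalized similarity on equal-length traces: fraction of differing positions
def sim_toy(a, b):
    return sum(p != q for p, q in zip(a, b)) / len(a)

traces = [[6, 174, 11], [6, 174, 3], [6, 2, 11]]
print(oatsa([6, 174, 11], traces, sim_toy))  # 2/9 ≈ 0.2222
```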

Algorithm 2: BSA

1. Input: S_A, S_T, S_V
2. Input: x, a system call trace to test, (x ∈ m_i) or (x ∈ S_V)
3. Output: a decision value, normal or anomaly
4. Calculate:
5. a ← OATSA(x, m_i)
6. b ← OATSA(x, S_T)
7. c ← OATSA(x, S_V)
8. if (a < max(b, c)) then
9.   return anomaly
10. else if (a ≥ min(b, c)) then
11.   return normal
12. end if

where max(b, c) returns the maximum of b and c, and min(b, c) returns the minimum of b and c.

Note that if a < max(b, c) is true, an attack is detected, and the true positive (TP) count increments; otherwise, the false negative (FN) count increments. In addition, if a ≥ min(b, c) is true, the true negative (TN) count increments; otherwise, the false positive (FP) count increments. The two test rules in the BSA algorithm are defined based on the concepts of inter-class distance (decoupling factor) and intra-class distance (coupling factor).

The intra-class distance corresponds to the distance between attacks placed in an attack set m i . A small intra-class distance between attacks belonging to the same set m i can be translated in this case by the presence of the same system call numbers in the traces of this set or by the almost identical sequence of these system call numbers. The notion of intra-class distance makes it possible to highlight the heterogeneity of the sets m i resulting from classifications 1, 2, and 3 in order to choose the best classification. The inter-class distance corresponds to the distance between the sets S A , S T , S V within the whole space of the system call traces S . The further apart they are, the stronger the inter-class distance will be.

Therefore, the distance between an attack and the set of all attacks must be less than the maximum of the average distance of that attack from all traces contained in the set S T and its average distance from all traces contained in the set S V . On the other hand, the distance between a normal trace and the set of all attacks must be greater than or equal to the minimum of the average distance of that normal trace from all traces contained in the set S T and its average distance from all traces contained in the set S V .
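Putting the two rules together, the decision step can be sketched as follows (function names, toy traces, and the toy similarity are ours, for illustration only, not the paper's implementation):

```python
def oatsa(x, X, sim):
    """Average similarity of trace x to the traces in X (Algorithm 1)."""
    return sum(sim(x, t) for t in X) / len(X)

def bsa(x, m_i, S_T, S_V, sim):
    """Decision rule of Algorithm 2: low distance to attacks means anomaly."""
    a = oatsa(x, m_i, sim)  # distance to the attack set of x's class
    b = oatsa(x, S_T, sim)  # distance to the training (normal) set
    c = oatsa(x, S_V, sim)  # distance to the validation (normal) set
    if a < max(b, c):       # per-trace threshold: max(b, c)
        return "anomaly"
    return "normal"         # a >= max(b, c) implies a >= min(b, c)

# Toy normalized distance: similar traces score near 0, dissimilar near 1
def sim_toy(p, q):
    u, v = set(p.split()), set(q.split())
    return 1 - len(u & v) / len(u | v)

attacks = ["6 174 11 45", "6 174 11 3"]
training = ["102 4 5 221", "102 4 5 6"]
validation = ["102 4 221 240"]
print(bsa("6 174 11 45", attacks, training, validation, sim_toy))  # anomaly
```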

5 Experiments and results

5.1 Experimental environment

Experiments used Python under VMware with Ubuntu 20.04.5 LTS Linux 64-bit, 24 GB of memory, and a 10-core processor. The Levenshtein and Jaro–Winkler algorithms are from the Python rapidfuzz library, and the other three algorithms are from the Python textdistance library. All these algorithms are normalized so that a value approaching 0 means that the sequences are similar, and a value approaching 1 means that the sequences are not similar.

5.2 Evaluation metrics

With the use of the sub-metrics TPs, TNs, FPs, and FNs, we assessed and examined each classification’s performance using a confusion matrix, accuracy, FNR, FPR, precision, recall, and F1-score. For completeness, these measures are defined as follows:

Confusion matrix: Demonstrates how many accurate and inaccurate predictions a model made. It considers all factors and can visually display results for each factor, making it a common evaluation tool, especially when attempting to comprehend and enhance an algorithm’s performance.

Accuracy: The proportion of correct classification predictions among all predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

False negative rate (FNR): The proportion of actual positive examples for which the model incorrectly predicted the negative class.

FNR = FN / (FN + TP)

False positive rate (FPR): The proportion of actual negative examples for which the model incorrectly predicted the positive class.

FPR = FP / (FP + TN)

Precision: The percentage of correct predictions among those where the model predicted the positive class.

Precision = TP / (TP + FP)

Recall: The percentage of real attack instances covered by the model.

Recall = TP / (TP + FN)

F1-score: Combines the precision and recall metrics into a single metric.

F1-score = (2 · Precision · Recall) / (Precision + Recall)

The metrics provided above directly indicate each classifier’s performance.
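These ratios are straightforward to compute from the four confusion-matrix counts; a small helper for sanity-checking reported numbers (the function name and the counts below are ours, hypothetical, not from the paper's experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "fnr": fn / (fn + tp),
        "fpr": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for illustration
m = classification_metrics(tp=95, tn=880, fp=0, fn=5)
print(round(m["accuracy"], 3), round(m["fnr"], 3))  # 0.995 0.05
```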

5.3 Results and discussion

To avoid presenting too many graphs, the results in this section are grouped by classifications 1, 2, and 3.

All distances used are normalized as follows: Let S 1 , S 2 be two system call sequences.

Sim_Levenshtein(S1, S2) = rapidfuzz.string_metric.levenshtein(S1, S2) / max(|S1|, |S2|),

Sim_JaroWinkler(S1, S2) = (100 − rapidfuzz.string_metric.jaro_winkler_similarity(S1, S2)) / 100,

Sim_Jaccard(S1, S2) = 1 − textdistance.jaccard.normalized_similarity(S1.split(), S2.split()),

Sim_Hamming(S1, S2) = 1 − textdistance.hamming.normalized_similarity(S1, S2),

Sim_DiceCoefficient(S1, S2) = 1 − textdistance.sorensen(S1.split(), S2.split()),

where

Sim Levenshtein is the similarity using Levenshtein algorithm,

Sim Jaro Winkler is the similarity using Jaro–Winkler,

Sim Jaccard is the similarity using Jaccard,

Sim Hamming is the similarity using Hamming,

Sim Dice Coefficient is the similarity using Dice coefficient,

Rapidfuzz and textdistance are libraries in Python.

Tables 5–9 and Figures 3–7 show the performance in terms of accuracy, FNR, FPR, recall, precision, and F1-score for all three classifications under the Levenshtein, Jaro–Winkler, Hamming, Jaccard, and Dice coefficient measures. For the first three similarity measures, classification 2 outperforms the other two (1 and 3). For the last two similarity measures, however, the algorithms perform better with classification 3.

Table 5

Levenshtein result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.97 0.15 0 1 0.84 0.91
Classification 2 0.99 0.04 0 1 0.95 0.97
Classification 3 0.98 0.06 0 1 0.93 0.96

The bold values indicate the line of the classification that gave good results.

Table 6

Jaro–Winkler result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.92 0.1 0.06 0.69 0.84 0.76
Classification 2 0.94 0.04 0.06 0.72 0.95 0.82
Classification 3 0.93 0.05 0.06 0.71 0.94 0.81

The bold values indicate the line of the classification that gave good results.

Table 7

Jaccard result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.97 0.11 0.01 0.92 0.88 0.90
Classification 2 0.98 0.02 0.01 0.93 0.97 0.95
Classification 3 0.98 0.01 0.01 0.93 0.98 0.95

The bold values indicate the line of the classification that gave good results.

Table 8

Hamming result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.97 0.11 0.009 0.93 0.88 0.91
Classification 2 0.98 0.04 0.009 0.94 0.95 0.95
Classification 3 0.97 0.09 0.009 0.94 0.90 0.92

The bold values indicate the line of the classification that gave good results.

Table 9

Dice coefficient result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.89 0.13 0.10 0.59 0.86 0.70
Classification 2 0.90 0.04 0.10 0.61 0.95 0.74
Classification 3 0.91 0.01 0.10 0.62 0.98 0.76

The bold values indicate the line of the classification that gave good results.

Figure 3 
                  Levenshtein results on three classifications.
Figure 3

Levenshtein results on three classifications.

Figure 4 
                  Jaro–Winkler results on three classifications.
Figure 4

Jaro–Winkler results on three classifications.

Figure 5 
                  Jaccard results on three classifications.
Figure 5

Jaccard results on three classifications.

Figure 6 
                  Hamming results on three classifications.
Figure 6

Hamming results on three classifications.

Figure 7 
                  Dice coefficient results on three classifications.
Figure 7

Dice coefficient results on three classifications.

We notice that the folder named “Different” in each classification gives a signifying number of false negatives. Let us take, for example, this folder in classification 2 for the JavaMeterpreter attack and using the Levenshtein algorithm, the curve is shown in Figure 8.

Figure 8 
                  Intrusion test curve for file named “Different.”
Figure 8

Intrusion test curve for file named “Different.”

It can be seen from Figure 8 that the attack numbers (5, 10, 15, 20, 25, 26, 31, 42, 43, 44) are poorly detected. This is due to their textual distances, which are close to both the training and validation class and far from the attack class.

Another thing that draws our attention is that if we define a single threshold to classify the system call sequences, we will have a very high false alarm rate. Indeed, let us take the example of Figure 8. If we define a threshold of 0.5, i.e., sequences with a textual distance below 0.5 are considered as attack sequences, and those above 0.5 are considered normal sequences. Such a definition will consider all 56 sequences shown in Figure 8 as normal sequences, which are not. Thus, here we emphasize the strength of the model, which is the variable definition of the threshold that fits all system sequences.

Table 10

Comparison of the five similarity metrics on classification 2

Accuracy FNR FPR Precision Recall F1-score
Levenshtein 0.99 0.04 0 1 0.95 0.97
Jaro–Winkler 0.94 0.04 0.06 0.72 0.95 0.82
Jaccard 0.98 0.02 0.01 0.93 0.97 0.95
Hamming 0.98 0.04 0.009 0.94 0.95 0.95
Dice coefficient 0.90 0.04 0.10 0.61 0.95 0.74

The bold values indicate the best result for each metric (Accuracy, FNR, FPR, Precision, Recall, F1-Score and time).

Figure 9 and Table 10 shows a comparison according to classification 2 for the five similarity measures used. Jaro–Winkler/Dice coefficient gives the same recall value of 0.95 and the same FNR of 0.04. However, the Jaro–Winkler algorithm performs better, it gives a high value: accuracy of 0.94, precision of 0.72, and F1-score of 0.82 compared to Dice coefficient, which gives 0.90, 0.61, and 0.74 respectively. Jaccard/Hamming gives the same accuracy, 0.98, same F1-score, 0.95; however, the false positives of the Jaccard algorithm are greater than those of the Hamming algorithm. The last algorithm, Levenshtein, gives outstanding performance of 0.99 in terms of accuracy, 0.04 FNR, 0 FPR, 1 precision, 0.95 recall, and 0.97 F1-score. These excellent results are obtained because this algorithm processes the system call sequences to measure similarity.

Figure 9 
                  Comparison of the five similarity metrics on classification 2.
Figure 9

Comparison of the five similarity metrics on classification 2.

Before selecting the best method and classification, another parameter is taken into consideration, the elapsed time to implement the HIDS. Each algorithm’s running times in seconds are displayed in the following table:

Table 11 represents the running time for each algorithm. We notice that Jaro–Winkler is the fastest one with 720 s, Dice coefficient with 2,940 s, Jaccard algorithm with 3,060 s, Levenshtein algorithm with 3,300 s, and finally, Hamming took a significant time processing with 14,880 s. Jaro–Winkler’s speed is due to the implementation of the rapidfuzz library, which executes this algorithm in 0.00094 s and Levenshtein in 0.00312 s. Therefore, the long time to execute the Hamming algorithm is due to its implementation in the textdistance library, which takes about 0.03531 s.

Table 11

Elapsed time for each algorithm

Levenshtein (s) Jaro–Winkler (s) Jaccard (s) Hamming (s) Dice coefficient (s)
Classification 2 3,300 720 3,060 14,880 2,940

It should be noted that the version of Levenshtein’s algorithm described in the study [24] gives an accuracy of 1, FNR of 0, FPR of 0, Precision, Recall, and F1-score of 1, but the implementation time is very long, about 1 month and 15 days with the capabilities of the virtual environment described above.

From the confusion matrix in Figure 10, it can be seen in more detail that Jaccard, Hamming, and Levenshtein algorithms showed almost the same high-performance level and displayed a similar trend regarding correct and incorrect classifications. However, Jaro–Winkler and Dice coefficient showed a lower performance value than others. While having a generally higher performance in terms of elapsed time, Jaro–Winkler does have the second-highest score for FN and FP. Although this rating is not as bad as an FP rating, it is still high relative to the other algorithms, such as Levenshtein, Jaccard, Hamming, and Dice coefficient, which shows FN/FP of 30/0, 21/51, 31/43, and 35/442 respectively.

Figure 10 
                  Confusion matrix for all algorithm/classification 2.
Figure 10

Confusion matrix for all algorithm/classification 2.

The thing that caught our attention was the number of false negatives obtained by each algorithm, which ranged from 21 to 35. These values constitute the number of undetected attack sequences. This means that these attack sequences belong to the normal behavior (of the 833 TDM sequences). In this case, attack sequences may be identical to normal data sequences, and the model could not detect them. Instead, we use the occasion to highlight that if we eliminate these sequences from the ADM, the model becomes very efficient.

The most important thing to notice in this work is that all the described similarity measures give almost similar performances with different running times. This is due to how the presence of an attack is tested and, more precisely, how the threshold is defined. This is described in the BSA algorithm of this model.

To evaluate the proposed model, it is imperative to test it with other models which fall in the same field. From Table 12, all these models use ADFA-LD as a benchmark. We notice that BSA-HIDS (Jaro–Winkler) and BSA-HIDS (Levenshtein) give a high performance than that in studies [3,4] in terms of accuracy and false alarm rate, 94% accuracy, 5% FAR and 99% accuracy, 2% FAR using just 720 and 3,300 s, respectively. The result achieved by Marteau [3] was 90% accuracy in 900 seconds, and those achieved by Yaqoob and Madkour [4] were 90% accuracy and 22% FAR in seconds, but the number of seconds is unclear.

Table 12

Comparison of proposed HIDS with HIDS of the related work

Dataset Accuracy FAR Time (s)
BSA-HIDS (Jaro–Winkler) ADFA-LD 0.94 0.05 720
BSA-HIDS (Levenshtein) ADFA-LD 0.99 0.02 3,300
[3] ADFA-LD 0.90 / 900
[4] ADFA-LD 0.90 0.22 /

The bold values indicate the best result for each metric (Accuracy, FNR, FPR, Precision, Recall, F1-Score and time).

The proposed model has been developed aiming to have a performant HIDS, which is achieved and displayed in Table 12. The obtained results of the proposed system, BSA-HIDS, are superior to all up-to-date published systems in terms of accuracy, false alarm rate, and detection time. Although this model produced encouraging results, it does have limitations. The model cannot detect attack sequences that are an exact sequence of normal sequences.

6 Conclusion and perspectives

In this work, to identify unusual system call sequences, we have designed and implemented BSA-HIDS, a novel algorithm based on the sequence similarity measure. We used five similarity measures to test our model and choose the best one that performs well. The use case determines which string similarity algorithm is chosen. To generate the similarity score, all the algorithms mentioned earlier, in one way or another, seek to identify the common and uncommon strings’ components. Comparing our model to the most recent models, the following are its main advantages:

  • Its simplicity to implement; no definition for window size, no maximal length for the n-grams, and no hidden architectures (LSTM, HMM, and CNN).

  • Can easily be used for online exploitation.

  • Can detect zero-day attacks.

  • Real-time anomaly HIDS.

  • The threshold definition takes each system call sequence into account.

  • The proposed system provides the best combination of a high detection rate and very small running time.

The observed accuracy is significantly higher compared to all recent systems. Additionally, the suggested model offers the ideal fusion of rapid response (running time) and high detection rate. Because of the definition of threshold, it has a high ability to recognize the zero-day attack and is flexible enough to react to environmental changes.

We have identified a shortcoming of the BSA-HIDS model: it needs to distinguish between sequences that are exact sequences of the training set TDM. However, any alternative approach would need to handle the circumstance correctly.

Yet, certain shortcomings in the suggested HIDS still need to be considered for future work. The BSA algorithm need to be improved to lower the false alarm rate. As part of future work, we plan to test the model’s adeptness on other datasets like UNM and NSL-KDD. We aim to localize and delete attack sequences that are the same as normal sequences to test the model’s efficiency.

Finally, to optimize the present work, first, we can define a new similarity algorithm. Effectively, we will rewrite the Levenshtein algorithm to take into account words and not characters; this way, the execution time can be reduced. Indeed, since the high system call number consists of three digits, in the best case, the execution time can be reduced to 1,100 seconds (3,300/3 = 1,100 s). Second, we can minimize the number of files in each classification to minimize detection time.

  1. Conflict of interest: Authors state no conflict of interest.

References

[1] Finnerty K, Fullick S, Motha H, Shah JN, Button M, Wang V. Cyber security breaches survey. England, United Kingdom: University of Portsmouth Ageing Network; 2019.10.1016/S1353-4858(19)30044-3Search in Google Scholar

[2] Huma ZE, Latif S, Ahmad J, Idrees Z, Ibrar A, Zou Z, et al. A hybrid deep random neural network for cyberattack detection in the industrial internet of things. IEEE Access. 2021;9:55595–605.10.1109/ACCESS.2021.3071766Search in Google Scholar

[3] Marteau P.-F. Sequence covering for efficient host-based intrusion detection. IEEE Trans Inf Forensics Secur. 2019;14(4):994–1006. 10.1109/tifs.2018.2868614.Search in Google Scholar

[4] Yaqoob SI, Madkour MAI. Enhanced host-based intrusion detection using system call traces. J King Abdulaziz Univ Comput Inf Technol Sci. 2019;8(2):93–109. 1440 A.H./2019 A.D. 10.4197/Comp.8-2.7.Search in Google Scholar

[5] Creech G, Hu J. A semantic approach to host-based intrusion detection systems using contiguous and discontiguous system call patterns. IEEE Trans Comput. April 2014;63(4):807–19.10.1109/TC.2013.13Search in Google Scholar

[6] Pavithran P, Mathew S, Namasudra S, Srivastava G. A novel cryptosystem based on DNA cryptography hyperchaotic systems and a randomly generated Moore machine for cyber physical systems. Comput Commun. 2022;188:1–12. ISSN 0140-3664. 10.1016/j.comcom.2022.02.008.Search in Google Scholar

[7] Namasudra S. A secure cryptosystem using DNA cryptography and DNA steganography for the cloud-based IoT infrastructure. Comput Electr Eng. 2022;104(Part A):108426. ISSN 0045-7906. 10.1016/j.compeleceng.2022.108426 Search in Google Scholar

[8] Das S, Namasudra S. MACPABE: Multi‐Authority‐based CP‐ABE with efficient attribute revocation for IoT‐enabled healthcare infrastructure. Int J Netw Manag. April 2022. 10.1002/nem.2200.Search in Google Scholar

[9] Namasudra S, Crespo RG, Kumar SAP. Introduction to the special section on advances of machine learning in cybersecurity (VSI-mlsec). Comput Electr Eng. May 2022;100:108048. 10.1016/j.compeleceng.2022.108048.Search in Google Scholar

[10] Sarkar M, Saha K, Namasudra S, Roy P. An efficient and time saving web service based android application. Proj: Android Project NIC. August 2015.Search in Google Scholar

[11] Kumari S, Yadav RJ, Namasudra S, Hsu C-H. Intelligent deception techniques against adversarial attack on the industrial system. Int J Intell Syst. May 2021;36(5):2412–37.10.1002/int.22384Search in Google Scholar

[12] Liu M, Xue Z, Xu X, Zhong C, Chen J. Host-based intrusion detection system with system calls: Review and future trends. ACM Comput Surv. Nov 2018;51(5):1–36.10.1145/3214304Search in Google Scholar

[13] Lu Y, Teng S. Application of sequence embedding in host-based intrusion detection system. IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD); 2021.10.1109/CSCWD49262.2021.9437683Search in Google Scholar

[14] Frances O, Briana W. Deep learning-based hybrid model for efficient anomaly detection. Int J Adv Comput Sci Appl. 2022; 13(4):975–9.10.14569/IJACSA.2022.01304111Search in Google Scholar

[15] Zhang Y, Luo S, Pan L, Zhang H. Syscall-BSEM: Behavioral semantics enhancement method of system call sequence for high accurate and robust host intrusion detection. Future Gener Comput Syst. 2021;125:112–26. ISSN 0167-739X.10.1016/j.future.2021.06.030Search in Google Scholar

[16] Ouarda L, Malika B, Yousfi NE, Brahim B. Improving the efficiency of intrusion detection in information systems. J Intell Syst. 2022;31(1):835–54. 10.1515/jisys-2022-0059.Search in Google Scholar

[17] Kim J, Kim J, Le Thi Thu H, Kim H. Long short term memory recurrent neural network classifier for intrusion detection. International Conference on Platform Technology and Service; Feb 2016. p. 1–5.10.1109/PlatCon.2016.7456805Search in Google Scholar

[18] Lv S, Wang J, Yang Y, Liu J. Intrusion prediction with system-call sequence-to-sequence model. IEEE Access. 2018;6:71413–21. 10.1109/access.2018.2881561.Search in Google Scholar

[19] Yulianto MA, Nurhasanah N. The hybrid of Jaro-Winkler and Rabin-Karp algorithm in detecting Indonesian text similarity. J Online Inform. June 2021;6(1):88–95.10.15575/join.v6i1.640Search in Google Scholar

[20] Trouvilliez B. Textual data similarity for short opinion text learning and product search, Thesis. To obtain the degree of doctor of the University of Artois. Defended on May 13, 2013.Search in Google Scholar

[21] Logan R, Fleischmann Z, Annis S, Wehe AW, Tilly JL, Woods DC, et al. 3GOLD: Optimized Levenshtein distance for clustering third‑generation sequencing data. BMC Bioinforma. 2022;95:23.10.1186/s12859-022-04637-7Search in Google Scholar PubMed PubMed Central

[22] da Fontoura Costa L. Further Generalizations of the Jaccard Index. arXiv 2021, https://arxiv.org/abs/2110.09619.Search in Google Scholar

[23] Carass A, Roy S, Gherman A, Reinhold JC, Jesson A, Arbel T, et al. Evaluating white matter lesion segmentations with refined sørensen-dice analysis. Sci Rep. 2020;10(1):8242.10.1038/s41598-020-64803-wSearch in Google Scholar PubMed PubMed Central

[24] https://en.wikipedia.org/wiki/Levenshtein_distance.Search in Google Scholar

Received: 2022-11-07
Revised: 2022-12-11
Accepted: 2022-12-12
Published Online: 2023-04-10

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.

Articles in the same Issue

  1. Research Articles
  2. Salp swarm and gray wolf optimizer for improving the efficiency of power supply network in radial distribution systems
  3. Deep learning in distributed denial-of-service attacks detection method for Internet of Things networks
  4. On numerical characterizations of the topological reduction of incomplete information systems based on evidence theory
  5. A novel deep learning-based brain tumor detection using the Bagging ensemble with K-nearest neighbor
  6. Detecting biased user-product ratings for online products using opinion mining
  7. Evaluation and analysis of teaching quality of university teachers using machine learning algorithms
  8. Efficient mutual authentication using Kerberos for resource constraint smart meter in advanced metering infrastructure
  9. Recognition of English speech – using a deep learning algorithm
  10. A new method for writer identification based on historical documents
  11. Intelligent gloves: An IT intervention for deaf-mute people
  12. Reinforcement learning with Gaussian process regression using variational free energy
  13. Anti-leakage method of network sensitive information data based on homomorphic encryption
  14. An intelligent algorithm for fast machine translation of long English sentences
  15. A lattice-transformer-graph deep learning model for Chinese named entity recognition
  16. Robot indoor navigation point cloud map generation algorithm based on visual sensing
  17. Towards a better similarity algorithm for host-based intrusion detection system
  18. A multiorder feature tracking and explanation strategy for explainable deep learning
  19. Application study of ant colony algorithm for network data transmission path scheduling optimization
  20. Data analysis with performance and privacy enhanced classification
  21. Motion vector steganography algorithm of sports training video integrating with artificial bee colony algorithm and human-centered AI for web applications
  22. Multi-sensor remote sensing image alignment based on fast algorithms
  23. Replay attack detection based on deformable convolutional neural network and temporal-frequency attention model
  24. Validation of machine learning ridge regression models using Monte Carlo, bootstrap, and variations in cross-validation
  25. Computer technology of multisensor data fusion based on FWA–BP network
  26. Application of adaptive improved DE algorithm based on multi-angle search rotation crossover strategy in multi-circuit testing optimization
  27. HWCD: A hybrid approach for image compression using wavelet, encryption using confusion, and decryption using diffusion scheme
  28. Environmental landscape design and planning system based on computer vision and deep learning
  29. Wireless sensor node localization algorithm combined with PSO-DFP
  30. Development of a digital employee rating evaluation system (DERES) based on machine learning algorithms and 360-degree method
  31. A BiLSTM-attention-based point-of-interest recommendation algorithm
  32. Development and research of deep neural network fusion computer vision technology
  33. Face recognition of remote monitoring under the Ipv6 protocol technology of Internet of Things architecture
  34. Research on the center extraction algorithm of structured light fringe based on an improved gray gravity center method
  35. Anomaly detection for maritime navigation based on probability density function of error of reconstruction
  36. A novel hybrid CNN-LSTM approach for assessing StackOverflow post quality
  37. Integrating k-means clustering algorithm for the symbiotic relationship of aesthetic community spatial science
  38. Improved kernel density peaks clustering for plant image segmentation applications
  39. Biomedical event extraction using pre-trained SciBERT
  40. Sentiment analysis method of consumer comment text based on BERT and hierarchical attention in e-commerce big data environment
  41. An intelligent decision methodology for triangular Pythagorean fuzzy MADM and applications to college English teaching quality evaluation
  42. Ensemble of explainable artificial intelligence predictions through discriminate regions: A model to identify COVID-19 from chest X-ray images
  43. Image feature extraction algorithm based on visual information
  44. Optimizing genetic prediction: Define-by-run DL approach in DNA sequencing
  45. Study on recognition and classification of English accents using deep learning algorithms
  46. Review Articles
  47. Dimensions of artificial intelligence techniques, blockchain, and cyber security in the Internet of medical things: Opportunities, challenges, and future directions
  48. A systematic literature review of undiscovered vulnerabilities and tools in smart contract technology
  49. Special Issue: Trustworthy Artificial Intelligence for Big Data-Driven Research Applications based on Internet of Everythings
  50. Deep learning for content-based image retrieval in FHE algorithms
  51. Improving binary crow search algorithm for feature selection
  52. Enhancement of K-means clustering in big data based on equilibrium optimizer algorithm
  53. A study on predicting crime rates through machine learning and data mining using text
  54. Deep learning models for multilabel ECG abnormalities classification: A comparative study using TPE optimization
  55. Predicting medicine demand using deep learning techniques: A review
  56. A novel distance vector hop localization method for wireless sensor networks
  57. Development of an intelligent controller for sports training system based on FPGA
  58. Analyzing SQL payloads using logistic regression in a big data environment
  59. Classifying cuneiform symbols using machine learning algorithms with unigram features on a balanced dataset
  60. Waste material classification using performance evaluation of deep learning models
  61. A deep neural network model for paternity testing based on 15-loci STR for Iraqi families
  62. AttentionPose: Attention-driven end-to-end model for precise 6D pose estimation
  63. The impact of innovation and digitalization on the quality of higher education: A study of selected universities in Uzbekistan
  64. A transfer learning approach for the classification of liver cancer
  65. Review of iris segmentation and recognition using deep learning to improve biometric application
  66. Special Issue: Intelligent Robotics for Smart Cities
  67. Accurate and real-time object detection in crowded indoor spaces based on the fusion of DBSCAN algorithm and improved YOLOv4-tiny network
  68. CMOR motion planning and accuracy control for heavy-duty robots
  69. Smart robots’ virus defense using data mining technology
  70. Broadcast speech recognition and control system based on Internet of Things sensors for smart cities
  71. Special Issue on International Conference on Computing Communication & Informatics 2022
  72. Intelligent control system for industrial robots based on multi-source data fusion
  73. Construction pit deformation measurement technology based on neural network algorithm
  74. Intelligent financial decision support system based on big data
  75. Design model-free adaptive PID controller based on lazy learning algorithm
  76. Intelligent medical IoT health monitoring system based on VR and wearable devices
  77. Feature extraction algorithm of anti-jamming cyclic frequency of electronic communication signal
  78. Intelligent auditing techniques for enterprise finance
  79. Improvement of predictive control algorithm based on fuzzy fractional order PID
  80. Multilevel thresholding image segmentation algorithm based on Mumford–Shah model
  81. Special Issue: Current IoT Trends, Issues, and Future Potential Using AI & Machine Learning Techniques
  82. Automatic adaptive weighted fusion of features-based approach for plant disease identification
  83. A multi-crop disease identification approach based on residual attention learning
  84. Aspect-based sentiment analysis on multi-domain reviews through word embedding
  85. RES-KELM fusion model based on non-iterative deterministic learning classifier for classification of Covid19 chest X-ray images
  86. A review of small object and movement detection based loss function and optimized technique
Downloaded on 12.5.2026 from https://www.degruyterbrill.com/document/doi/10.1515/jisys-2022-0259/html?lang=en
Scroll to top button