Article Open Access

Towards a better similarity algorithm for host-based intrusion detection system

Published/Copyright: April 10, 2023

Abstract

An intrusion detection system plays an essential role in system security by discovering and preventing malicious activities. Over the past few years, several research projects on host-based intrusion detection systems (HIDSs) have been carried out utilizing the Australian Defense Force Academy Linux Dataset (ADFA-LD). These HIDSs have also been subjected to various algorithm analyses to enhance their detection capability, aiming for high accuracy and low false alarms. However, less attention has been paid to the actual implementation of real-time HIDS. Our principal objective in this study is to create a performant real-time HIDS. We propose a new model, "Better Similarity Algorithm for Host-based Intrusion Detection System" (BSA-HIDS), using the same dataset, ADFA-LD. The proposed model uses three classifications to represent the attack folder according to certain criteria, and the entire system call sequence is used. Furthermore, this work uses textual distance and compares five algorithms (Levenshtein, Jaro–Winkler, Jaccard, Hamming, and Dice coefficient) to classify a system call trace as attack or non-attack, based on the notions of inter-class decoupling and intra-class coupling. The model can detect zero-day attacks because of the way the threshold is defined. The experimental results show good real-time detection performance for the Levenshtein/Jaro–Winkler algorithms: 99–94% in detection rate, 2–5% in false alarm rate, and 3,300–720 s in running time, respectively.

MSC 2010: 68M25

1 Introduction

Concern over the frequency of cyberattacks has grown recently due to the Internet’s rapid development. Around 32% of businesses and 22% of charities in the United Kingdom alone reported experiencing a cyberattack in 2019 [1]. Such attacks can be found using an intrusion detection system (IDS). Despite the tremendous success of the established intrusion detection methods, there has been an increase in interest in either enhancing the current methods or developing new ones [2].

It is much more difficult to identify zero-day attacks (attacks that have never been seen before), as no pattern or signature can be utilized to identify them. In addition, the rate at which system call sequence data required to access, control, or manage connected devices is produced increases rapidly. Therefore, there is a growing demand for effective intrusion detection algorithms that recognize, isolate, and handle suspicious patterns in system call sequences.

Effectively, a system call is a programmatic method through which a computer application asks the kernel of the operating system it runs on for a service. It offers a point of contact between processes and the operating system so that user-level processes can request its services. A trace representing the monitored process's behavior is thus created by a series of system calls, which correspond to the sequential list of service requests issued by a process to the kernel. A well-known example of a dataset containing system call traces is the Australian Defense Force Academy Linux Dataset (ADFA-LD), provided by the Australian Defense Force Academy. The ADFA-LD dataset has been used in numerous papers on intrusion detection. Specifically, it was created to assess anomaly detection and system call-based host-based intrusion detection systems (HIDSs).

In contrast to many other datasets used to evaluate HIDSs, the ADFA-LD is based on Linux local servers and is composed of thousands of system call traces for the most recent attacks and vulnerabilities in various applications. It also reflects the features of current Linux-based operating systems. Given this, it is anticipated that the ADFA-LD would establish a new benchmark for evaluating HIDS. According to Marteau [3], the following factors make it difficult to identify anomalous system call sequences in ADFA-LD:

  • The anomalies are context-dependent. The context of a system call’s existence, or more specifically, the system calls that come before and after it, can be used to determine whether it is abnormal. A sequence taken as a whole may be judged as abnormal.

  • Because system call sequences can vary in length and the alphabet used is relatively large (more than 300 system calls for the Linux system), system call sequence variability is very high.

To compare two sequences side by side or to offer several sequence alignments for a set of sequences, many different similarity measures have been proposed by bioinformatics over the years. Indeed, the similarity is a vague concept that can only be treated quantitatively using an appropriate mathematical representation of the objects to be compared and a comparison metric.

Data similarity analysis today gathers numerous methods and tools to discover the "essential" information in a text, to identify its elements, and to find their similarities and differences. The way to check the similarity between data points or data groups is to calculate the distance between them. For textual data, we likewise check the similarity between strings by calculating the distance between one text and another. These methods have successfully been customized to handle sequences of system calls.

To address the limitations cited in the paragraphs above, we describe in this study a new prototype, BSA-HIDS, which stands for Better Similarity Algorithm for Host-based Intrusion Detection System. We use the system call sequences of the ADFA-LD benchmark; each sequence, regardless of its length, is taken into account as a whole. As far as we know, this technique has not yet been proposed for system call sequence comparison, particularly in intrusion detection. It measures how close sequences are to one another in order to conclude whether a sequence is an attack or normal. The following are this study's contributions:

  • The whole ADFA-LD dataset is used; that is to say, whole system call sequences are used, and the model does not depend on a window size. Indeed, if a system call sequence contains 242 system calls, this model takes the entire sequence, unlike other models, which take just the first 100 system calls or a specified window size.

  • The performances of the three classifications in terms of detection rate, false positive rate, and false negative rate using a comparison of five similarity measures are almost identical. This means that whatever the classification of attacks is, the model performs very well due to how the threshold is defined (the threshold is variable; it is calculated for each sequence).

  • The preprocessing time to perform the three classifications is negligible, about 3 s (training time).

  • There are no parameters to learn.

  • The proposed model can detect zero-day attacks.

  • The performance of the anomaly IDS is improved, with a lower false negative rate of 0.04, a false positive rate of 0.0, and a higher accuracy of 0.99, in a short running time of 3,300 s, obtained by the BSA algorithm running on ten processor cores with the Levenshtein similarity. These results improve on recent works [3,4].

Section 2 of this article briefly reports the related key works. We detail the edit distance-based algorithms in Section 3. We describe the methodology in Section 4, discussing the experimental dataset, which is the system call sequence data released by Creech and Hu [5], the data preprocessing using three attack classifications, and finally, the detection principle used in this study. In Section 5, we discuss the results, and in Section 6, we provide the conclusion and directions for future research.

2 Related work

Computer security has become essential in protecting the integrity of information technology, such as computer systems, networks, and data from attack, damage, or unauthorized access. Several types of research have been conducted in this broad and multifaceted paradigm. Pavithran et al. [6] proposed a novel encryption process to protect a system from attacks. It is based on Deoxyribonucleic acid (DNA) cryptography, a hyperchaotic system, and a Moore machine. Namasudra [7] proposed a novel cryptosystem using DNA cryptography and DNA steganography for the cloud-based IoT infrastructure. Das and Namasudra [8] proposed a novel ciphertext policy attribute-based encryption (CP-ABE)-based fine-grained access control scheme to solve the attribute revocation problem in the CP-ABE technique utilized very often in an IoT-based healthcare system for encrypting patients’ healthcare data. Also, many other research works [9,10,11] address the security of systems.

In this work, we are interested in intrusions. A device that monitors a system for potential intrusions is called an IDS; it is a crucial tool for detecting security violations in real time. If the detection occurs on a network, the IDS is referred to as a NIDS, and if it occurs on a host, as a HIDS. Furthermore, we differentiate between two approaches. The signature-based intrusion detection approach searches for predefined patterns, such as known malicious instruction sequences used by malware or byte sequences in network packets, while the anomaly-based intrusion detection approach was primarily developed to detect unknown (zero-day) attacks. Both approaches have weaknesses: anomaly-based IDS is criticized for producing many false alarms, whereas signature-based IDS cannot identify zero-day attacks.

Many system calls-based anomaly detection models have been developed to increase detection rates and reduce false alarm rates in HIDS [2,3,12,13,14,15,16]. If we restrict the focus to anomaly detection in sequential data, we find that four basic approaches, according to Marteau [3], are taken to treat symbolic sequences: Window-based approaches, Global kernel-based approaches, Generative approaches, and Language-based approaches.

The first approach uses a fixed window size defined in advance, and the window slides along the sequence progressively. This is the most widely used of the four methods mentioned above; its popularity is due to the many machine learning and statistical knowledge-based techniques that can be applied [14,15]. The second approach, in contrast, uses the whole sequence: a similarity measure is applied to each pair of sequences to obtain the distance between them. These methods have their origins in bioinformatics and text processing.

The third approach generally includes the recurrent neural network (RNN), long short-term memory (LSTM), and hidden Markov model (HMM), which have all been employed successfully on various intrusion detection tasks [17]. However, HMM algorithms have been criticized for being computationally expensive and for the poor performance resulting from their brief dependence on initial system calls. The last method was initially proposed to improve a vector space model by separating essential n-gram features. Recently, a much more ambitious model suggested creating phrases, sentences, and ultimately a language out of sequences of system calls [5]. The proposed model fits into the "Global kernel-based approaches."

This section reviews a few useful strategies researchers have suggested during the last 10 years, especially those enabling system call analysis. All these works are used in one of the four approaches cited above. Moreover, Table 1 highlights the successes and shortcomings of each of those works.

Table 1

Comparison of HIDS in the related work

[18]. Dataset: ADFA-LD. Technique: RNN. Strength: can predict a sequence of system calls that will be executed in the future. Weakness: the false alarm rate is not discussed. Results: accuracy of 96%. Detection time: training time of 48 h, but detection time is not discussed.

[3]. Dataset: ADFA-LD, UNM. Technique: similarity-based; introduces a new similarity measure, SC4ID (Sequence Covering for Intrusion Detection), to compare a symbolic system call sequence with a set of symbolic system call sequences. Weakness: the false alarm rate is not discussed. Results: accuracy of 90%. Detection time: 900 s for the ADFA-LD dataset and 454 s for the UNM dataset.

[4]. Dataset: ADFA-LD. Technique: anomaly detection algorithm using distinct short-sequence extraction from system call traces. Strengths: detection of zero-day attacks; since it can learn quickly and gradually, it is adaptable to environmental changes without having to completely rebuild the classifier. Weaknesses: the false alarm rate needs to be decreased by improving the extraction and classification algorithms; the abnormality threshold value is determined empirically and has to be determined automatically. Results: 90.48% detection rate, 22.5% false alarm rate. Detection time: learning time of about 30 s; no detection time is discussed.

[13]. Dataset: ADFA-LD. Technique: construct embedding vectors for all system calls; model the sequences with system call embedding and weighting. Strength: the sequence embedding model presented is the first to convert system call sequences into embedding vectors to improve detection performance, and it shows good performance. Weakness: to make this model practical, the running or at least the detection time must be discussed. Results: false positive rate of 5.3%; true positive rate of 91.7%. Detection time: not discussed.

[14]. Dataset: ADFA-LD. Technique: convolutional neural network (CNN) with LSTM; a fixed-size window defines the system call sequence. Strength: a high detection rate. Weakness: the false alarm rate is not discussed. Results: accuracy of 96%. Detection time: not discussed.

A good and performant IDS, whether a HIDS or a NIDS, must be able to provide results for at least these metrics: accuracy, false alarm rate, and detection time. As seen in Table 1, most of the cited papers give results for just one or two of these metrics. In the following sections, we present the new BSA-HIDS model, which addresses all of these metrics, and we show its effectiveness compared to the recent works cited in this table.

3 Edit distance-based algorithms

There are several algorithms for calculating the distance between texts, and the computational strategy of these algorithms differs according to their views of the string. Thus, they are sorted into four categories: the first includes those based on calculating the editing distance (character by character), and the second includes using words (tokens). The third category includes those based on word sequences, and the fourth includes those based on phonetic meaning [19]. We are interested in the first two categories.

The first category of algorithms determines how many steps must be taken to transform one string into another; as the number of operations rises, the similarity between the two strings declines. The second category requires a set of tokens (words) as input rather than full strings: the more tokens the two sets share, the more similar they are to one another. We examine the following algorithms.

3.1 Hamming distance

The Hamming distance is the minimum number of substitutions required to transform the representation of string 1 into that of string 2, where a substitution replaces an element of string 1 with a new element to move closer to string 2 [20].

Let E be an alphabet of symbols and C a subset of E^n, the set of words of length n over E. Let A = (a_1, ..., a_n) and B = (b_1, ..., b_n) be words in C. The Hamming distance d(A, B) is defined as the number of places in which A and B differ, that is,

d(A, B) = |{ i : a_i ≠ b_i, i = 1, ..., n }|.

For any word w in C, the Hamming distance satisfies

d(A, B) ≥ 0, and d(A, B) = 0 if and only if A = B,

d(A, B) = d(B, A),

d(A, B) ≤ d(A, w) + d(w, B).
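As an illustration, the definition above can be sketched in a few lines of Python (the function name and the toy traces are ours, not from the paper):

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

# System call numbers treated as symbols: two traces of length 5
print(hamming_distance([6, 174, 11, 45, 33], [6, 174, 3, 45, 120]))  # 2
```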

3.2 Levenshtein distance

The number of adjustments needed to convert one string into another is counted to determine this distance. The algorithm modifies the first string to match the second using insertion, deletion, and replacement. The Levenshtein distance between two strings A and B is given by Lev_{A,B}(|A|, |B|) [21], where:

Lev_{A,B}(i, j) =
  max(i, j)                                  if min(i, j) = 0,
  min( Lev_{A,B}(i − 1, j) + 1,
       Lev_{A,B}(i, j − 1) + 1,
       Lev_{A,B}(i − 1, j − 1) + 1_(A_i ≠ B_j) )   otherwise.
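The recurrence above is usually evaluated with dynamic programming rather than recursion; a minimal Python sketch (function name ours):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```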

3.3 Jaro–Winkler distance

Two strings receive high scores from this method, if (1) they have matching characters close to one another and (2) the matching characters are in the same order. The Winkler algorithm, therefore, increases the Jaro similarity measure for equivalent initial characters.

sim_JW(A, B) = sim_J(A, B) + l · p · (1 − sim_J(A, B)),

where sim_JW(A, B) is the Jaro–Winkler similarity, sim_J(A, B) is the Jaro similarity, l is the length of the common prefix at the start of both strings (up to a maximum of 4), and p is a scaling factor. The scaling factor must not exceed 0.25; otherwise, because the prefix considered can be at most four characters long, the similarity could exceed 1. Winkler's original work used p = 0.1.

sim_J(A, B) = 0 if m = 0, and otherwise

sim_J(A, B) = (1/3) · ( m/|A| + m/|B| + (m − t)/m ),

where m is the number of matching characters and t is half the number of transpositions. Two characters from A and B are considered matching if they are identical and no farther than ⌊max(|A|, |B|)/2⌋ − 1 positions apart [19].
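For illustration only, the Jaro and Jaro–Winkler formulas can be implemented from scratch as below; this is our own sketch, not the rapidfuzz implementation used later in the experiments:

```python
def jaro(a, b):
    """Jaro similarity: 1.0 means identical, 0.0 means no matching characters."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1  # maximum distance for a match
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    m = 0  # number of matching characters
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(i + window + 1, len(b))):
            if not b_hit[j] and ca == b[j]:
                a_hit[i] = b_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # count out-of-order pairs among the matched characters
    k = 0
    trans = 0
    for i, ca in enumerate(a):
        if a_hit[i]:
            while not b_hit[k]:
                k += 1
            if ca != b[k]:
                trans += 1
            k += 1
    t = trans / 2  # t is half the number of transpositions
    return (m / len(a) + m / len(b) + (m - t) / m) / 3

def jaro_winkler(a, b, p=0.1):
    """Boost the Jaro score for a common prefix of up to 4 characters."""
    sim_j = jaro(a, b)
    l = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        l += 1
    return sim_j + l * p * (1 - sim_j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```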

3.4 Jaccard index

For this case of set similarity, the approach is to find the number of common tokens between two sets and divide it by the total number of unique tokens. It is described mathematically as follows [22]:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|),

where A and B are the two token sets produced by the user. In our prototype, we tokenize the system call sequences contained in the attack folder using a space delimiter, converting the system call numbers into tokens.
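A minimal sketch of this token-based computation, assuming whitespace-delimited system call numbers as in the prototype (function name and toy traces ours):

```python
def jaccard_similarity(s1, s2):
    """Jaccard index over whitespace-delimited tokens (system call numbers)."""
    A, B = set(s1.split()), set(s2.split())
    if not A and not B:
        return 1.0  # two empty traces are considered identical
    return len(A & B) / len(A | B)

# Two short traces written as space-separated system call numbers
print(jaccard_similarity("6 174 11 45", "6 174 3 45"))  # 0.6
```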

3.5 Dice coefficient

For this case of set similarity, the approach is to combine the two sets and look for the common tokens, then divide those by the total number of tokens.

DSC(A, B) = 2 · |A ∩ B| / (|A| + |B|).

The numerator doubles the intersection of the two token sets (which removes duplicates), based on the idea that if a token appears in both strings, its total count is twice the intersection. The denominator simply combines the tokens of both strings. Recall that the denominator of the Jaccard index is the union of the two sets; like the intersection, the union eliminates duplicates, whereas the Dice denominator does not. Dice therefore tends to overstate how similar two strings are [23].
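Correspondingly, a Dice coefficient sketch over the same kind of whitespace-delimited tokens (function name ours):

```python
def dice_coefficient(s1, s2):
    """Dice coefficient over whitespace-delimited tokens."""
    A, B = set(s1.split()), set(s2.split())
    if not A and not B:
        return 1.0  # two empty traces are considered identical
    return 2 * len(A & B) / (len(A) + len(B))

print(dice_coefficient("6 174 11 45", "6 174 3 45"))  # 0.75
```

On the same pair of token sets, Dice never falls below Jaccard, since DSC = 2J/(1 + J), which illustrates the overstatement noted above.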

4 Methodology

This section outlines the numerous steps to implement the proposed HIDS. The experimental dataset used in this work and its preprocessing is described utilizing three attack folder classifications described in Section 4.1. The BSA used in this study is described in Section 4.2.

Based on five metrics or similarity algorithms, BSA categorizes system call traces as either normal or attack data. A simplified systematic description of the method employed in the suggested study is shown in Figure 1.

Figure 1: Representation of the proposed anomaly detection system BSA-HIDS.

4.1 ADFA-LD dataset and its preprocessing

Linux dataset ADFA-LD was created by Creech and Hu [5] using an auditing tool named auditd.

It was compiled using the fully patched Ubuntu 11.04 operating system and kernel 2.6.38. Numerous services, including a web server, database server, SSH server, FTP server, etc., are being run by the operating system to capture sequences that represent attack and normal sequences.

As shown in Table 2, the ADFA-LD is divided into three distinct data folders; each folder has its system call trace files. The Training data master (TDM) folder and Validation data master (VDM) folder represent the normal data. On the other hand, Attack data master (ADM) represents attack data. ADM folder includes six other attack data types: “Adduser,” “Hydra-FTP,” “Hydra-SSH,” “JavaMeterpreter,” “Meterpreter,” and “Web shell.”

Table 2

Number of system calls traces in different categories of ADFA-LD [5]

Traces System calls
Training data 833 308,077
Validation data 4,372 2,122,085
Attack data 746 317,388
Total 5,951 2,747,550

The preprocessing in this approach consists of dividing the ADM folder into a set of groups. The other two folders, TDM and VDM, are not divided. After thoroughly studying the ADM files, we arrived at the classifications shown in Figure 2 and Tables 3 and 4.

Figure 2: Folders in classification 1 for ADM.

Table 3

Folders in classification 2 for ADM

ADM
Adduser HydraFTP Hydra-SSH Java-Meterpreter Meterpreter Web shell
Adduser-Different FTP-Different SSH-Different Java-Different Meterpreter-Different WSDifferent
Adduser961 HydraFTP961 Hydra-SSH961 Java-Meterpreter961 Meterpreter961 WS961
Adduser1371 HydraFTP1371 Hydra-SSH1371 Java-Meterpreter1371 Meterpreter1371 WS1371
Adduser1613 HydraFTP1442 Hydra-SSH1442 Java-Meterpreter1613 Meterpreter1533 WS1613
Adduser2311 HydraFTP1613 Hydra-SSH1613 Java-Meterpreter2311 Meterpreter1613 WS2311
Adduser2377 HydraFTP2311 Hydra-SSH2311 Java-Meterpreter2377 Meterpreter2293 WS2462
Adduser2462 HydraFTP2462 Hydra-SSH2424 Java-Meterpreter2462 Meterpreter2311 WS2783
Adduser2783 HydraFTP2783 Hydra-SSH2462 Java-Meterpreter2783 Meterpreter2462 WS4548
Adduser = 1 HydraFTP3523 Hydra-SSH2783 Java-Meterpreter = 1 Meterpreter2783 WS4568
HydraFTP3524 Hydra-SSH = 5738 WS4569
HydraFTP8978 WS4605
HydraFTP9300 WS4609
HydraFTP11504 WS961
HydraFTP13541 WS1371
Table 4

Folders in classification 3 for ADM

  • Different

  • 2462

  • 5738

  • 13541

  • 11504

  • 8978

  • =1

  • 2783

  • 1533

  • 2424

  • 4605

  • 9300

  • 961

  • 3523

  • 2293

  • 2377

  • 4568

  • 1442

  • 1371

  • 3524

  • 4548

  • 4609

  • 4569

  • 1613

  • 2311

Note that folder “Different” regroups all the traces that are not contained in other folders.

It was noticed that all the system call traces with the same number at the end of the file name have the same system calls but with different occurrences. Therefore, classifications 2 and 3 were proposed based on this observation.

In classification 2, we partitioned the files by each type of attack. Moreover, for each type of attack, we then partitioned it into a set of groups. Each group contains files that are similar in name, as shown in Table 3. The same principle was used to define the set of groups in classification 3. However, they were defined concerning the whole set of attacks this time. Table 4 shows this distribution.
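The grouping by trailing file-name number that underlies classifications 2 and 3 can be sketched as follows (the function name and sample file names are ours; "WS_xyz" is a hypothetical file without a trailing number, which would fall into the "Different" group):

```python
import re
from collections import defaultdict

def group_by_suffix(filenames):
    """Group trace file names by their trailing number; others go to 'Different'."""
    groups = defaultdict(list)
    for name in filenames:
        m = re.search(r"(\d+)$", name)
        groups[m.group(1) if m else "Different"].append(name)
    return dict(groups)

files = ["Adduser961", "HydraFTP961", "Adduser1371", "WS_xyz"]
print(group_by_suffix(files))
```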

4.2 Principle of detection

The BSA normal and abnormal behavior detection algorithm is based on the following principles:

  • The similarity measures are normalized in such a way that a similarity between two strings approaching 0 means the two strings are similar, while a similarity approaching 1 means they are different. In our case, this translates into the coupling and decoupling factors.

  • A trace attack that we want to test must be close to all the other trace attacks (the class to which it belongs depends on the classification chosen), which translates into a very low similarity. On the other hand, its similarity to the set of valid and training data must be very high.

  • A valid trace we want to test must be close to all the other valid and training traces, resulting in a very low similarity. But, its similarity to all attack traces must be very high.

  • The threshold by which the test is carried out is variable, relative to each trace.

Let

S be the set of all traces,

S_A be the set of all attack traces, S_A ⊆ S,

S_T be the set of all training traces, S_T ⊆ S,

S_V be the set of all validation traces, S_V ⊆ S.

We call M the set of all possible multisets of traces from S_A that can be created using Classification 1, Classification 2, and Classification 3.

m_{i,n} ∈ M is called the set of attacks number i, containing n traces.

The One-to-All Trace Similarity Algorithm (OATSA) calculates the average similarity of one trace to a set of traces.

Algorithm 1: OATSA

1. Input: x, a single trace
2. Input: X, a set of traces
3. Output: float
4. s ← 0
5. for i in range(0, |X|) do
6.   s ← s + sim(x, X[i])
7. end for
8. return s / |X|

Note that in this algorithm, sim can be any one of the similarity measures listed above (Levenshtein, Jaro–Winkler, Jaccard, Hamming, or Dice coefficient).
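In Python, OATSA is essentially an average over pairwise similarities; a sketch with an injectable similarity function (names and toy data ours):

```python
def oatsa(x, X, sim):
    """Average similarity of a single trace x to every trace in the set X."""
    return sum(sim(x, t) for t in X) / len(X)

# Toy normalized similarity on equal-length traces: fraction of differing positions
def sim_toy(a, b):
    return sum(p != q for p, q in zip(a, b)) / len(a)

traces = [[6, 174, 11], [6, 174, 3], [6, 2, 11]]
print(oatsa([6, 174, 11], traces, sim_toy))  # 2/9 ≈ 0.2222
```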

Algorithm 2: BSA

1. Input: S_A, S_T, S_V
2. Input: x, a system call trace to test, (x ∈ m_i) or (x ∈ S_V)
3. Output: a decision value, normal or anomaly
4. Calculate:
5. a ← OATSA(x, m_i)
6. b ← OATSA(x, S_T)
7. c ← OATSA(x, S_V)
8. if (a < max(b, c)) then
9.   return anomaly
10. else if (a ≥ min(b, c)) then
11.   return normal
12. end if

where max(b, c) returns the maximum of b and c, and min(b, c) returns the minimum of b and c.

Note that if a < max(b, c) is true, an attack is detected, and the true positive (TP) count increments; otherwise, the false negative (FN) count increments. In addition, if a ≥ min(b, c) is true, the true negative (TN) count increments; otherwise, the false positive (FP) count increments. The two test rules in the BSA algorithm are defined based on the concepts of inter-class distance (decoupling factor) and intra-class distance (coupling factor).

The intra-class distance corresponds to the distance between attacks placed in an attack set m i . A small intra-class distance between attacks belonging to the same set m i can be translated in this case by the presence of the same system call numbers in the traces of this set or by the almost identical sequence of these system call numbers. The notion of intra-class distance makes it possible to highlight the heterogeneity of the sets m i resulting from classifications 1, 2, and 3 in order to choose the best classification. The inter-class distance corresponds to the distance between the sets S A , S T , S V within the whole space of the system call traces S . The further apart they are, the stronger the inter-class distance will be.

Therefore, the distance between an attack and the set of all attacks must be less than the maximum of the average distance of that attack from all traces contained in the set S T and its average distance from all traces contained in the set S V . On the other hand, the distance between a normal trace and the set of all attacks must be greater than or equal to the minimum of the average distance of that normal trace from all traces contained in the set S T and its average distance from all traces contained in the set S V .
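Putting the two rules together, the decision step can be sketched as follows (function names, toy traces, and the toy similarity are ours, for illustration only, not the paper's implementation):

```python
def oatsa(x, X, sim):
    """Average similarity of trace x to the traces in X (Algorithm 1)."""
    return sum(sim(x, t) for t in X) / len(X)

def bsa(x, m_i, S_T, S_V, sim):
    """Decision rule of Algorithm 2: low distance to attacks means anomaly."""
    a = oatsa(x, m_i, sim)  # distance to the attack set of x's class
    b = oatsa(x, S_T, sim)  # distance to the training (normal) set
    c = oatsa(x, S_V, sim)  # distance to the validation (normal) set
    if a < max(b, c):       # per-trace threshold: max(b, c)
        return "anomaly"
    return "normal"         # a >= max(b, c) implies a >= min(b, c)

# Toy normalized distance: similar traces score near 0, dissimilar near 1
def sim_toy(p, q):
    u, v = set(p.split()), set(q.split())
    return 1 - len(u & v) / len(u | v)

attacks = ["6 174 11 45", "6 174 11 3"]
training = ["102 4 5 221", "102 4 5 6"]
validation = ["102 4 221 240"]
print(bsa("6 174 11 45", attacks, training, validation, sim_toy))  # anomaly
```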

5 Experiments and results

5.1 Experimental environment

Experiments used Python under VMware with Ubuntu 20.04.5 LTS Linux 64-bit, 24 GB of memory, and a 10-core processor. The Levenshtein and Jaro–Winkler algorithms are from the Python rapidfuzz library, and the other three algorithms are from the Python textdistance library. All these algorithms are normalized so that a value approaching 0 means that the sequences are similar, and a value approaching 1 means that the sequences are not similar.

5.2 Evaluation metrics

With the use of the sub-metrics TPs, TNs, FPs, and FNs, we assessed and examined each classification’s performance using a confusion matrix, accuracy, FNR, FPR, precision, recall, and F1-score. For completeness, these measures are defined as follows:

Confusion matrix: Demonstrates how many accurate and inaccurate predictions a model made. It considers all factors and can visually display results for each factor, making it a common evaluation tool, especially when attempting to comprehend and enhance an algorithm’s performance.

Accuracy: The proportion of correct classification predictions among all predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

False negative rate (FNR): The proportion of actual positive examples for which the model incorrectly predicted the negative class.

FNR = FN / (FN + TP)

False positive rate (FPR): The proportion of actual negative examples for which the model incorrectly predicted the positive class.

FPR = FP / (FP + TN)

Precision: The percentage of correct predictions among those where the model predicted the positive class.

Precision = TP / (TP + FP)

Recall: The percentage of real attack instances covered by the model.

Recall = TP / (TP + FN)

F1-score: Combines the precision and recall metrics into a single metric.

F1-score = (2 · Precision · Recall) / (Precision + Recall)

The metrics provided above directly indicate each classifier’s performance.
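These ratios are straightforward to compute from the four confusion-matrix counts; a small helper for sanity-checking reported numbers (the function name and the counts below are ours, hypothetical, not from the paper's experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "fnr": fn / (fn + tp),
        "fpr": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for illustration
m = classification_metrics(tp=95, tn=880, fp=0, fn=5)
print(round(m["accuracy"], 3), round(m["fnr"], 3))  # 0.995 0.05
```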

5.3 Results and discussion

To avoid presenting too many graphs, the results in this section are grouped by classifications 1, 2, and 3.

All distances used are normalized as follows: Let S 1 , S 2 be two system call sequences.

Sim_Levenshtein(S1, S2) = rapidfuzz.string_metric.levenshtein(S1, S2) / max(|S1|, |S2|),

Sim_JaroWinkler(S1, S2) = (100 − rapidfuzz.string_metric.jaro_winkler_similarity(S1, S2)) / 100,

Sim_Jaccard(S1, S2) = 1 − textdistance.jaccard.normalized_similarity(S1.split(), S2.split()),

Sim_Hamming(S1, S2) = 1 − textdistance.hamming.normalized_similarity(S1, S2),

Sim_DiceCoefficient(S1, S2) = 1 − textdistance.sorensen(S1.split(), S2.split()),

where

Sim Levenshtein is the similarity using Levenshtein algorithm,

Sim Jaro Winkler is the similarity using Jaro–Winkler,

Sim Jaccard is the similarity using Jaccard,

Sim Hamming is the similarity using Hamming,

Sim Dice Coefficient is the similarity using Dice coefficient,

Rapidfuzz and textdistance are libraries in Python.

Tables 5–9 and Figures 3–7 show the performance in terms of accuracy, FNR, FPR, recall, precision, and F1-score for all three classifications under the Levenshtein, Jaro–Winkler, Hamming, Jaccard, and Dice coefficient measures. For the first three similarity measures, classification 2 outperforms the other two (1 and 3). For the last two similarity measures, however, the algorithms perform better with classification 3.

Table 5

Levenshtein result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.97 0.15 0 1 0.84 0.91
Classification 2 0.99 0.04 0 1 0.95 0.97
Classification 3 0.98 0.06 0 1 0.93 0.96

The bold values indicate the line of the classification that gave good results.

Table 6

Jaro–Winkler result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.92 0.1 0.06 0.69 0.84 0.76
Classification 2 0.94 0.04 0.06 0.72 0.95 0.82
Classification 3 0.93 0.05 0.06 0.71 0.94 0.81

The bold values indicate the line of the classification that gave good results.

Table 7

Jaccard result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.97 0.11 0.01 0.92 0.88 0.90
Classification 2 0.98 0.02 0.01 0.93 0.97 0.95
Classification 3 0.98 0.01 0.01 0.93 0.98 0.95

The bold values indicate the line of the classification that gave good results.

Table 8

Hamming result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.97 0.11 0.009 0.93 0.88 0.91
Classification 2 0.98 0.04 0.009 0.94 0.95 0.95
Classification 3 0.97 0.09 0.009 0.94 0.90 0.92

The bold values indicate the line of the classification that gave good results.

Table 9

Dice coefficient result on all three classifications

Accuracy FNR FPR Precision Recall F1-score
Classification 1 0.89 0.13 0.10 0.59 0.86 0.70
Classification 2 0.90 0.04 0.10 0.61 0.95 0.74
Classification 3 0.91 0.01 0.10 0.62 0.98 0.76

The bold values indicate the line of the classification that gave good results.

Figure 3 
                  Levenshtein results on three classifications.
Figure 3

Levenshtein results on three classifications.

Figure 4 
                  Jaro–Winkler results on three classifications.
Figure 4

Jaro–Winkler results on three classifications.

Figure 5 
                  Jaccard results on three classifications.
Figure 5

Jaccard results on three classifications.

Figure 6 
                  Hamming results on three classifications.
Figure 6

Hamming results on three classifications.

Figure 7 
                  Dice coefficient results on three classifications.
Figure 7

Dice coefficient results on three classifications.

We notice that the folder named “Different” in each classification gives a signifying number of false negatives. Let us take, for example, this folder in classification 2 for the JavaMeterpreter attack and using the Levenshtein algorithm, the curve is shown in Figure 8.

Figure 8 
                  Intrusion test curve for file named “Different.”
Figure 8

Intrusion test curve for file named “Different.”

It can be seen from Figure 8 that the attack numbers (5, 10, 15, 20, 25, 26, 31, 42, 43, 44) are poorly detected. This is due to their textual distances, which are close to both the training and validation class and far from the attack class.

Another thing that draws our attention is that if we define a single threshold to classify the system call sequences, we will have a very high false alarm rate. Indeed, let us take the example of Figure 8. If we define a threshold of 0.5, i.e., sequences with a textual distance below 0.5 are considered as attack sequences, and those above 0.5 are considered normal sequences. Such a definition will consider all 56 sequences shown in Figure 8 as normal sequences, which are not. Thus, here we emphasize the strength of the model, which is the variable definition of the threshold that fits all system sequences.

Table 10

Comparison of the five similarity metrics on classification 2

Accuracy FNR FPR Precision Recall F1-score
Levenshtein 0.99 0.04 0 1 0.95 0.97
Jaro–Winkler 0.94 0.04 0.06 0.72 0.95 0.82
Jaccard 0.98 0.02 0.01 0.93 0.97 0.95
Hamming 0.98 0.04 0.009 0.94 0.95 0.95
Dice coefficient 0.90 0.04 0.10 0.61 0.95 0.74

The bold values indicate the best result for each metric (Accuracy, FNR, FPR, Precision, Recall, F1-Score and time).

Figure 9 and Table 10 shows a comparison according to classification 2 for the five similarity measures used. Jaro–Winkler/Dice coefficient gives the same recall value of 0.95 and the same FNR of 0.04. However, the Jaro–Winkler algorithm performs better, it gives a high value: accuracy of 0.94, precision of 0.72, and F1-score of 0.82 compared to Dice coefficient, which gives 0.90, 0.61, and 0.74 respectively. Jaccard/Hamming gives the same accuracy, 0.98, same F1-score, 0.95; however, the false positives of the Jaccard algorithm are greater than those of the Hamming algorithm. The last algorithm, Levenshtein, gives outstanding performance of 0.99 in terms of accuracy, 0.04 FNR, 0 FPR, 1 precision, 0.95 recall, and 0.97 F1-score. These excellent results are obtained because this algorithm processes the system call sequences to measure similarity.

Figure 9 
                  Comparison of the five similarity metrics on classification 2.
Figure 9

Comparison of the five similarity metrics on classification 2.

Before selecting the best method and classification, another parameter is taken into consideration, the elapsed time to implement the HIDS. Each algorithm’s running times in seconds are displayed in the following table:

Table 11 represents the running time for each algorithm. We notice that Jaro–Winkler is the fastest one with 720 s, Dice coefficient with 2,940 s, Jaccard algorithm with 3,060 s, Levenshtein algorithm with 3,300 s, and finally, Hamming took a significant time processing with 14,880 s. Jaro–Winkler’s speed is due to the implementation of the rapidfuzz library, which executes this algorithm in 0.00094 s and Levenshtein in 0.00312 s. Therefore, the long time to execute the Hamming algorithm is due to its implementation in the textdistance library, which takes about 0.03531 s.

Table 11

Elapsed time for each algorithm

Levenshtein (s) Jaro–Winkler (s) Jaccard (s) Hamming (s) Dice coefficient (s)
Classification 2 3,300 720 3,060 14,880 2,940

It should be noted that the version of Levenshtein’s algorithm described in the study [24] gives an accuracy of 1, FNR of 0, FPR of 0, Precision, Recall, and F1-score of 1, but the implementation time is very long, about 1 month and 15 days with the capabilities of the virtual environment described above.

From the confusion matrix in Figure 10, it can be seen in more detail that Jaccard, Hamming, and Levenshtein algorithms showed almost the same high-performance level and displayed a similar trend regarding correct and incorrect classifications. However, Jaro–Winkler and Dice coefficient showed a lower performance value than others. While having a generally higher performance in terms of elapsed time, Jaro–Winkler does have the second-highest score for FN and FP. Although this rating is not as bad as an FP rating, it is still high relative to the other algorithms, such as Levenshtein, Jaccard, Hamming, and Dice coefficient, which shows FN/FP of 30/0, 21/51, 31/43, and 35/442 respectively.

Figure 10 
                  Confusion matrix for all algorithm/classification 2.
Figure 10

Confusion matrix for all algorithm/classification 2.

The thing that caught our attention was the number of false negatives obtained by each algorithm, which ranged from 21 to 35. These values constitute the number of undetected attack sequences. This means that these attack sequences belong to the normal behavior (of the 833 TDM sequences). In this case, attack sequences may be identical to normal data sequences, and the model could not detect them. Instead, we use the occasion to highlight that if we eliminate these sequences from the ADM, the model becomes very efficient.

The most important thing to notice in this work is that all the described similarity measures give almost similar performances with different running times. This is due to how the presence of an attack is tested and, more precisely, how the threshold is defined. This is described in the BSA algorithm of this model.

To evaluate the proposed model, it is imperative to test it with other models which fall in the same field. From Table 12, all these models use ADFA-LD as a benchmark. We notice that BSA-HIDS (Jaro–Winkler) and BSA-HIDS (Levenshtein) give a high performance than that in studies [3,4] in terms of accuracy and false alarm rate, 94% accuracy, 5% FAR and 99% accuracy, 2% FAR using just 720 and 3,300 s, respectively. The result achieved by Marteau [3] was 90% accuracy in 900 seconds, and those achieved by Yaqoob and Madkour [4] were 90% accuracy and 22% FAR in seconds, but the number of seconds is unclear.

Table 12

Comparison of proposed HIDS with HIDS of the related work

Dataset Accuracy FAR Time (s)
BSA-HIDS (Jaro–Winkler) ADFA-LD 0.94 0.05 720
BSA-HIDS (Levenshtein) ADFA-LD 0.99 0.02 3,300
[3] ADFA-LD 0.90 / 900
[4] ADFA-LD 0.90 0.22 /

The bold values indicate the best result for each metric (Accuracy, FNR, FPR, Precision, Recall, F1-Score and time).

The proposed model has been developed aiming to have a performant HIDS, which is achieved and displayed in Table 12. The obtained results of the proposed system, BSA-HIDS, are superior to all up-to-date published systems in terms of accuracy, false alarm rate, and detection time. Although this model produced encouraging results, it does have limitations. The model cannot detect attack sequences that are an exact sequence of normal sequences.

6 Conclusion and perspectives

In this work, to identify unusual system call sequences, we have designed and implemented BSA-HIDS, a novel algorithm based on the sequence similarity measure. We used five similarity measures to test our model and choose the best one that performs well. The use case determines which string similarity algorithm is chosen. To generate the similarity score, all the algorithms mentioned earlier, in one way or another, seek to identify the common and uncommon strings’ components. Comparing our model to the most recent models, the following are its main advantages:

  • Its simplicity to implement; no definition for window size, no maximal length for the n-grams, and no hidden architectures (LSTM, HMM, and CNN).

  • Can easily be used for online exploitation.

  • Can detect zero-day attacks.

  • Real-time anomaly HIDS.

  • The threshold definition takes each system call sequence into account.

  • The proposed system provides the best combination of a high detection rate and very small running time.

The observed accuracy is significantly higher compared to all recent systems. Additionally, the suggested model offers the ideal fusion of rapid response (running time) and high detection rate. Because of the definition of threshold, it has a high ability to recognize the zero-day attack and is flexible enough to react to environmental changes.

We have identified a shortcoming of the BSA-HIDS model: it needs to distinguish between sequences that are exact sequences of the training set TDM. However, any alternative approach would need to handle the circumstance correctly.

Yet, certain shortcomings in the suggested HIDS still need to be considered for future work. The BSA algorithm need to be improved to lower the false alarm rate. As part of future work, we plan to test the model’s adeptness on other datasets like UNM and NSL-KDD. We aim to localize and delete attack sequences that are the same as normal sequences to test the model’s efficiency.

Finally, to optimize the present work, first, we can define a new similarity algorithm. Effectively, we will rewrite the Levenshtein algorithm to take into account words and not characters; this way, the execution time can be reduced. Indeed, since the high system call number consists of three digits, in the best case, the execution time can be reduced to 1,100 seconds (3,300/3 = 1,100 s). Second, we can minimize the number of files in each classification to minimize detection time.

  1. Conflict of interest: Authors state no conflict of interest.

References

[1] Finnerty K, Fullick S, Motha H, Shah JN, Button M, Wang V. Cyber security breaches survey. England, United Kingdom: University of Portsmouth Ageing Network; 2019.10.1016/S1353-4858(19)30044-3Search in Google Scholar

[2] Huma ZE, Latif S, Ahmad J, Idrees Z, Ibrar A, Zou Z, et al. A hybrid deep random neural network for cyberattack detection in the industrial internet of things. IEEE Access. 2021;9:55595–605.10.1109/ACCESS.2021.3071766Search in Google Scholar

[3] Marteau P.-F. Sequence covering for efficient host-based intrusion detection. IEEE Trans Inf Forensics Secur. 2019;14(4):994–1006. 10.1109/tifs.2018.2868614.Search in Google Scholar

[4] Yaqoob SI, Madkour MAI. Enhanced host-based intrusion detection using system call traces. J King Abdulaziz Univ Comput Inf Technol Sci. 2019;8(2):93–109. 1440 A.H./2019 A.D. 10.4197/Comp.8-2.7.Search in Google Scholar

[5] Creech G, Hu J. A semantic approach to host-based intrusion detection systems using contiguous and discontiguous system call patterns. IEEE Trans Comput. April 2014;63(4):807–19.10.1109/TC.2013.13Search in Google Scholar

[6] Pavithran P, Mathew S, Namasudra S, Srivastava G. A novel cryptosystem based on DNA cryptography hyperchaotic systems and a randomly generated Moore machine for cyber physical systems. Comput Commun. 2022;188:1–12. ISSN 0140-3664. 10.1016/j.comcom.2022.02.008.Search in Google Scholar

[7] Namasudra S. A secure cryptosystem using DNA cryptography and DNA steganography for the cloud-based IoT infrastructure. Comput Electr Eng. 2022;104(Part A):108426. ISSN 0045-7906. 10.1016/j.compeleceng.2022.108426 Search in Google Scholar

[8] Das S, Namasudra S. MACPABE: Multi‐Authority‐based CP‐ABE with efficient attribute revocation for IoT‐enabled healthcare infrastructure. Int J Netw Manag. April 2022. 10.1002/nem.2200.Search in Google Scholar

[9] Namasudra S, Crespo RG, Kumar SAP. Introduction to the special section on advances of machine learning in cybersecurity (VSI-mlsec). Comput Electr Eng. May 2022;100:108048. 10.1016/j.compeleceng.2022.108048.Search in Google Scholar

[10] Sarkar M, Saha K, Namasudra S, Roy P. An efficient and time saving web service based android application. Proj: Android Project NIC. August 2015.Search in Google Scholar

[11] Kumari S, Yadav RJ, Namasudra S, Hsu C-H. Intelligent deception techniques against adversarial attack on the industrial system. Int J Intell Syst. May 2021;36(5):2412–37.10.1002/int.22384Search in Google Scholar

[12] Liu M, Xue Z, Xu X, Zhong C, Chen J. Host-based intrusion detection system with system calls: Review and future trends. ACM Comput Surv. Nov 2018;51(5):1–36.10.1145/3214304Search in Google Scholar

[13] Lu Y, Teng S. Application of sequence embedding in host-based intrusion detection system. IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD); 2021.10.1109/CSCWD49262.2021.9437683Search in Google Scholar

[14] Frances O, Briana W. Deep learning-based hybrid model for efficient anomaly detection. Int J Adv Comput Sci Appl. 2022; 13(4):975–9.10.14569/IJACSA.2022.01304111Search in Google Scholar

[15] Zhang Y, Luo S, Pan L, Zhang H. Syscall-BSEM: Behavioral semantics enhancement method of system call sequence for high accurate and robust host intrusion detection. Future Gener Comput Syst. 2021;125:112–26. ISSN 0167-739X.10.1016/j.future.2021.06.030Search in Google Scholar

[16] Ouarda L, Malika B, Yousfi NE, Brahim B. Improving the efficiency of intrusion detection in information systems. J Intell Syst. 2022;31(1):835–54. 10.1515/jisys-2022-0059.Search in Google Scholar

[17] Kim J, Kim J, Le Thi Thu H, Kim H. Long short term memory recurrent neural network classifier for intrusion detection. International Conference on Platform Technology and Service; Feb 2016. p. 1–5.10.1109/PlatCon.2016.7456805Search in Google Scholar

[18] Lv S, Wang J, Yang Y, Liu J. Intrusion prediction with system-call sequence-to-sequence model. IEEE Access. 2018;6:71413–21. 10.1109/access.2018.2881561.Search in Google Scholar

[19] Yulianto MA, Nurhasanah N. The hybrid of Jaro-Winkler and Rabin-Karp algorithm in detecting Indonesian text similarity. J Online Inform. June 2021;6(1):88–95.10.15575/join.v6i1.640Search in Google Scholar

[20] Trouvilliez B. Textual data similarity for short opinion text learning and product search, Thesis. To obtain the degree of doctor of the University of Artois. Defended on May 13, 2013.Search in Google Scholar

[21] Logan R, Fleischmann Z, Annis S, Wehe AW, Tilly JL, Woods DC, et al. 3GOLD: Optimized Levenshtein distance for clustering third‑generation sequencing data. BMC Bioinforma. 2022;95:23.10.1186/s12859-022-04637-7Search in Google Scholar PubMed PubMed Central

[22] da Fontoura Costa L. Further Generalizations of the Jaccard Index. arXiv 2021, https://arxiv.org/abs/2110.09619.Search in Google Scholar

[23] Carass A, Roy S, Gherman A, Reinhold JC, Jesson A, Arbel T, et al. Evaluating white matter lesion segmentations with refined sørensen-dice analysis. Sci Rep. 2020;10(1):8242.10.1038/s41598-020-64803-wSearch in Google Scholar PubMed PubMed Central

[24] https://en.wikipedia.org/wiki/Levenshtein_distance.Search in Google Scholar

Received: 2022-11-07
Revised: 2022-12-11
Accepted: 2022-12-12
Published Online: 2023-04-10

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.

Articles in the same Issue

  1. Research Articles
  2. Salp swarm and gray wolf optimizer for improving the efficiency of power supply network in radial distribution systems
  3. Deep learning in distributed denial-of-service attacks detection method for Internet of Things networks
  4. On numerical characterizations of the topological reduction of incomplete information systems based on evidence theory
  5. A novel deep learning-based brain tumor detection using the Bagging ensemble with K-nearest neighbor
  6. Detecting biased user-product ratings for online products using opinion mining
  7. Evaluation and analysis of teaching quality of university teachers using machine learning algorithms
  8. Efficient mutual authentication using Kerberos for resource constraint smart meter in advanced metering infrastructure
  9. Recognition of English speech – using a deep learning algorithm
  10. A new method for writer identification based on historical documents
  11. Intelligent gloves: An IT intervention for deaf-mute people
  12. Reinforcement learning with Gaussian process regression using variational free energy
  13. Anti-leakage method of network sensitive information data based on homomorphic encryption
  14. An intelligent algorithm for fast machine translation of long English sentences
  15. A lattice-transformer-graph deep learning model for Chinese named entity recognition
  16. Robot indoor navigation point cloud map generation algorithm based on visual sensing
  17. Towards a better similarity algorithm for host-based intrusion detection system
  18. A multiorder feature tracking and explanation strategy for explainable deep learning
  19. Application study of ant colony algorithm for network data transmission path scheduling optimization
  20. Data analysis with performance and privacy enhanced classification
  21. Motion vector steganography algorithm of sports training video integrating with artificial bee colony algorithm and human-centered AI for web applications
  22. Multi-sensor remote sensing image alignment based on fast algorithms
  23. Replay attack detection based on deformable convolutional neural network and temporal-frequency attention model
  24. Validation of machine learning ridge regression models using Monte Carlo, bootstrap, and variations in cross-validation
  25. Computer technology of multisensor data fusion based on FWA–BP network
  26. Application of adaptive improved DE algorithm based on multi-angle search rotation crossover strategy in multi-circuit testing optimization
  27. HWCD: A hybrid approach for image compression using wavelet, encryption using confusion, and decryption using diffusion scheme
  28. Environmental landscape design and planning system based on computer vision and deep learning
  29. Wireless sensor node localization algorithm combined with PSO-DFP
  30. Development of a digital employee rating evaluation system (DERES) based on machine learning algorithms and 360-degree method
  31. A BiLSTM-attention-based point-of-interest recommendation algorithm
  32. Development and research of deep neural network fusion computer vision technology
  33. Face recognition of remote monitoring under the Ipv6 protocol technology of Internet of Things architecture
  34. Research on the center extraction algorithm of structured light fringe based on an improved gray gravity center method
  35. Anomaly detection for maritime navigation based on probability density function of error of reconstruction
  36. A novel hybrid CNN-LSTM approach for assessing StackOverflow post quality
  37. Integrating k-means clustering algorithm for the symbiotic relationship of aesthetic community spatial science
  38. Improved kernel density peaks clustering for plant image segmentation applications
  39. Biomedical event extraction using pre-trained SciBERT
  40. Sentiment analysis method of consumer comment text based on BERT and hierarchical attention in e-commerce big data environment
  41. An intelligent decision methodology for triangular Pythagorean fuzzy MADM and applications to college English teaching quality evaluation
  42. Ensemble of explainable artificial intelligence predictions through discriminate regions: A model to identify COVID-19 from chest X-ray images
  43. Image feature extraction algorithm based on visual information
  44. Optimizing genetic prediction: Define-by-run DL approach in DNA sequencing
  45. Study on recognition and classification of English accents using deep learning algorithms
  46. Review Articles
  47. Dimensions of artificial intelligence techniques, blockchain, and cyber security in the Internet of medical things: Opportunities, challenges, and future directions
  48. A systematic literature review of undiscovered vulnerabilities and tools in smart contract technology
  49. Special Issue: Trustworthy Artificial Intelligence for Big Data-Driven Research Applications based on Internet of Everythings
  50. Deep learning for content-based image retrieval in FHE algorithms
  51. Improving binary crow search algorithm for feature selection
  52. Enhancement of K-means clustering in big data based on equilibrium optimizer algorithm
  53. A study on predicting crime rates through machine learning and data mining using text
  54. Deep learning models for multilabel ECG abnormalities classification: A comparative study using TPE optimization
  55. Predicting medicine demand using deep learning techniques: A review
  56. A novel distance vector hop localization method for wireless sensor networks
  57. Development of an intelligent controller for sports training system based on FPGA
  58. Analyzing SQL payloads using logistic regression in a big data environment
  59. Classifying cuneiform symbols using machine learning algorithms with unigram features on a balanced dataset
  60. Waste material classification using performance evaluation of deep learning models
  61. A deep neural network model for paternity testing based on 15-loci STR for Iraqi families
  62. AttentionPose: Attention-driven end-to-end model for precise 6D pose estimation
  63. The impact of innovation and digitalization on the quality of higher education: A study of selected universities in Uzbekistan
  64. A transfer learning approach for the classification of liver cancer
  65. Review of iris segmentation and recognition using deep learning to improve biometric application
  66. Special Issue: Intelligent Robotics for Smart Cities
  67. Accurate and real-time object detection in crowded indoor spaces based on the fusion of DBSCAN algorithm and improved YOLOv4-tiny network
  68. CMOR motion planning and accuracy control for heavy-duty robots
  69. Smart robots’ virus defense using data mining technology
  70. Broadcast speech recognition and control system based on Internet of Things sensors for smart cities
  71. Special Issue on International Conference on Computing Communication & Informatics 2022
  72. Intelligent control system for industrial robots based on multi-source data fusion
  73. Construction pit deformation measurement technology based on neural network algorithm
  74. Intelligent financial decision support system based on big data
  75. Design model-free adaptive PID controller based on lazy learning algorithm
  76. Intelligent medical IoT health monitoring system based on VR and wearable devices
  77. Feature extraction algorithm of anti-jamming cyclic frequency of electronic communication signal
  78. Intelligent auditing techniques for enterprise finance
  79. Improvement of predictive control algorithm based on fuzzy fractional order PID
  80. Multilevel thresholding image segmentation algorithm based on Mumford–Shah model
  81. Special Issue: Current IoT Trends, Issues, and Future Potential Using AI & Machine Learning Techniques
  82. Automatic adaptive weighted fusion of features-based approach for plant disease identification
  83. A multi-crop disease identification approach based on residual attention learning
  84. Aspect-based sentiment analysis on multi-domain reviews through word embedding
  85. RES-KELM fusion model based on non-iterative deterministic learning classifier for classification of Covid19 chest X-ray images
  86. A review of small object and movement detection based loss function and optimized technique
Downloaded on 12.5.2026 from https://www.degruyterbrill.com/document/doi/10.1515/jisys-2022-0259/html?lang=en
Scroll to top button