Framework for identifying network attacks through packet inspection using machine learning

Ravi Shanker; Prateek Agrawal; Aman Singh; Mohammed Wasim Bhatt

doi:10.1515/nleng-2022-0297

Article, Open Access


Published/Copyright: July 12, 2023


From the journal Nonlinear Engineering, Volume 12, Issue 1

Abstract

Traffic anomaly detection is an essential field of study in every network. Communication systems involve a wide variety of protocols and intrusions, and achieving high precision while raising the correct distribution ratio is still an open problem. Many authors have worked on algorithms such as simple classification, K-Means, genetic algorithms, and Support Vector Machine approaches, and have reported the efficiency and accuracy of these algorithms. In this article, we propose a feature extraction technique based on “k-means clustering,” which has its roots in signal processing and partitions a set of n observations into k clusters, with each observation assigned to the cluster with the closest mean. The K-Means method is implemented and applied in this study using Python and the KDDcup99 dataset. The results indicate that the proposed work is efficient in relation to other widely available alternatives. In addition to the applied method, a web-based framework is designed that can inspect actual network traffic packets to identify network attacks. Instead of using a static file to test for network attacks, the web page-based solution uses a database to collect and test the information. The proposed work thus provides real-time packet inspection for identifying new attacks.

Keywords: anomaly; network layer; packets; DoS; IDS; attacks; machine learning; KDDcup99; KNN; K-Means.

1 Introduction

Misuse detection depends primarily on patterns that have been created in advance, whereas anomaly detection can identify something that is still being discovered or that has not yet been identified. It is essential to bear in mind that anomaly detection cannot identify suspicious or potentially suspicious behavior; only the recognition of an intruder can do this [1]. We can state that an intentionally observed phenomenon is independent of both the environment and norms and regulations. Anomaly detection is built on two assumptions. The first is that network traffic under typical conditions has identifiable features, so that parameters may be used to construct a representation of these normal conditions. The second is that deviations from this normal model are uncommon and may be caused by disruptive behavior. Both assumptions are based on what has been published in the field literature.

1.1 Anomaly detection

As a technique, anomaly detection can be separated into two steps. The first step involves developing a model of normal network traffic. This model may be generated by model generation algorithms or mathematical simulations, or learned from training data. In the second step, traffic is monitored for deviations from the normal model. The model of normal network traffic is built from features of the traffic. In the context of anomaly detection, a feature is a value or symbol that represents network traffic. These features should reflect the behavior and characteristics of the traffic while containing no redundant information, so that the model stays as light as possible [2].

1.2 Statistical-based anomaly detection

In a statistics-based approach, deviations are found using statistical methods. Mathematical methods are used to create models of historical behavior, and any discrepancies between these models and the observed situation are categorized as anomalies. Once a deviation is detected, its severity is measured and graded; the higher the grade, the more severe the anomaly. For example, the number of times a user usually accesses the network is compared with the current count. If the current number of network accesses is one or two more than usual, the situation is not necessarily severe. If the number is, for instance, 100 times greater than usual, the anomaly is significant. In general, the outcome depends on how the grading rules are written [3].
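The grading idea can be illustrated with a minimal Python sketch. It assumes that the historical model is simply the mean and standard deviation of past access counts and that severity is graded by z-score; the function name `severity_grade`, the thresholds, and the sample counts are hypothetical and are not taken from the article.

```python
from statistics import mean, stdev

def severity_grade(history, current, thresholds=(2.0, 3.0, 10.0)):
    """Grade how far `current` deviates from the historical access counts.

    Returns 0 for normal behaviour and a higher integer for larger deviations.
    The z-score thresholds are illustrative assumptions, not values from the article.
    """
    mu = mean(history)
    sigma = stdev(history) or 1.0  # fall back to 1.0 when the history is constant
    z = abs(current - mu) / sigma
    # Count how many thresholds the deviation reaches.
    return sum(z >= t for t in thresholds)

# Hypothetical per-day network access counts for one user.
daily_accesses = [10, 12, 11, 9, 10, 11, 12, 10]

slightly_high = severity_grade(daily_accesses, 12)    # one or two more than usual
hundredfold = severity_grade(daily_accesses, 1000)    # roughly 100 times the usual count
print(slightly_high, hundredfold)
```

With these assumed thresholds, a count slightly above the usual level stays at the lowest grade, while a count about 100 times higher reaches the highest grade. How the grades are defined remains a design choice of whoever writes the grading rules.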

1.3 Rule modeling-based anomaly detection

In a rule modeling-based approach, the system’s rules are specified, and instances that violate them are labeled as anomalies [4]. In principle, this is similar to how firewalls work. A firewall checks network traffic against predefined rules. Traffic that does not violate these rules is permitted to proceed, and anything that violates them is discarded. In anomaly detection, this means that anything that deviates from the norm is treated as an exception [5].
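A rule check of this kind might be sketched as follows. The rule names, the allowed ports, and the packet fields are hypothetical and chosen only for illustration; a real rule set would be far larger.

```python
# Hypothetical rule set: each rule is a predicate over a packet's header fields.
RULES = {
    "allowed_protocol": lambda p: p["protocol"] in {"tcp", "udp", "icmp"},
    "allowed_port": lambda p: p["dst_port"] in {22, 80, 443},
    "sane_length": lambda p: 0 < p["length"] <= 65535,
}

def violated_rules(packet):
    """Return the names of all rules that the packet violates."""
    return [name for name, rule in RULES.items() if not rule(packet)]

def is_anomaly(packet):
    """A packet is labeled an anomaly as soon as it breaks at least one rule."""
    return bool(violated_rules(packet))

ok_packet = {"protocol": "tcp", "dst_port": 443, "length": 1500}
bad_packet = {"protocol": "tcp", "dst_port": 23, "length": 1500}

print(is_anomaly(ok_packet), violated_rules(bad_packet))
```

Keeping each rule as a named predicate makes it possible to report which rule was violated, not only that a violation occurred.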

To summarize, the objective of the article is to increase the correct distribution ratio while achieving high precision, which remains an open research area. Numerous authors have investigated algorithms including basic classification, K-Means, genetic algorithms, and Support Vector Machine (SVM) techniques, and have demonstrated the effectiveness and accuracy of these algorithms. In this study, we introduce a feature extraction method based on “k-means clustering.” This method, which has its roots in signal processing, divides a set of n observations into k clusters, with each observation assigned to the cluster with the closest mean. The study uses Python and the KDDcup99 dataset to implement and evaluate the method. The results show how well the proposed work performs in relation to other available alternatives.

Organization: The article is structured into many modules, where the first section states the introduction, followed by Section 2, which identifies network layer attacks. Section 3 states the literature survey, followed by Section 4, which discusses the problem definition. Section 5 states the proposed algorithm. The penultimate section of the manuscript discusses the result analysis, and the ultimate section is the conclusion and the future work section.

2 Identify network layer attacks

Anomaly detection models are built using a machine learning approach based on previous behavior. The learning algorithm examines previously observed datasets, e.g., datasets containing network traffic, to develop a model of normal behavior [6]. Once the learning phase is over, the detector uses the newly developed model to check for deviations. For example, if a new application is deployed to devices on the local network and generates previously unseen traffic, a machine learning-based detector is likely to react to this traffic [7]. The functions of an IDS are depicted in Figure 1.

Figure 1

IDS functions.

2.1 Data collection

This module is responsible for delivering data to the intrusion detection system (IDS). The data are recorded in a log and then stored. For example, a network-based intrusion detection system (NIDS) keeps track of how the network is used and how it is implemented.

2.2 Feature selection

A large amount of data is available in the network, and it is usually evaluated for intrusion in order to select the relevant features. For example, the Internet Protocol (IP) addresses of the source and target systems, the protocol type, the header length, and the packet size can all be used as keys for intrusion detection [8].

2.3 Analysis

The data are analyzed to determine their correctness. Rule-based IDSs analyze the data by inspecting incoming traffic against a predefined signature or pattern. Anomaly-based IDS is another approach, which uses mathematical models to analyze the behavior of the system.

2.4 Action

This function concerns the system’s response to a destructive and intolerable attack. Depending on what is required, it either raises an alert, e.g., by sending an email or displaying an alert icon, or actively intervenes in the system by dropping packets so that they do not reach it or by closing ports. Intrusions can be defined as a series of actions that attempt to compromise the integrity, confidentiality, and availability of resources on a system [9].

Integrity: Data integrity refers to making sure that the data have not been tampered with, either in transit or in storage. Monitoring network terminals and servers, controlling their physical environments, restricting access to their data, and enforcing robust authentication procedures are all steps taken to ensure maximum privacy.
Confidentiality: Confidentiality means that records have not been accessed or disclosed by unauthorized users.
Availability: Availability ensures that systems are sufficiently robust and accessible to users (e.g., when clients require them). Denial of service, on the other hand, occurs when clients are unable to obtain the resources they need in a timely manner.

3 Literature survey

According to Siraj et al., security audit trails play a vital role in a computer system’s security program [10]. The aim of this investigation was to improve the computer security auditing and surveillance capabilities of the client’s systems. Anderson proposed that an impostor can be distinguished from a legitimate user by identifying deviations from established usage patterns. Denning introduced a model that processes audit records with a rule-based pattern-matching system. The model includes profiles representing normal behavior, together with rules for any significant deviation from normal behavior, referred to as an abnormal pattern. The generated audit records are matched against the established rules and reviewed for suspicious activity.

The author has suggested a genetic algorithm-based intrusion detection method [11]. First, principal component analysis (PCA) was used to extract the most important features. A genetic algorithm was then used to generate rules from the individuals with the highest fitness values in each generation. The generated rules were used to separate intrusions from normal connections in the test data.

The study proposed a fuzzy clustering artificial neural network (FC-ANN) to improve the accuracy and precision rate. Using fuzzy clustering, the author divided the training data into different subsets. An artificial neural network (ANN) was then trained on each subset of the training data. After each subset had been learned, a fuzzy aggregation module was used to combine the results of the various ANNs. The proposed model was shown to perform better than a back-propagation neural network and Naïve Bayes [12].

Yao et al. have suggested a host-based IDS that uses fuzzy clustering to detect changes in the hardware profile. To test the accuracy and performance of the system, they used the system performance log. They selected influential features and created fuzzy IF-THEN rules with the aid of a deviation methodology [13]. The Mamdani inference method was used to determine the precise behavior of the generated system log. Baghdad used five types of neural network (NN) to determine which NN classifies attacks well and yields a higher detection rate for each attack. The five NNs were the multilayer perceptron (MLP), generalized feed-forward (GFF), radial basis function (RBF), self-organizing feature map (SOFM), and principal component analysis (PCA) networks. In the multiclass scenario (denial-of-service [DoS], user-to-root [U2R], remote-to-local [R2L], and Probe), he noted that the GFF neural network gives the best result of the five. In the same scenario, RBF yields the highest detection rate for the DoS attack class. The PCA neural network has a higher detection rate when dealing with a single class (Normal or Attack) [14].

An intelligent alarm clustering model for a network-based intrusion detection architecture is investigated in this study. To separate intrusion alerts and sift out the unwanted ones, the authors suggested a novel combination of enhanced inhalation unit risk, principal component analysis (PCA), and electromagnetic amplification. To weed out unwanted or false positive alerts, they assigned each alert a level of severity (high risk, medium risk, and low risk). To classify intrusion alerts in network traffic, a fuzzy decision tree classifier was used. The author used mutual correlation feature selection to choose the top 10 features and then applied the fuzzy C4.5 decision tree algorithm to the training dataset. The experimental findings showed that the proposed system achieved 99% classification accuracy [15].

Anomaly detection in network communication helps in detecting attacks, failures, and misconfigurations in networks. To improve the detection quality of algorithms, the removal of irrelevant, redundant, and correlated features is necessary. Kaya et al. [16] report on the feature selection problems in network traffic. The authors proposed a multi-stage feature selection technique based on filters and step-wise regression wrappers, which reduces the 41 features to 16 with the help of combined feature selection. Different classification models are used for decision-making, such as Naïve Bayes classifiers, K-Nearest Neighbor, Decision Trees, Least Absolute Shrinkage and Selection Operator-Least Angle Regression, SVMs, and Artificial Neural Networks. The basic techniques used for feature selection are wrappers, filters, and hybrid or embedded methods. The article concluded that, out of 41 features, 16 showed a high contribution, 14 showed a low contribution, and 11 were negligible for intrusion detection. The authors carried out many experiments that revealed the redundant features and the correlation among the features [16].

With computers and the Internet becoming an integral part of our daily lives, the number of Web applications available online has grown exponentially. With the growing number of Web applications, the number and variety of online data exposures have increased. Login systems have been successfully used to detect web attacks and unauthorized access requests. In this study, machine learning strategies, namely Bayesian networks, SVM, neural networks, K-Nearest Neighbor, and decision trees, were used to evaluate sequence tagged site success, as well as the success and timing of the editor based on attack types. In experimental studies, KDDCup99 datasets were used [17].

With the advancement of information technology, many intrusion detection issues, such as cyber security, have emerged. The basic infrastructure for detecting a variety of attacks is provided by an intrusion detection system. This study focuses on the intrusion detection problem in network security. The primary goal is to determine whether network behavior is normal or abnormal. In this study, two different machine learning algorithms were combined to reduce their weaknesses while utilizing the best features of both algorithms. Its experimental results outperform other algorithms in terms of performance, accuracy, and false positive rate. These combined algorithms were applied to the KDDCup99 dataset to improve its performance, accuracy, and false positive rate [18].

This work suggests a Deep Learning-based methodology for identifying attacks and anomalies as irregular or regular in a virtual smart environment using the BoT-IoT dataset. These methods include deep neural network and long-short term memory-recurrent neural network classifiers for classification purposes [19]. The study’s goal was to investigate how an intrusion detection system should be represented using the A3C algorithm. This study analyzes the following: to assess the performance of the A3C algorithm in outlier detection, to assess the existing machine learning approaches used within IDS, to choose the best training set, and to get ready for the deployment of A3C. To examine these goals, a theoretical study was conducted [20].

A detailed analysis is done on the NSL-KDD dataset, which overcomes the redundancy of the KDD dataset. This newer dataset still has some problems, but it can be used because it gives more effective results than the original dataset. The analysis of this dataset is done using the Weka tool, which is used for developing new machine learning schemes. Many experiments are performed on this dataset using different algorithms, and the Random Forest algorithm shows high accuracy compared with the other algorithms. The analysis shows that the NSL-KDD dataset is well suited for comparing different intrusion detection models [21]. Many techniques have been proposed, but they had high computational complexity and were based on heuristic approaches. To overcome these deficiencies, the authors of that article proposed a new machine learning approach, a two-tier classification model based on Naïve Bayes and a KNN classifier, with Linear Discriminant Analysis for dimension reduction. Dimension reduction and data preprocessing for better decision-making take place in the first tier. In the second tier, k-nearest neighbors-collaborative filtering is performed to achieve better performance. The model converts a high-dimensional dataset into a low-dimensional one using the above-stated techniques. The proposed model performs better than the existing ones; its detection rate is comparatively high, and it can detect dangerous or rare attacks.

The most effective methods of intrusion detection are those that rely on machine learning. Machine learning-based techniques can be used to improve the current intrusion detection system’s performance. Using machine learning methods such as SVM and Structural Sparse Logistic Regression (SSPLR), the authors analyzed the valuable aspects that can identify attacks and abnormal behavior. The main goal of SSPLR is to execute feature selection to achieve better performance of IDS systems. It uses the shrinkage technique, which means that the features contained in the regression arrangement are used to attain discriminant feature selection. SSPLR is an advanced method that examines and processes data through regularization. On the other hand, the SVM technique is a two-stage approach, and it takes more time for testing and training. However, the performance of SVM is reasonable because of the enormous computational cost and interpretability. Therefore, the authors concluded that SSPLR is used to isolate group feature selection and SVM for the classification of network intrusion. Therefore, in this article, the main focus is on input features, which help in analyzing the types of attacks [22].

4 Problem definition

Increasing the correct distribution ratio while achieving high precision is still an open research area, and it is therefore the objective of this article. Numerous authors have investigated algorithms such as basic classification, K-Means, genetic algorithms, and SVM techniques, and have demonstrated the effectiveness and accuracy of these algorithms. In this study, we introduce a feature extraction method based on “k-means clustering” to detect and classify anomalies. Negative selection algorithms and a neural network architecture were used to analyze various anomaly detection techniques.
In previous work, signature-based techniques identified attacks by matching them against a purpose-built signature database. This approach is very effective, but the database must be maintained regularly, and information on new attacks must be analyzed.
The anomaly-based approach, by contrast, identifies irregular network activity by analyzing network behavior, which earlier signature-based works did not do. This technique is useful against zero-day attacks because it can identify attacks it has never seen before.
However, the previously used approach has limited real-world applicability. This limitation is addressed by applying an anomaly-based approach to new threats, covering a total of 24 attacks.

5 Proposed algorithms

In this study, negative selection algorithms and a neural network architecture were used to analyze various anomaly detection techniques, and K-Means is presented as the proposed method. The steps taken to implement the proposed work are as follows. Step 1 is to load the KDDcup99 dataset. The data were preprocessed using the internet infrastructure. Protocol classification on the refined data employs a negative selection method, which saves time and allows further classification depending on the given function.

5.1 Classification algorithm

The information extraction is performed using the negative selection algorithm, which is then extended to the K-MEANS classification. Finally, depending on the final classification stage, intrusion or non-intrusion data is assigned.

5.2 Diagram of the process

The method flow is depicted in Figure 2.

Figure 2

Flow diagram of the complete process.

Figure 2 shows the flow diagram of the system, which takes input and processes it for attack detection and further classification. The KDDCup99 dataset is loaded and preprocessed to identify any NaN or null values. Once the data are preprocessed, the system checks all the features that contribute to the identification of network attacks. Once the features have been grouped for this negative data classification, a label is set for each category of attack in the training data. The machine learning approach is then applied to detect attacks. In this work, k-means classification is performed, and the process repeats until training is complete. The trained model can then be applied to real-time network packets to identify new attacks.
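The k-means step in this flow can be sketched in plain Python as follows. This is a minimal illustration of the standard k-means iteration (assign each observation to the nearest mean, then recompute the means until the assignment stops changing), not the authors' implementation. The function names, the farthest-first choice of starting centroids, and the two-dimensional sample points are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Assumed deterministic seeding: start at the first point, then repeatedly
    add the point that is farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids

def predict(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))

def kmeans(points, k, iterations=100):
    """Alternate assignment and mean-update steps until the assignment is stable."""
    centroids = farthest_first(points, k)
    assignment = None
    for _ in range(iterations):
        new_assignment = [predict(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # training is done: no observation changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Hypothetical, already scaled two-feature observations.
normal_like = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
attack_like = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]

centroids, assignment = kmeans(normal_like + attack_like, k=2)
new_packet_cluster = predict((9.0, 9.5), centroids)
print(centroids, assignment, new_packet_cluster)
```

Once the centroids are learned, a feature vector extracted from a newly captured packet can be assigned to a cluster with `predict`, which mirrors the last step of the flow described above.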

5.3 Pseudocode

The pseudocode functionality of the proposed approach.

Input: KDD dataset library initialization, indexed dataset storage.

Output: Data classification, computational parameters, confusion matrix, comparison analysis graphically and statically.

Steps:

Begin [
    Loading DS (i-n) KDD                  # KDDCup dataset is used
    {
        DataloadModel ()                  # Processing for any NaN or Null values and to fit into ML
    }
    Processing of protocol Selection ()   # Feature selection applied for each type of attack with label
    {
        NegativeprotocolCollect ();       # Attack and benign classification based on the features is done.
        Filtration ();
        NegativeSelection;
        NegSelRefine ();
    }
    Obtaining refined data;
    Processing of K-MEANS:                # Training to the algorithm with reduced selected features
    K-MEANS ()
    {
        Initializing of N layer values;
        Computation of sigmoid Function ();
        hidden LayerInit ();
        weight Optimization ();
    }
    Performing K-MEANS Classify();        # applying the K-Means to check the efficiency
    Return classification results;
    Computations()
    {
        confusionMatrix();
        efficiencyParam();
    }
    Return comparison analysis;
]
end;

Thus, the above pseudocode discusses the core functions of the algorithm performed over the execution of the proposed system.
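The final computation steps of the pseudocode, confusionMatrix() and efficiencyParam(), could look roughly like the following Python sketch. It assumes a binary attack/normal labeling and uses the common definitions of accuracy, detection rate, false positive rate, and precision. The function names and the sample labels are hypothetical, and the article does not state which parameters its implementation computes.

```python
def confusion_matrix(actual, predicted, positive="attack"):
    """Count true/false positives and negatives for a binary labeling."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0

def efficiency_params(cm):
    """Derive common evaluation parameters from the confusion matrix."""
    total = sum(cm.values())
    return {
        "accuracy": ratio(cm["tp"] + cm["tn"], total),
        "detection_rate": ratio(cm["tp"], cm["tp"] + cm["fn"]),
        "false_positive_rate": ratio(cm["fp"], cm["fp"] + cm["tn"]),
        "precision": ratio(cm["tp"], cm["tp"] + cm["fp"]),
    }

# Hypothetical ground truth and classifier output for six connections.
actual = ["attack", "attack", "attack", "normal", "normal", "attack"]
predicted = ["attack", "attack", "normal", "normal", "attack", "attack"]

cm = confusion_matrix(actual, predicted)
params = efficiency_params(cm)
print(cm, params)
```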

6 Result analysis

The experiments yielded a high degree of accuracy and a high detection rate. An optimal result is found when the experimental findings are compared with the conventional technique [23]. A comparison between existing solutions and the proposed work demonstrates the validity of the latter. The computation parameters and observed results are discussed here.

Our implementation was developed using the NetBeans IDE. The simulation required certain hardware and software. The hardware used for the proposed system was an Intel Core i7 processor, a 1 TB hard disk, and 8 GB of RAM. The software used was Windows 10 as the operating system, Python as the language, CSV files as the database, and NetBeans and XAMPP as the development environment.

6.1 Programming environment

6.1.1 Python

Python is an interpreted, high-level, general-purpose programming language. Its design philosophy prioritizes code readability, as demonstrated by its use of significant indentation. Its language constructs and object-oriented approach are intended to help programmers write clear, logical code for both small and large projects.

Python is dynamically typed and garbage collected. It supports a variety of programming paradigms, including structured (particularly procedural), object-oriented, and functional programming [24]. Because of its comprehensive standard library, Python is often described as a “batteries included” language.

6.2 KDD Cup 99 dataset

Our evaluation was carried out using the KDD Cup 99 dataset. The dataset contains a standard set of data to be audited, including a wide variety of intrusions simulated in a military network environment, and it serves as a standard benchmark for testing IDSs. In 1998, the DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Laboratory. The aim of the program was to survey and evaluate research in intrusion detection. The program provided a common dataset, the DARPA98 dataset, which included a wide variety of intrusions simulated in a military network environment. The DARPA98 dataset was used as both a training and a testing dataset to evaluate the performance of IDSs. Lincoln Laboratory collected 9 weeks of raw transmission control protocol (TCP) dump data. The DARPA98 dataset comprises 7 weeks of training data and 2 weeks of test data. The KDD Cup 99 dataset is considered an updated version of the DARPA98 dataset. It was used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The task of the competition was to build a network intrusion detector, a predictive model capable of distinguishing between bad connections, known as intrusions or attacks, and good normal connections.

The KDD 99 intrusion detection dataset is made up of several parts:
kddcup.data: contains the full training data, i.e., 7 weeks of network traffic. The number of records is 4,940,210.
kddcup.data_10_percent: comprises 10% of the full training data. The number of records is 494,021.
kddcup.testdata.unlabeled: contains the full test data, all of which are unlabeled. The number of records is 2,984,153.
kddcup.testdata.unlabeled_10_percent: contains 10% of the full unlabeled test data. The number of records is 311,029.
kddcup.newtestdata_10_percent_unlabeled: contains a refined 10% subset of the full unlabeled test data. The number of records is 311,079.
Corrected: contains the test data with labels. The number of records is 311,029.
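Loading and cleaning records from such a file can be sketched with Python's standard csv module. The sketch assumes the usual layout of 41 comma-separated features followed by a label that ends with a period. The sample rows are hypothetical (zero-padded for brevity), and the helper names `make_line` and `load_records` are not from the article.

```python
import csv
import io

NUM_FEATURES = 41  # assumed number of features per connection record

def make_line(head, label):
    """Build a hypothetical record: a few leading fields, zero padding, then the label."""
    padding = ["0"] * (NUM_FEATURES - len(head))
    return ",".join(head + padding + [label])

SAMPLE = "\n".join([
    make_line(["0", "tcp", "http", "SF", "181", "5450"], "normal."),
    make_line(["0", "icmp", "ecr_i", "SF", "1032", "0"], "smurf."),
    make_line(["0", "tcp", "private", "S0", "", "0"], "neptune."),  # missing value
    "0,tcp,http,SF",                                                 # truncated row
])

def load_records(text):
    """Parse records, dropping rows that are malformed or contain NaN/null values."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != NUM_FEATURES + 1:
            continue  # wrong number of columns
        *features, label = row
        if any(value.strip().lower() in ("", "nan", "null") for value in features):
            continue  # missing value in a feature column
        records.append((features, label.rstrip(".")))
    return records

records = load_records(SAMPLE)
labels = [label for _, label in records]
print(len(records), labels)
```

For a real file, the same function could be fed the contents read from disk instead of the in-memory `SAMPLE` string.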

6.3 Different attacks

In the KDD Cup 99 dataset, there are 25 different forms of attacks. These attacks fall into four categories: DoS, U2R, R2L, and probe attacks.
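Grouping individual attack labels into categories can be sketched as a simple lookup. The mapping below assumes the four categories DoS, U2R, R2L, and probe and covers only the attacks named in this article: the DoS entries follow Section 6.5, guess_passwd follows Section 6.7, and ipsweep is treated as a probe attack. The names `ATTACK_CATEGORY` and `categorize` and the sample labels are hypothetical.

```python
from collections import Counter

# Partial, illustrative mapping from attack label to attack category.
ATTACK_CATEGORY = {
    "back": "dos",
    "neptune": "dos",
    "pod": "dos",
    "smurf": "dos",
    "teardrop": "dos",
    "guess_passwd": "r2l",
    "ipsweep": "probe",
}

def categorize(label):
    """Map a record label to its category; labels outside the mapping are 'unknown'."""
    label = label.rstrip(".")  # tolerate a trailing period on the label
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")

sample_labels = ["smurf.", "smurf.", "neptune.", "normal.", "guess_passwd.", "ipsweep."]
category_counts = Counter(categorize(label) for label in sample_labels)
print(category_counts)
```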

6.4 Training and testing dataset for the proposed framework

Because of the large number of records in the full training set, this system is tested on the 10% subset of the KDD'99 data collection. One categorical feature in the labeled dataset indicates whether a connection is normal or a specific type of attack. The training dataset was kddcup.data_10_percent, and the testing dataset was Corrected. The subset used for training contained 490,015 connections, with 392,737 attacks and 97,278 normal ones. The network attacks in this subset were back, guess_passwd, ipsweep, neptune, pod, smurf, and teardrop. Besides guess_passwd attacks, the training dataset included 2,203 back, 1,247 ipsweep, 107,201 neptune, 264 pod, 280,790 smurf, and 979 teardrop attacks. The testing dataset contained 288,555 connections, of which 227,962 were attacks and 60,593 were normal. It included 1,098 back, 4,367 guess_passwd, 306 ipsweep, 58,001 neptune, 87 pod, 164,091 smurf, and 12 teardrop attacks.
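The reported counts can be cross-checked with a few lines of Python. The figures below are copied from the paragraph above. No guess_passwd count is given for the training subset, so the sketch leaves it out and only computes how many training attacks are not covered by the listed per-attack counts.

```python
# Training subset (kddcup.data_10_percent) as reported in the text.
train_total, train_attacks, train_normal = 490_015, 392_737, 97_278
train_by_type = {
    "back": 2_203, "ipsweep": 1_247, "neptune": 107_201,
    "pod": 264, "smurf": 280_790, "teardrop": 979,
}

# Testing dataset (Corrected) as reported in the text.
test_total, test_attacks, test_normal = 288_555, 227_962, 60_593
test_by_type = {
    "back": 1_098, "guess_passwd": 4_367, "ipsweep": 306, "neptune": 58_001,
    "pod": 87, "smurf": 164_091, "teardrop": 12,
}

# Attacks in the training subset not covered by the listed per-attack counts.
unlisted_train = train_attacks - sum(train_by_type.values())
unlisted_test = test_attacks - sum(test_by_type.values())

train_attack_share = train_attacks / train_total
test_attack_share = test_attacks / test_total

print(unlisted_train, unlisted_test)
print(f"{train_attack_share:.1%} of training and {test_attack_share:.1%} of testing connections are attacks")
```

The totals are internally consistent: attacks plus normal connections add up to the stated totals in both sets, and the listed test counts sum exactly to the number of test attacks.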

6.5 DoS

In a DoS attack, the intruder or hacker sends a large number of requests, which keeps the server and its memory resources too busy to serve legitimate requests. Ultimately, the attack denies clients access to the machine. The DoS class includes many kinds of attacks:

Back attack: Under DoS class, the bad attack comes at first. It was initially launched on the Apache web server, which at the time was flooded with numerous requests with front slashes (/) in large numbers, and this is generally a character in the URL description.
Neptune attack: The Neptune attack is also included in the DoS class. Memory resources can be made unreasonably full for any unexpected type of loss by delivering the TCP packet to begin the session of TCP as previously described. It means memory resources can be made extremely full for any unforeseen sort of loss by delivering the TCP packet to begin the session of TCP as previously explained. Simultaneously, this package contains a three-way handshake, and it is intended that it will result in TCP relationships being established between two hosts. As a result, the unfortunate fatality is unable to finish the handshake, despite the fact that a significant amount of framework memory was distributed for this relationship. After a huge number of these packages have been sent, the injured person runs out of memory resources.
Pod attack: “The ping of death” occurs when an attacker delivers an excessively large packet size (more than 65,535) as a ping request. In spite of the fact that it is illegal to send a ping heap of this size, a package of this size can be sent if it is separated. The victim’s computer frequently crashes as a result of a cushion flood in this attack.
Smurf attack: Internet Control Message Protocol (ICMP) is included in attacks of this kind in enormous quantities. These bundles are there with ridiculed sources of the suggested injured individuals for IPs. These communicate to the system with the help of broadcasting IP addresses. On the system, as a result of this, all hosts are supposed to answer to the demand of ICMP, and hence, this makes the traffic genuine for the PC’s unfortunate casualty. If n hosts, e.g., are connected to the network, in this case, the attacker can also make a simple host list for packet reply to the victim, which can be done by sending packets to the concerned network.
Teardrop attack: In a teardrop attack, the attacker sends IP fragments in which the offset value of a subsequent fragment is confusing, for example because it overlaps the preceding fragment. If the receiving system does not know how to handle such a situation, it is likely to crash.
DoS Slowloris attack: In the CICIDS2017 dataset, the DoS Slowloris attack was recorded on Wednesday, together with the other DoS attacks. The main aim of this attack is to deplete the resources of the victim device so that legitimate users are blocked from receiving data.
DoS SlowHTTPTest attack: The DoS SlowHTTPTest attack was also recorded on Wednesday in the CICIDS2017 dataset, together with the other DoS attacks. This attack takes advantage of the TCP window size functionality, so that the resources of the victim’s server are wasted and legitimate users cannot use its services.
Heartbleed attack: This attack uses Heartleech, a Heartbleed exploit tool written in the C programming language. According to the CICIDS2017 dataset report, this attack occurs on Wednesday afternoon.
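The two fragmentation-based attacks described above (pod and teardrop) can be illustrated with simple checks on fragment offsets and lengths. The following Python sketch is only an illustration under simplifying assumptions: each fragment is reduced to a hypothetical (offset, length) pair in bytes, IP header bytes are ignored, and the function names are not part of the proposed framework.

```python
from typing import List, Tuple

MAX_IP_PACKET = 65_535  # maximum size of an IPv4 datagram in bytes

# A fragment is modelled as (offset_bytes, payload_length_bytes).
# This simplified representation is an assumption made for illustration.
Fragment = Tuple[int, int]


def exceeds_max_size(fragments: List[Fragment]) -> bool:
    """Pod-style check: would any fragment end beyond 65,535 bytes?"""
    return any(offset + length > MAX_IP_PACKET for offset, length in fragments)


def has_overlap(fragments: List[Fragment]) -> bool:
    """Teardrop-style check: does a fragment start before the previous one ends?"""
    ordered = sorted(fragments)  # sort by offset
    for (prev_off, prev_len), (off, _) in zip(ordered, ordered[1:]):
        if off < prev_off + prev_len:
            return True
    return False


# Hypothetical fragment lists.
oversized = [(0, 1480), (65_000, 1480)]    # second fragment ends at 66,480
overlapping = [(0, 1480), (1000, 1480)]    # second fragment starts inside the first
well_formed = [(0, 1480), (1480, 1480)]

print(exceeds_max_size(oversized), has_overlap(overlapping), has_overlap(well_formed))
```

A real packet inspector would read the offset and length fields from captured IP headers; the tuples above only stand in for those fields.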

6.6 U2R or user-to-root

Attacks of this kind are highly harmful. The attacker starts with a normal user account on a computer and attempts to exploit the computer’s vulnerabilities in order to gain root privileges.

6.7 R2L or remote-to-user

Attacks of this kind are also harmful. The attacker sends packets to a machine over the network but has no account on that machine, and exploits its vulnerabilities to gain the access privileges of a local user.

Guess_Passwd attack: R2L attacks also include the guess_passwd attack, in which an intruder repeatedly guesses potential passwords in order to gain access to a user’s account. Any service that requires a password for login is a likely target.
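As an illustration of how repeated password guessing might be flagged, the following Python sketch counts failed logins per source and service. The event layout, the function name, and the threshold of five failures are assumptions made for this example and are not taken from the study.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Each event is (source_ip, service, login_succeeded).
# The record layout and the default threshold are illustrative assumptions.
LoginEvent = Tuple[str, str, bool]


def suspected_guessers(events: Iterable[LoginEvent],
                       max_failures: int = 5) -> List[Tuple[str, str]]:
    """Return (source_ip, service) pairs with more than max_failures failed logins."""
    failures = Counter(
        (src, service) for src, service, ok in events if not ok
    )
    return sorted(pair for pair, count in failures.items() if count > max_failures)


# Hypothetical log: one source fails six times on telnet, another fails twice on ftp.
events = (
    [("10.0.0.9", "telnet", False)] * 6
    + [("10.0.0.5", "ftp", False)] * 2
    + [("10.0.0.5", "ftp", True)]
)
print(suspected_guessers(events))
```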

6.8 Probing

In a probing attack, hackers scan networks and computers for weaknesses that can be exploited to breach the system.

Ipsweep attack: The IPSweep attack belongs to the Probe class. It is a surveillance sweep that determines which hosts on a network are active. The sweep reveals the running hosts and the services they offer; attackers then use these details to look for vulnerable computers to attack.
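A sweep of this kind shows up as one source contacting many distinct hosts. The following Python sketch illustrates that idea; the (source, destination) record layout, the function name, and the threshold of ten hosts are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def sweeping_sources(connections: Iterable[Tuple[str, str]],
                     min_hosts: int = 10) -> List[str]:
    """Return sources that contacted at least min_hosts distinct destinations."""
    targets: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in connections:
        targets[src].add(dst)
    return sorted(src for src, dsts in targets.items() if len(dsts) >= min_hosts)


# Hypothetical traffic: one source touches 20 hosts, another talks to one host 30 times.
connections = (
    [("10.0.0.9", f"192.168.1.{i}") for i in range(1, 21)]
    + [("10.0.0.5", "192.168.1.1")] * 30
)
print(sweeping_sources(connections))
```

Counting distinct destinations rather than total connections keeps a busy but legitimate client, such as the second source here, from being flagged.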

6.9 False positives and false negatives

Four possible events are tracked, and their ratios determine the IDS’s performance and detection accuracy. These events are represented in Figure 3.

Legitimate events that are mistakenly labeled as anomalous are false positives. Anomalous events that are identified correctly are true positives. Anomalous events that the detector misses, so that the attack is not identified, are false negatives [25]. True negatives are events that have been correctly identified as legitimate actions. A network operator must examine a reported event to decide whether it is a false positive or a false negative.
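The four events can be tallied from pairs of ground-truth labels and detector decisions. The following Python sketch shows one way to do this; the function names and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def outcome(is_attack: bool, flagged: bool) -> str:
    """Map one (ground truth, detector decision) pair to TP, FP, FN, or TN."""
    if is_attack and flagged:
        return "TP"  # attack correctly detected
    if is_attack and not flagged:
        return "FN"  # attack missed by the detector
    if flagged:
        return "FP"  # legitimate event mistakenly labeled anomalous
    return "TN"      # legitimate event correctly left alone


def tally(pairs: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count how often each of the four events occurs."""
    return Counter(outcome(is_attack, flagged) for is_attack, flagged in pairs)


# Hypothetical labelled events: (is_attack, flagged_by_ids)
events = [(True, True), (True, False), (False, True), (False, False), (False, False)]
print(tally(events))
```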

6.10 Network data mining

Network data mining (NDM) may be used for a variety of purposes. First, it can produce summary information about the monitored data that have been processed. This allows for the identification of dominant traits and outliers in data records that may be considered anomalous or suspect. Second, NDM may be used to describe rules or patterns that are unique to some types of traffic, such as standard web traffic or traffic seen during a DoS attack. These rules and patterns can then be applied to new collections of monitoring data to see whether they have the same properties and features as the original data. Network intrusion detection systems (NIDS) and traffic analyzers that classify and identify traffic flows are two obvious applications that benefit from such rules and patterns.
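The first use of NDM, summarizing monitored data and spotting outliers, can be illustrated with a z-score check on a single feature. This is a minimal sketch and not the method used in the study: the feature (bytes per connection), the values, and the threshold of 2.0 are assumptions chosen for the example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def outlier_indices(values: Sequence[float], z_threshold: float = 2.0) -> List[int]:
    """Return positions whose z-score magnitude exceeds z_threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical, so nothing stands out
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


# Hypothetical bytes-per-connection values for six monitored connections.
bytes_per_connection = [200, 220, 210, 190, 205, 5000]
print(outlier_indices(bytes_per_connection))  # only the last record is flagged: [5]
```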

Accuracy analysis:

A true positive (TP) occurs when an intrusion is correctly predicted. A true negative (TN) occurs when no threat is present and none is reported. If the IDS reports an intruder but the report is wrong, a false positive (FP) alert is raised. A false negative (FN) occurs when traffic is judged to be non-intrusive and the intruder continues to operate. This is the worst-case scenario: the detection mechanisms appear to operate normally, yet the intrusion goes unreported. To evaluate our IDS, we used two measures: accuracy and detection rate. Accuracy (ACC) is the number of correct results divided by the total number of results:
ACC = (TP + TN)/(TP + FP + FN + TN)
Detection rate:

The detection rate (DR), on the other hand, is the proportion of real intrusions that raise an alarm:
DR = TP/(TP + FN)
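The ACC and DR definitions can be checked with a short calculation. In the following Python sketch, the counts are hypothetical values chosen for easy arithmetic; they are not results from this study.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + FP + FN + TN)."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def detection_rate(tp: int, fn: int) -> float:
    """DR = TP / (TP + FN)."""
    attacks = tp + fn
    return tp / attacks if attacks else 0.0


# Hypothetical counts: 100 attacks (90 detected) and 900 normal events (20 false alarms).
tp, fn, fp, tn = 90, 10, 20, 880
print(accuracy(tp, tn, fp, fn))  # 970 / 1000 = 0.97
print(detection_rate(tp, fn))    # 90 / 100 = 0.9
```

The example also shows why both measures are reported: accuracy is dominated by the large number of normal events, whereas the detection rate depends only on how many attacks were caught.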
Statistical analysis:

Table 1 summarizes the results of our study in accordance with the above notions. Our IDS is capable of detecting abnormalities with a high degree of precision and a high rate of detection.

Table 1

Illustrations of the attacks that generally fall into four major existing categories

Category	Attacks
DoS attacks	Back, land, Neptune, pod, smurf, teardrop, DoS SlowHTTPTest, DoS Slowloris, and Heartbleed
U2R attacks	Buffer_overflow, loadmodule, Perl, and rootkit
R2L attacks	Ftp_write, guess_passwd, imap, multihop, phf, spy, warezclient, and warezmaster
Probes	Satan, ipsweep, nmap, and portsweep

Precision = TP/(TP + FP)

Recall = TP/(TP + FN)

F-Measure = 2 × ((Precision × Recall)/(Precision + Recall))
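The three formulas above can be evaluated directly. The following Python sketch uses hypothetical counts (TP = 90, FP = 20, FN = 10) that are not results from this study, and then applies the F-measure formula to the precision and recall reported for K-Means in Table 2.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p: float, r: float) -> float:
    """F-Measure = 2 * (P * R) / (P + R)."""
    return 2 * (p * r) / (p + r) if (p + r) else 0.0


# Hypothetical counts chosen for illustration.
p = precision(90, 20)   # 90 / 110, about 0.818
r = recall(90, 10)      # 90 / 100 = 0.9
print(round(f_measure(p, r), 3))        # about 0.857

# Precision and recall reported for K-Means in Table 2.
print(round(f_measure(0.96, 0.95), 3))  # about 0.955
```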

Table 2 compares the two methods.

Table 2

An investigation into the effectiveness of several algorithms against the BOT attack

Algorithms	Accuracy	Detection rate time	Precision	Recall
KNN	8.2	0.82	0.85	0.85
K-MEANS (proposed algorithm)	9.0	0.98	0.96	0.95

6.11 Graphical representation for comparison analysis of algorithms

Figure 3 presents a graphical analysis of the results, which gives a visual summary of the statistical comparison.

Figure 3

Comparison between the analyses of the algorithms.

This section compares the results of the existing technique (KNN) with those of the proposed approach (K-Means), based on the outcomes measured in the implementation. The KNN algorithm results are shown in Figure 4, and the K-Means algorithm results are shown in Figure 5.

Figure 4

KNN algorithm graph.

Figure 5

K-Means algorithm graph.

7 Conclusion and future work

This work provides a K-Means machine learning methodology for detecting DoS, Probe, U2R, and R2L attacks, evaluated through a comparative analysis. Using the K-Means technique, we can determine how similar the attack groups are. The tests on classifying normal and attack connections demonstrate that the KDD Cup 99 dataset can serve as a useful benchmark dataset for comparing different intrusion detection techniques. Future work will examine how other data mining algorithms classify the attack categories, and how the approach performs on other datasets from real-time environments.
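For readers who want to see the clustering step concretely, the following Python sketch implements the standard K-Means iteration (assign each observation to the closest mean, then recompute the means). It is a minimal illustration, not the implementation used in this study: the two-dimensional points, the choice of initial centroids, and the function names are assumptions, and real KDD Cup 99 records would first need to be converted to numeric feature vectors.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points: List[Point], initial: List[Point],
            iterations: int = 20) -> Tuple[List[List[float]], List[int]]:
    """Run K-Means from the given initial centroids; return (centroids, labels)."""
    centroids = [list(c) for c in initial]
    labels = [nearest(p, centroids) for p in points]
    for _ in range(iterations):
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels


# Hypothetical two-feature records forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (9.0, 9.0), (9.2, 8.8), (8.8, 9.2)]
centroids, labels = k_means(points, initial=[points[0], points[3]])
print(labels)
```

Passing the initial centroids explicitly keeps the sketch deterministic; practical implementations usually choose them at random or with a seeding heuristic.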

Funding information: This research did not receive funding of any kind.
Author contributions: Ravi Shanker: conceptualization; data curation; project administration; formal analysis; supervision; investigation; methodology; writing – original draft. Prateek Agrawal: conceptualization; data curation; resources; formal analysis; investigation; validation; visualization; methodology; writing – original draft. Aman Singh: conceptualization; data curation; software; formal analysis; validation; visualization; investigation; methodology; writing – original draft. Mohammed Wasim Bhatt: conceptualization; data curation; supervision; formal analysis; investigation; methodology; writing – review & editing.
Conflict of interest: The authors report that they have no conflict of interest.
Data availability statement: Data will be made available on request.

References

[1] Hoque MS, Mukit MA, Bikas MA. An implementation of intrusion detection system using genetic algorithm. ArXiv; 2012. 10.48550/ARXIV.1204.1336.

[2] King CM, Dalton C, Osmanoglu E. Security architecture: design, deployment and errands. New York (NY), USA: McGraw Hill; 2003.

[3] Marinova-Boncheva V. A short audit of intrusion detection system. Probl Eng Cybern Robot. 2007;58:23–30.

[4] Borrelli NF, Seward TP, Koch KW, Lamberson LA. Anderson localization light guiding in a two-phase glass. J Mod Phys. 2022;13(5):768–75. 10.4236/jmp.2022.135045.

[5] Denning DE. An intrusion-detection model. In IEEE Transactions on Software Engineering. 1987;SE-13(2):222–32. 10.1109/tse.1987.232894.

[6] Banković Z, Stepanović D, Bojanić S, Nieto-Taladriz O. Improving network security using genetic algorithm approach. Comput Electr Eng. 2007;33(5–6):438–51. 10.1016/j.compeleceng.2007.05.010.

[7] Wang G, Hao J, Ma J, Huang L. A new approach to intrusion detection using Artificial Neural Networks and fuzzy clustering. Expert Syst Appl. 2010;37(9):6225–32. 10.1016/j.eswa.2010.02.102.

[8] Om H, Kumar Gupta A. Design of host based intrusion detection system using fuzzy inference rule. Int J Comput Appl. 2013;64(9):39–46. 10.5120/10666-5442.

[9] Beghdad R. Critical study of neural networks in detecting intrusions. Comput Secur. 2008;27(5–6):168–75. 10.1016/j.cose.2008.06.001.

[10] Siraj MM, Maarof MA, Hashim SZM. Intelligent alert clustering model for network intrusion analysis. Int J Adv Soft Comput Appl. 2009;1(1):33–48.

[11] Hlaing T. Feature selection and fuzzy decision tree for network intrusion detection. Int J Inform Commun Technol (IJ-ICT). 2012;1(2):109–18. 10.11591/ij-ict.v1i2.591.

[12] Ritchey RP, Perry R. Machine learning toolkit for system log file reduction and detection of malicious behavior. IEEE INFOCOM 2021 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS); 2021 May 10–13; Vancouver (BC), Canada. IEEE, 2021. 10.1109/infocomwkshps51825.2021.9484572.

[13] Yao Q, Shabaz M, Lohani TK, Wasim Bhatt M, Panesar GS, Singh RK. 3D modelling and visualization for vision-based vibration signal processing and measurement. J Intell Syst. 2021;30(1):541–53. 10.1515/jisys-2020-0123.

[14] Salo F, Injadat M, Nassif AB, Shami A, Essex A. Data mining techniques in intrusion detection systems: A systematic literature review. IEEE Access. 2018;6:56046–58. 10.1109/access.2018.2872784.

[15] Li C, Niu H, Shabaz M, Kajal K. Design and implementation of intelligent monitoring system for platform security gate based on wireless communication technology using ML. Int J Syst Assur Eng Manag. 2022;13:298–304. 10.1007/s13198-021-01402-6.

[16] Kaya C, Yildiz O, Ay S. Performance analysis of machine learning techniques in intrusion detection. 2016 24th Signal Processing and Communication Application Conference (SIU); 2016 May 16–19; Zonguldak, Turkey. IEEE, 2016. 10.1109/siu.2016.7496029.

[17] Saini GK, Chouhan H, Kori S, Gupta A, Shabaz M, Jagota V, et al. Recognition of human sentiment from image using machine learning. Ann Romanian Soc Cell Biol. 2021;1802–8.

[18] Revathi S, Malathi A. A detailed analysis on NSL-KDD dataset using various machine learning techniques for intrusion detection. Int J Eng Res Technol (IJERT). 2013;2(12):1848–53.

[19] Jan S, Masoodi F, Bamhdi AM. Effective intrusion detection in IoT environment: deep learning approach. In: Pal R, Shukla PK, editors. SCRS Conference Proceedings on Intelligent Systems. Soft Computing Research Society, 2021. p. 495–502. 10.52458/978-93-91842-08-6-47.

[20] Zhou K, Wang W, Hu T, Deng K. Application of improved asynchronous advantage actor critic reinforcement learning model on anomaly detection. Entropy. 2021;23(3):274. 10.3390/e23030274.

[21] Lokhande MP, Patil DD, Patil LV, Shabaz M. Machine-to-machine communication for device identification and classification in secure telerobotics surgery. Secur Commun Netw. Vol. 2021, Hindawi Limited; 2021. p. 1–16. 10.1155/2021/5287514.

[22] Pajouh HH, Dastghaibyfard G, Hashemi S. Two-tier network anomaly detection model: a machine learning approach. J Intell Inf Syst. 2015;48(1):61–74. 10.1007/s10844-015-0388-x.

[23] Mehbodniya A, Alam I, Pande S, Neware R, Rane KP, Shabaz M, et al. Financial fraud detection in healthcare using machine learning and deep learning techniques. Secur Commun Netw. 2021;9293877. 10.1155/2021/9293877.

[24] Alzahrani AS, Shah RA, Qian Y, Ali M. A novel method for feature learning and network intrusion classification. Alex Eng J. 2020;59(3):1159–69. 10.1016/j.aej.2020.01.021.

[25] Mehbodniya A, Webber JL, Shabaz M, Mohafez H, Yadav K. Machine learning technique to detect Sybil attack on IoT based sensor network. IETE J Res. 2021;1–9. 10.1080/03772063.2021.2000509.

Received: 2023-04-18

Revised: 2023-05-16

Accepted: 2023-06-01

Published Online: 2023-07-12

This work is licensed under the Creative Commons Attribution 4.0 International License.
