Analyzing SQL payloads using logistic regression in a big data environment

Omar Salah F. Shareef; Rehab Flaih Hasan; Ammar Hatem Farhan

doi:10.1515/jisys-2023-0063

Article Open Access

Published/Copyright: September 5, 2023

From the journal Journal of Intelligent Systems Volume 32 Issue 1

Abstract

Protecting big data from attacks is essential for large organizations because of how vital such data are to organizations and individuals. Such data are put at risk when attackers gain unauthorized access to information and use it illegally. One of the most common such attacks is the structured query language injection attack (SQLIA). This attack exploits a vulnerability that allows attackers to access a database illegally, quickly, and easily by manipulating structured query language (SQL) queries, a risk that is especially serious in a big data environment. To address these risks, this study builds an approach that acts as a middle protection layer between the client and database server layers and reduces the time needed to classify the SQL payload sent from the user layer. The proposed method trains a logistic regression model, a machine learning (ML) technique, with the Spark ML library, which handles big data. An experiment was conducted using the SQLI dataset. Results show that the proposed approach achieved an accuracy of 99.04, a precision of 98.87, a recall of 99.89, and an F-score of 99.04. The time taken to identify and prevent SQLIA is 0.05 s. Our approach can protect the data by using the middle layer. Moreover, using the Spark ML library with ML algorithms gives better accuracy and shortens the time required to determine the type of request sent from the user layer.

Keywords: big data; logistic regression; Spark ML; SQL injection.

1 Introduction

Security has become a crucial component when developing web apps because of the massive amount of data sent between businesses and the rising number of everyday users in various areas. Therefore, enterprises’ big data require a web application architecture that can detect and stop application flaws. The Open Web Application Security Project considers structured query language injection attack (SQLIA) among the most dangerous threats to enterprise-scale databases [1,2].

The big data discipline uses a multi-scientific approach to analyzing and forecasting data, combining computer science, mathematical modeling, and statistics. Access to data and methods for working with it have emerged as critical factors. Companies may reliably manage big data by using and implementing artificial intelligence and machine learning (ML) techniques [3].

A growing number of security risks are associated with the widespread use and storage of data online. These risks arise from the proliferation of attacks that try to gain unauthorized access to the private information of people and organizations [4].

SQLIA is among the most harmful assaults on database servers. By taking advantage of security holes, attackers may compromise users’ and businesses’ data by tampering with, reading, erasing, or making copies of it [5].

Structured query language (SQL) injection flaws can exist in any parameter that a program uses to send a query to a database. An attacker may use various techniques in this kind of attack to gain unauthorized entry to databases and extract information. These techniques are described by the injection mechanism and are primarily divided into four categories: injection through cookies, injection through user input, injection through server variables, and second-order or stored injections [6].

The increasing data exchange between individuals and institutions and daily transactions in various fields have made data vulnerable to many attacks, such as illegal access. One of the most well-known attacks is SQLIA. Standard methods for detecting and preventing these attacks can provide good results when dealing with small data. However, these approaches do not work effectively with big data. Hence, another approach must be developed to deal with big data and detect attacks against them.

This study was conducted to overcome the problems in previous works, which often did not report the time taken during the testing phase to determine whether a request sent by the user contains a harmful or benign payload. In addition, previous works did not address how the data are protected when the protection model is placed alongside the user layer or the data layer. Accordingly, the aims of this study are as follows:

To create a layer that separates the user layer from the data layer to increase data protection and prevent unauthorized access.
To protect user and institutional data, ensuring confidentiality, integrity, and prompt availability of data.
To reduce the time required to classify the payloads sent to the data layer.

In this research, we present an approach for detecting SQLIA in real time by applying logistic regression (LR) in a big data environment. The approach uses the distributed Spark ML system and acts as a middle protection layer between the client and the database server to increase data protection and the classification accuracy of the sent payload. The contributions of this model are as follows:

The first contribution of this model is that it proposes an approach that uses a middle layer between the data layer and the user layer to receive SQL payloads from the user layer and analyze them to classify whether the request is harmful or benign by using the LR approach with the big data framework Spark ML library. This layer prevents users from directly accessing data, thereby further protecting the data layer from unauthorized access and from violations of the principles of basic information security.
The second contribution is that the time taken to classify the request type is reduced by using the Spark ML library because Spark ML works in the memory in a distributed way, thereby reducing the time taken to classify the payload type.

The subsequent sections are organized as follows. The second section of this study will address the proposed methodology for identifying and mitigating SQLIA within a big data setting. The third section presents the outcomes. The final section provides the conclusions.

2 Methodology

The proposed framework for detecting SQL injection attacks in a big data environment consists of three layers. The first layer is the user layer, through which user requests are sent. The second layer represents the protection layer, which includes the proposed framework for classifying requests sent from the first layer. This layer consists of several stages, as illustrated in the following.

Stage 1: Data that contain both malicious and benign payloads are collected to train the proposed model.

Stage 2: Pre-processing is applied to the acquired data.

Stage 3: The acquired data are divided into two sets for training and testing.

Stage 4: The first dataset is used to train the proposed model using LR.

Stage 5: The second dataset is used to test the model.

Stage 6: The model is evaluated using a confusion matrix and a set of metrics to measure the performance efficiency.

In the second layer, the LR approach determines whether incoming request payloads are harmful. The third layer is the data layer, which must be protected from unauthorized individuals who try to access the data.
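The three-layer arrangement can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `protection_layer`, `toy_classify`, and `toy_run_query` are hypothetical names, and the keyword-based classifier is a placeholder for the trained LR model described in this paper.

```python
def protection_layer(payload, classify, run_query):
    """Forward the payload to the data layer only if it is classified as benign.

    classify:  callable returning 1 for a harmful payload and 0 for a benign one.
    run_query: callable standing in for the database server (hypothetical).
    """
    if classify(payload) == 1:
        # Harmful payloads never reach the data layer.
        return {"allowed": False, "result": None}
    return {"allowed": True, "result": run_query(payload)}


def toy_classify(payload):
    """Placeholder classifier for illustration only; not the trained LR model."""
    suspicious = ("' or ", "union select", "--")
    text = payload.lower()
    return 1 if any(marker in text for marker in suspicious) else 0


def toy_run_query(payload):
    """Placeholder data layer that reports what it would execute."""
    return f"executed: {payload}"


benign = protection_layer("SELECT name FROM users WHERE id = 7", toy_classify, toy_run_query)
harmful = protection_layer("1' OR '1'='1' --", toy_classify, toy_run_query)
print(benign["allowed"], harmful["allowed"])
```

The point of the design is that the user layer only ever calls the middle layer, so the decision to forward or block is made before any query reaches the database.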

The LR approach takes three parameters. The first, “features,” represents the condition feature. The second, “label_col,” represents the decision feature. The third, maxiteration, sets the maximum number of training iterations and serves as the stopping criterion for training the model. The flow of the experiment is shown in Figure 1.

Figure 1

Proposed system to classify submitted queries.

The proposed approach is described in the following subsections.

2.1 SQL-i datasets

Collecting data pertaining to the research subject matter is essential in developing an ML approach. This study used a dataset comprising 109,518 instances categorized into two groups on the basis of their payloads: those with malicious and those with non-malicious intentions. The dataset is divided into two parts, one for training and one for testing.

The following table provides an overview of the dataset (Table 1).

Table 1

Summary of the dataset

Name of dataset	Number of cases	Training cases	Testing cases	Normal	Malicious
SQLIA	109,518	76,670	32,848	52,213	57,305

The dataset used in this research is shown in Figure 2.

Figure 2

Dataset before pre-processing.

The data were collected from the Kaggle website [7]. The dataset initially contained 109,518 samples with harmful and benign payloads, but some of its entries were unusable, so a filtering process was performed to delete them. The data pre-processing step cannot be applied directly to the initial dataset; filtering is required before the data can be converted into a form that ML algorithms can use.

2.2 Data pre-processing

In the second stage, the dataset is pre-processed and prepared to be used by the learning techniques. Data preparation aims to reduce data volume; create connections between datasets; standardize data; and eliminate outliers, duplicates, and missing values [8].

CountVectorizer is a tool used to pre-process the dataset by transforming textual data into numerical vectors. For instance, the terms in articles can serve as characteristics of a specific class, and a single vector can represent all the terms of a text. This process is known as vectorization. Common text vectorization techniques include CountVectorizer and TF-IDFVectorizer, both of which convert textual data into vector format [9].

CountVectorizer is frequently used to derive numerical properties from texts and generate class features. Only words that occur frequently in the training text are taken into account. Through its fit-and-transform step, CountVectorizer converts the text into a word occurrence matrix, enabling users to calculate the frequency of each word [10].

Algorithm 1: CountVectorizer for data pre-processing
Input: Dataset prior to initial processing
Output: Array of word indicators
Begin:
Stage 1: Transform text into a collection of words by using CountVectorizer.
Stage 2: Eliminate the most frequently used terms.
Stage 3: Remove the least frequently used terms.
Stage 4: Eliminate all stop words.
Stage 5: Convert every word into lowercase letters.
Stage 6: Arrange the vocabulary in ascending order. If a term is present in the text, then it is indicated by a 1; if it is absent, then it is indicated by a 0.
Stage 7: Repeat stages 1–6 for every text to convert the text dataset into numbers.
End

The example below applies Algorithm 1 to convert two text samples into numerical arrays.

Sample 1 “Convert text to an array using CountVectorizer using text datasets”

Sample 2 “Convert text to an array using” (Table 2)

Table 2

Dataset after pre-processing

Sample	Array	Convert	CountVectorizer	Datasets	Text	Using
Sample 1	1	1	1	1	1	1
Sample 2	1	1	0	0	1	1
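The two samples and Table 2 can be reproduced with a small presence-based vectorizer. This is a plain-Python sketch rather than the CountVectorizer implementation itself; the stop-word list is a hypothetical minimal one chosen so that "to" and "an" are dropped, and the output is 0/1 presence (not counts) because Table 2 records a 1 for words that occur twice in Sample 1.

```python
import re

# Hypothetical minimal stop-word list, just enough for the two sample sentences.
STOP_WORDS = {"to", "an"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def binary_vectorize(samples):
    """Return (sorted vocabulary, one 0/1 presence vector per sample)."""
    token_sets = [set(tokenize(s)) for s in samples]
    vocabulary = sorted(set().union(*token_sets))
    vectors = [[1 if term in tokens else 0 for term in vocabulary]
               for tokens in token_sets]
    return vocabulary, vectors


samples = [
    "Convert text to an array using CountVectorizer using text datasets",
    "Convert text to an array using",
]
vocabulary, vectors = binary_vectorize(samples)
print(vocabulary)
for row in vectors:
    print(row)
```

The vocabulary comes out in ascending order, matching the column order of Table 2, and each row matches the corresponding sample row.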

2.3 Training and testing

The third stage in developing an ML approach is to divide the data into two distinct subsets: a training set and a testing set. The holdout method was used in this investigation, with 80% of the dataset used for training and 20% for testing and evaluation [11]. The dataset is approximately balanced, containing 45,051 benign payloads and 40,923 malicious payloads out of a total of 85,974.
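A holdout split can be sketched as a seeded shuffle followed by a single cut. The function name, the seed, and the stand-in rows are assumptions for illustration. Table 7 reports 68,604 training and 17,370 test payloads, which is close to but not exactly 80/20, so the exact cut below only approximates whatever splitting routine was actually used.

```python
import random


def holdout_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and cut it into (train, test) at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in rows: (payload_id, label) with 1 = malicious and 0 = benign.
rows = [(k, 1 if k % 2 else 0) for k in range(1000)]
train_rows, test_rows = holdout_split(rows, train_fraction=0.8)
print(len(train_rows), len(test_rows))
```

Passing `train_fraction=0.7` gives the division used in the first experiment.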

2.4 Prediction approach

The fourth stage in developing an ML approach is selecting a classification approach for SQL requests sent to web-based databases. This study uses a supervised ML approach. This method categorizes requests into two categories (0 and 1), which represent harmless and harmful requests. In addition, this technique aims to create a classification that accurately describes the relationship between dependent and independent variables.

The LR method builds on the linear regression hypothesis given in the following equation:

(1) $j = h_\theta(i) = \theta^{T} i$.

Equation (1) is not well suited to binary outcomes. By using equation (2), we can instead estimate the probability that the submitted request carries a harmful payload (class 1) or a harmless payload (class 0) [12].

$p(j = 1 \mid i) = h_\theta(i) = \frac{1}{1 + \exp(-\theta^{T} i)} = \sigma(\theta^{T} i)$,

(2) $p(j = 0 \mid i) = 1 - p(j = 1 \mid i) = 1 - h_\theta(i)$.

Equation (3), sometimes referred to as the sigmoid function, maps the value of $\theta^{T} i$ into the range [0, 1]. Then, we look for a value of $\theta$ such that $p(j = 1 \mid i) = h_\theta(i)$ is large when $i$ belongs to the “1” class and small when $i$ belongs to the “0” class, i.e., $p(j = 0 \mid i)$ is large [13, 14, 15].

(3) $\sigma(t) = \frac{1}{1 + e^{-t}}$.
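Equations (1)–(3) can be checked numerically with a short sketch. The weight vector and the feature vector below are hypothetical values chosen only to show the computation; they are not parameters learned in this study.

```python
import math


def sigmoid(t):
    """Equation (3): map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def class_probabilities(theta, i):
    """Equation (2): return (p(j=1|i), p(j=0|i)) for weights theta and features i."""
    score = sum(w * x for w, x in zip(theta, i))  # theta^T i from equation (1)
    p_harmful = sigmoid(score)
    return p_harmful, 1.0 - p_harmful


# Hypothetical weights and a hypothetical 0/1 feature vector.
theta = [1.5, -2.0, 0.5]
i = [1, 0, 1]
p_harmful, p_benign = class_probabilities(theta, i)
print(round(p_harmful, 4), round(p_benign, 4))
```

Here the linear score is 2.0, so the sigmoid places the request well above the 0.5 threshold and it would be assigned to class 1.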

The LR algorithm was chosen to train and test the model because of its highly accurate results and the short time it takes to classify benign and harmful payloads.

The maxiteration variable was used as a stopping criterion for training the model. A value of maxiteration = 100 was chosen because it gave the best accuracy and the shortest time.

Two variables were used to build the model. The first variable, “features,” represents the condition feature and is derived from the harmful and benign payloads. We pre-process the payloads using CountVectorizer to extract the desired features after removing the least and most frequent words, which do not affect the model training results. The “Data pre-processing” subsection (Section 2.2) provides an example of how the desired features are captured. The features obtained from pre-processing are used as inputs in model training.

The second variable “label_col” represents the decision feature.
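How an iteration cap stops training can be sketched with plain-Python batch gradient descent. This is not the Spark ML solver used in the study; the toy vocabulary, the learning rate, and the function names are assumptions for illustration. Only the names `features` and `label_col` and the cap of 100 iterations come from the text.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def train_logistic_regression(features, label_col, max_iteration=100, learning_rate=0.5):
    """Batch gradient descent that stops after max_iteration passes over the data."""
    n_features = len(features[0])
    m = len(features)
    theta = [0.0] * n_features
    bias = 0.0
    for _ in range(max_iteration):
        grad_theta = [0.0] * n_features
        grad_bias = 0.0
        for x, y in zip(features, label_col):
            # Prediction error for this row under the current weights.
            error = sigmoid(bias + sum(w * v for w, v in zip(theta, x))) - y
            for k, v in enumerate(x):
                grad_theta[k] += error * v
            grad_bias += error
        theta = [w - learning_rate * g / m for w, g in zip(theta, grad_theta)]
        bias -= learning_rate * grad_bias / m
    return theta, bias


def predict(theta, bias, x):
    """Return 1 (harmful) when the estimated probability is at least 0.5."""
    return 1 if sigmoid(bias + sum(w * v for w, v in zip(theta, x))) >= 0.5 else 0


# Toy presence vectors over a hypothetical vocabulary ["select", "union", "user"].
features = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
label_col = [1, 1, 0, 0]  # 1 = harmful, 0 = benign

theta, bias = train_logistic_regression(features, label_col, max_iteration=100)
predictions = [predict(theta, bias, x) for x in features]
print(predictions)
```

In this toy set the word "union" appears only in harmful rows and "user" only in benign rows, so their weights move in opposite directions, while "select" appears in both classes and stays near zero.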

2.5 Performance evaluation measures of prediction approach

During the final stage of developing the prediction approach, various metrics such as accuracy, time, precision, and recall were used to evaluate the approach and determine the outcomes.

A confusion matrix was used to calculate these measurements. Table 3 shows the commonly used layout of the confusion matrix, which consists of four classes, namely, false positive (FP), false negative (FN), true negative (TN), and true positive (TP).

Table 3

Confusion matrix

		Predicted class
		Class X	Class Y
True class	Class X	TN	FP
True class	Class Y	FN	TP

TP: This term refers to malicious payloads that the prediction approach correctly identified as malicious.

FN: This term refers to instances in which the prediction approach categorized a harmful case as benign.

FP: This term refers to instances in which the prediction approach categorized a benign case as harmful.

TN: This term refers to instances that were identified as benign by the prediction approach and are, in fact, benign [15,16,17].
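The four cells can be tallied directly from paired true and predicted labels. The sketch below uses the standard convention with malicious as the positive class (1) and benign as the negative class (0); the label lists are made-up examples, not results from this study.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count TP, FP, FN, TN with 1 = malicious (positive) and 0 = benign (negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(true_labels, predicted_labels):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1  # malicious, flagged as malicious
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1  # benign, wrongly flagged as malicious
        elif actual == 1 and predicted == 0:
            counts["FN"] += 1  # malicious, wrongly passed as benign
        else:
            counts["TN"] += 1  # benign, passed as benign
    return counts


true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 1, 0]
counts = confusion_counts(true_labels, predicted_labels)
print(counts)
```

For an injection filter the FN cell is the costly one, because it counts harmful payloads that would be forwarded to the data layer.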

The following equations represent the metrics used to assess the approach and determine its performance efficacy.

Accuracy: It represents the proportion of accurate predictions, both TP and TN, among all instances. It is mathematically expressed as follows:

(4) $\text{Accuracy} = \frac{\text{Count of accurately categorized observations } (TP + TN)}{\text{Total number of instances } (TP + TN + FP + FN)} \times 100$.

Precision: It displays the proportion of TP to the sum of TP and FP. It is mathematically expressed as follows:

(5) $\text{Precision} = \frac{\text{No. of true positives } (TP)}{\text{No. of true positives + false positives } (TP + FP)} \times 100$.

Recall: It displays the ratio of TP to the total TP and FN. It is mathematically expressed as follows:

(6) $\text{Recall} = \frac{\text{No. of true positives } (TP)}{\text{No. of true positives + false negatives } (TP + FN)} \times 100$.

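F1-score: This is the harmonic mean of precision and recall, where precision and recall are taken as fractions before the factor of 100 is applied. It is mathematically expressed as follows [18]:

(7) $F_1\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \times 100$.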
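Equations (4)–(7) translate directly into code. The counts passed in below are hypothetical round numbers chosen so the arithmetic is easy to follow; they are not the confusion matrix of this study. Precision and recall are kept as fractions until the final scaling, so the F1-score is multiplied by 100 only once.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1-score as percentages."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # equation (4)
    precision = tp / (tp + fp)                            # equation (5)
    recall = tp / (tp + fn)                               # equation (6)
    f1 = 2 * precision * recall / (precision + recall)    # equation (7)
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
    }


# Hypothetical counts for illustration only.
metrics = evaluation_metrics(tp=90, tn=80, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts the F1-score equals 2·TP / (2·TP + FP + FN) = 180/210, which is a quick way to cross-check the harmonic-mean form.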

3 Results and discussions

This section presents the results of using the LR approach when dealing with a big data environment, which can be used to determine whether the payload sent by the user contains malicious or benign payloads.

Building an approach using ML or any other system requires providing a set of basic hardware and software requirements. Tables 4 and 5 describe the basic requirements used in this study.

Table 4

Software requirement

Software requirement
System type	64-bit operating system, x64-based processor
Programming language	Python programming languages (Spyder [Anaconda3])

Table 5

Hardware requirement

Hardware requirement
Processor	Intel(R) Core (TM) i7-5500U CPU @ 2.40 GHz
Installed RAM	8 GB
Hard disk	500 GB
GPU	AMD Radeon Graphics Processor HD (8500 M)

The results were obtained from two experiments for training and testing the model. The purpose is to achieve the best classification accuracy and the shortest time for classifying the type of payload.

3.1 First experiment

The first experiment was conducted using a dataset containing 85,974 malicious and benign payloads divided into two sections. The first section consists of 45,051 benign payloads, and the second section consists of 40,923 malicious payloads. For data division, the holdout method was used, where 70% of the dataset was chosen for training, and the remaining portion was used for testing and evaluation.

3.2 Second experiment

The second experiment was conducted using a dataset containing 85,974 malicious and benign payloads divided into two sections. The first section, which represents the benign payloads, consists of 45,051 samples, while the second section, representing the malicious payloads, consists of 40,923 samples. As for the data division method, the holdout method was used, where 80% of the dataset was chosen for training, and the remaining portion was used for testing and evaluation.

Table 6 shows the results of the first experiment, in which the accuracy of the LR approach reached 98.025.

Table 6

Result of first experiment

Seq	Name of parameter	Value
1.	Time complexity	0.10 s
2.	Accuracy	98.025
3.	Precision	98.055
4.	Recall	98.025
5.	F-score	98.02
6.	Training dataset	59,938
7.	Test dataset	26,036

The results of the second experiment were chosen because they provided better accuracy and a shorter testing time.

The LR approach accurately classified the SQL queries sent to databases used by web applications. The accuracy, i.e., the share of TP and TN among all test instances, is 99.04%, which indicates that malicious and benign payloads can be discriminated with high accuracy. The detection of the query type took only 0.05 s (Table 7).

Table 7

Result of second experiment

Seq	Name of parameter	Value
1.	Time complexity	0.05 s
2.	Accuracy	99.04
3.	Precision	98.18
4.	Recall	99.89
5.	F-score	99.04
6.	Training dataset	68,604
7.	Test dataset	17,370

The following table shows the results of the comparison between previous studies and this study (Table 8).

Table 8

Result of comparison between previous studies and this study

Ref	Model	Accuracy	Time complexity	Dataset size
[19]	SVM	98.6	Not reported	181,303
[20]	Neural network of direct signal propagation	95	Not reported	30,233
[21]	Long short-term memory (LSTM)	95.2	37.1494 s	42,212
[22]	Support vector machine	94.92	3.98 s	20,474
	Gradient boosting	94.27
	Naive Bayes classifier	70.79
	REGEX classifier	97.48
[23]	Naive Bayes	95	Not reported
	LR	92
	CNN	97
	SVM	79
	Passive aggressive	79
[24]	CNN-BiLSTM	98	45 s	4,200
This study	Proposed model	99.04	0.05 s	85,974

Standard methods for detecting and preventing these attacks can obtain good results when dealing with small data. However, these methods are not optimal when used for big data. The significant feature of the proposed approach, which is implemented with Spark ML and the Spark framework, is that it can process large-scale datasets efficiently and reliably. Scalability is achieved by distributing processing tasks, thus enabling the handling of larger, more complex datasets. High performance is achieved by using memory resources and executing operations in parallel. However, a limitation of this work is that applying ML models to large datasets requires higher computational power. In addition, some of the datasets used contain instances that cannot be processed by ML algorithms, so the dataset must be filtered before it can be used by the proposed approach.

4 Conclusion

This work presented a method for detecting SQL injection attacks using the LR approach in a big data environment. The dataset contained malicious and benign SQL payloads, and the proposed approach classified user queries as containing either malicious or benign payloads. Two experiments were conducted, and their performances were compared. In the comparison with previous studies (Table 8), the proposed method achieved the highest accuracy and the shortest running time while handling a large dataset.

One of the main contributions of this work is that the proposed method prevents users from directly accessing the data and maintains the data’s confidentiality, integrity, and availability. This protection is achieved by creating a separation layer, which applies an approach trained on a large dataset to classify new payloads sent by the user, thus providing additional protection for the data layer before a request from the user layer reaches it. The second contribution is that the time required to classify the query type submitted by the user is reduced by using the Spark ML library. Spark ML works in memory in a distributed manner, thereby reducing the time required to classify the payload type. Reducing the time to classify the type of request is essential when dealing with big data because it enables timely access to the data and ensures that the data are available to users and organizations. This work provides high accuracy and takes a short time to classify requests, thereby achieving high data protection. However, the proposed approach can classify only SQL injection attacks. Future work will involve building a model that classifies more than one type of attack, such as cross-site scripting attacks or DDoS attacks, using the LSTM algorithm.

Author contributions: Omar Salah F. Shareef conceived of the presented idea. Rehab Flaih Hasan and Ammar Hatem Farhan designed and performed the experiments, derived the models, and analyzed the data. Omar Salah F. Shareef supervised the project. Ammar Hatem Farhan wrote the manuscript in consultation with Omar Salah F. Shareef and Rehab Flaih Hasan. All authors discussed the results and contributed to the final manuscript.
Conflict of interest: Authors state no conflict of interest.
Data availability statement: The data that support the findings of this study are openly available on the Kaggle website at https://www.kaggle.com/datasets/gambleryu/biggest-sql-injection-dataset?resource=download, reference number [7].

References

[1] Farhan AH, Hasan RF. Detection SQL injection attacks against web application by using K-nearest neighbors with principal component analysis. In: Proceedings of Data Analytics and Management: ICDAM 2022. Springer; 2023. p. 631–42. 10.1007/978-981-19-7615-5_52.

[2] Durai KN, Subha R, Haldorai A. A novel method to detect and prevent SQLIA using ontology to cloud web security. Wirel Pers Commun. 2021;117(4):2995–3014. 10.1007/s11277-020-07243-z.

[3] Haldorai A, Devi S, Joan R, Arulmurugan L. Big data in intelligent information systems. Mob Netw Appl. 2022;October 2021;27:997–9. 10.1007/s11036-021-01863-w.

[4] Awan MJ, Farooq U, Babar HM, Yasin A, Nobanee H, Hussain M, et al. Real-time ddos attack detection system using big data approach. Sustain. 2021;13(19):1–19. 10.3390/su131910743.

[5] Alghawazi M, Alghazzawi D, Alarifi S. Detection of SQL injection attack using machine learning techniques: A systematic literature review. J Cybersecur Priv. 2022;2(4):764–77. 10.3390/jcp2040039.

[6] Crespo-Martínez IS, Campazas-Vega A, Guerrero-Higueras ÁM, Riego-DelCastillo V, Álvarez-Aparicio C, Fernández-Llamas C. SQL injection attack detection in network flow data. Comput Secur. 2023;127:103093. 10.1016/j.cose.2023.103093.

[7] https://www.kaggle.com/datasets/gambleryu/biggest-sql-injection-dataset?resource=download.

[8] Alasadi SA, Bhaya WS. Review of data preprocessing techniques in data mining. J Eng Appl Sci. 2017;12(16):4102–7.

[9] El Rifai H, Al Qadi L, Elnagar A. Arabic text classification: the need for multi-labeling systems. Neural Comput App. 2022;34(2):1135–59. 10.1007/s00521-021-06390-z.

[10] Yang JS, Zhao CY, Yu HT, Chen HY. Use GBDT to predict the stock market. Procedia Comput Sci. 2020;174(2019):161–71. 10.1016/j.procs.2020.06.071.

[11] Rafało M. Cross validation methods: Analysis based on diagnostics of thyroid cancer metastasis. ICT Express. 2022;8(2):183–8. 10.1016/j.icte.2021.05.001.

[12] Arif ZH, Cengiz K. Severity Classification for COVID-19 Infections based on Lasso-Logistic Regression Model. Int J Mathematics, Statistics, Computer Sci. 2023;1:25–32. 10.59543/ijmscs.v1i.7715.

[13] Yassine S, Stanulov A. A comparative analysis of machine learning algorithms for the purpose of predicting Norwegian air passenger traffic. Int J Mathematics, Statistics, Computer Sci. 2023;2:28–43. 10.59543/ijmscs.v2i.7851.

[14] Zhu C, Idemudia CU, Feng W. Improved logistic regression model for diabetes prediction by integrating PCA and K-means techniques. Inform Med Unlocked. 2019;17:100179. 10.1016/j.imu.2019.100179.

[15] Shah K, Patel H, Sanghvi D, Shah M. A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augmented Hum Res. 2020;5(1):1–16. 10.1007/s41133-020-00032-0.

[16] Shaukat K, Luo S, Varadharajan V, Hameed IA, Xu M. A survey on machine learning techniques for cyber security in the last decade. IEEE Access. 2020;8:222310–54. 10.1109/ACCESS.2020.3041951.

[17] Abuhaiba ISI, Dawoud HM. Combining different approaches to improve Arabic text documents classification. Int J Intell Syst Appl. 2017;9(4):39–52. 10.5815/ijisa.2017.04.05.

[18] Alarfaj FK, Khan NA. Enhancing the performance of SQL injection attack detection through probabilistic neural networks. Appl Sci. 2023 Mar 29;13(7):4365. 10.3390/app13074365.

[19] Uwagbole SO, Buchanan WJ, Fan L. Applied machine learning predictive analytics to SQL injection attack detection and prevention. Proc. IM 2017 - 2017 IFIP/IEEE Int. Symp. Integr. Netw. Serv. Manag; 2017. p. 1087–90. 10.23919/INM.2017.7987433.

[20] Hubskyi O, Babenko T, Myrutenko L, Oksiiuk O. Detection of SQL injection attack using neural networks. Advances in Intelligent Systems and Computing. Vol. 1265 AISC. 2021. p. 277–86. 10.1007/978-3-030-58124-4_27.

[21] Tang P, Qiu W, Huang Z, Lian H, Liu G. Detection of SQL injection based on artificial neural network. Knowl-Based Syst. 2020;190:105528. 10.1016/j.knosys.2020.105528.

[22] Kranthikumar B, Velusamy RL. SQL injection detection using REGEX classifier. J Xi’an Univ Archit Technol. 2020;7(6):800–9.

[23] Joshi A, Geetha V. SQL Injection detection using machine learning. In: 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies, ICCICCT 2014; 2014. p. 1111–5. 10.1109/ICCICCT.2014.6993127.

[24] Aggarwal P, Kumar A, Michael K, Nemade J, Sharma S. Random decision forest approach for mitigating SQL injection attacks. In: 2021 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT). 2021. p. 1–5. 10.1109/CONECCT52877.2021.9622689.

Received: 2023-05-15

Revised: 2023-07-18

Accepted: 2023-07-28

Published Online: 2023-09-05

This work is licensed under the Creative Commons Attribution 4.0 International License.
