Machine learning and cyber security
Sebastian Karius, Mandy Knöchel
Abstract
Cyber security has gained considerable perceived importance in discussions of the risks and challenges that lie ahead in information technology. A recent increase in high-profile incidents involving various forms of cybercrime has raised awareness of threats that were formerly often hidden from public perception, e.g., openly conducted attacks against critical infrastructure that accompany traditional forms of warfare and extend it into cyberspace. Added to that is the very personal experience of everyday social engineering attacks, which are cast out like a fishing net on a large scale to catch anyone not careful enough to double-check a suspicious email. But as the threat level rises and the attacks become ever more sophisticated, so do the methods to mitigate (or at least recognize) them. Of central importance here are methods from the field of machine learning (ML). This article provides a comprehensive overview of applied ML methods in cyber security, illustrates the importance of ML for cyber security, and discusses issues and methods for generating good datasets for the training phase of ML methods used in cyber security. This includes our own work on network traffic classification, the collection of real-world attacks using honeypot systems, and the use of ML to generate artificial network traffic.
About the authors

M.Sc. Sebastian Karius studied computer science at the Martin Luther University Halle-Wittenberg, where he is a doctoral student and research assistant at the Institute of Computer Science, conducting research in the field of network security. He also works as a research assistant at the Harz University of Applied Sciences, where he focuses on eID applications. Besides network security, his research interests include web security, machine learning and computer networks.

M.Sc. Mandy Knöchel studied Computer Science at the Martin Luther University Halle-Wittenberg, Germany. She is a PhD student and research assistant at the Institute for Computer Science at the Martin Luther University Halle-Wittenberg. Her research interests include vulnerability analysis, web and network security, machine learning and cybersecurity education.

Sascha Heße has a master’s degree in Computer Science. He graduated from Martin Luther University Halle-Wittenberg, where he has since been working as a research assistant on various projects. His research interests include all forms of machine learning (especially handwritten text recognition), security aspects of IoT devices, network traffic analysis, and anything related to Digital Humanities.

M.Sc. Tim Reiprich studied computer science at Martin Luther University Halle-Wittenberg. Afterwards, he worked there as a research assistant on an IT security project. Since this year, he has been working at Otto-von-Guericke-University Magdeburg as a research assistant, where he is engaged in the forensic investigation of websites. His research interests are networks, IT security and machine learning.
- Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
- Conflict of interest: The authors declare no conflicts of interest regarding this article.
- Research funding: This work was supported by the European Regional Development Fund (ERDF) and by the federal state of Saxony-Anhalt within the project Embedded-System-Security and Cryptography - Cyber-Sec-Verbund-LSA.
© 2023 Walter de Gruyter GmbH, Berlin/Boston
Articles in the same Issue
- Frontmatter
- Editorial
- Machine learning applications
- Contributions to a thematic issue
- Machine learning and cyber security
- Artificial intelligence for molecular communication
- Machine learning in run-time control of multicore processor systems
- Machine learning in sensor identification for industrial systems
- Wildfire prediction for California using and comparing Spatio-Temporal Knowledge Graphs
- Machine learning in computational literary studies
- Machine learning in AI Factories – five theses for developing, managing and maintaining data-driven artificial intelligence at large scale