Corizzo et al. J Surveill Secur Saf 2020;1:140-50
DOI: 10.20517/jsss.2020.15
Journal of Surveillance, Security and Safety

Original Article Open Access

Feature extraction based on word embedding
models for intrusion detection in network traffic

Roberto Corizzo, Eftim Zdravevski, Myles Russell, Andrew Vagliano, Nathalie Japkowicz
1 Department of Computer Science, American University, Washington, DC 20016, USA.
2 Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje 1000, North Macedonia.
3 Department of Computer Science, Northwestern University, Evanston, IL 60208, USA.
Correspondence to: Dr. Roberto Corizzo, Department of Computer Science, American University, 4400 Massachusetts Avenue
NW, Washington, DC 20016, USA. E-mail: rcorizzo@american.edu
How to cite this article: Corizzo R, Zdravevski E, Russell M, Vagliano A, Japkowicz N. Feature extraction based on word embedding
models for intrusion detection in network traffic. J Surveill Secur Saf 2020;1:140-50. http://dx.doi.org/10.20517/jsss.2020.15
Received: 30 Apr 2020; First Decision: 15 Jun 2020; Revised: 27 Jun 2020; Accepted: 17 Jul 2020; Available online: 28 Dec 2020

Academic Editor: Xiaofeng Chen; Copy Editor: Cai-Hong Wang; Production Editor: Jing Yu

Abstract
Aim: The analysis of network traffic plays a crucial role in modern organizations since it can provide defense
mechanisms against cyberattacks. In this context, machine learning algorithms can be fruitfully adopted to identify
malicious patterns in network sessions. However, they cannot be directly applied to a raw data representation
of network traffic. An active thread of research focuses on the design and implementation of feature extraction
techniques that aim at mapping raw data representations of network traffic sessions to a new representation that
can be processed by machine learning algorithms.

Methods: In this paper, we propose a feature extraction approach based on word embedding models. The
proposed approach extracts semantic features characterized by contextual information that is hidden in the raw
data representation.

Results: Our experiments on three datasets showed that the proposed feature extraction approach has the
potential to increase the classification performance of conventional machine learning algorithms applied to
intrusion detection, and that it is competitive with state-of-the-art feature extraction baselines.

Conclusion: This study shows that word embedding models can be used to carry out intrusion detection tasks
accurately. Feature extraction based on word embedding models requires more computation time than simpler
techniques, but it yields higher accuracy, which is important for the identification of complex attacks.
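
As an illustration of the general idea only, the following minimal Python sketch treats each network session as a "sentence" of tokens, learns a vector for every token, and averages the token vectors into one fixed-length feature vector per session. The abstract does not specify the tokenization or the embedding model, so everything here is an assumption: the field names (`proto`, `service`, `flag`), the toy sessions, and the helper functions are hypothetical, and a simple count-based co-occurrence embedding stands in for a trained word embedding model.

```python
from math import sqrt


def tokenize(session):
    # Hypothetical tokenization: each "field=value" pair becomes one word.
    return [f"{key}={value}" for key, value in session.items()]


def train_cooccurrence_embeddings(sentences):
    """Count-based stand-in for a word embedding model (an assumption).

    Each token is represented by its L2-normalized co-occurrence counts
    with every token of the vocabulary, counted within a session.
    """
    vocab = sorted({tok for sent in sentences for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = {tok: [0.0] * len(vocab) for tok in vocab}
    for sent in sentences:
        for a in sent:
            for b in sent:
                if a != b:
                    counts[a][index[b]] += 1.0
    embeddings = {}
    for tok, row in counts.items():
        norm = sqrt(sum(x * x for x in row)) or 1.0  # avoid division by zero
        embeddings[tok] = [x / norm for x in row]
    return embeddings


def session_features(session, embeddings):
    """Average the embeddings of known tokens into one feature vector."""
    dim = len(next(iter(embeddings.values())))
    vectors = [embeddings[t] for t in tokenize(session) if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy sessions with hypothetical categorical fields.
sessions = [
    {"proto": "tcp", "service": "http", "flag": "SF"},
    {"proto": "tcp", "service": "http", "flag": "S0"},
    {"proto": "udp", "service": "dns", "flag": "SF"},
]
sentences = [tokenize(s) for s in sessions]
embeddings = train_cooccurrence_embeddings(sentences)
features = [session_features(s, embeddings) for s in sessions]
```

The resulting `features` rows have the same length for every session, so they could be passed to a conventional classifier. In practice the co-occurrence step would presumably be replaced by a trained embedding model, which is where the additional computation time mentioned in the conclusion would arise.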

© The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0
International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use,
sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license,
and indicate if changes were made.

www.jsssjournal.com
