Page 142 | Corizzo et al. J Surveill Secur Saf 2020;1:140-50 | http://dx.doi.org/10.20517/jsss.2020.15

Figure 1. Analytical workflow for machine learning-based intrusion detection in network traffic. Network sessions in the form of
sequences of system calls are fed to a feature extraction method, which returns vector data that can be exploited in the modeling step
by machine learning and deep learning algorithms. The outcome is a class returned for each session (normal or attack).

One example of a pattern-based approach is the N-gram feature extraction method [7], which generates
pattern data, converting each class into a two-dimensional array (matrix) representation. In this
representation, columns are grams, i.e., attributes, and rows are instances, i.e., traces. The entries in the
matrix are the number of occurrences of each N-gram in the traces. Because the number of grams
for any class is very high compared to the number of instances, it is common to reduce
the number of attributes by retaining only the most frequent grams.
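
The N-gram representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the cited work: the function names `ngrams` and `ngram_matrix`, the `top_k` cut-off, and the toy traces are all assumptions introduced here.

```python
from collections import Counter


def ngrams(trace, n):
    """Return the contiguous n-grams (as tuples) found in one trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def ngram_matrix(traces, n=2, top_k=3):
    """Build a count matrix: one row per trace, one column per retained n-gram.

    Only the top_k most frequent n-grams over all traces are kept as columns
    (top_k is a hypothetical parameter chosen for this sketch).
    """
    per_trace = [Counter(ngrams(t, n)) for t in traces]
    total = Counter()
    for counts in per_trace:
        total.update(counts)
    # Sort by descending frequency; ties are broken by the gram itself
    # so that the column order is deterministic.
    vocab = sorted(total, key=lambda g: (-total[g], g))[:top_k]
    matrix = [[counts[g] for g in vocab] for counts in per_trace]
    return vocab, matrix


# Toy traces of system-call identifiers (made-up values).
traces = [
    [3, 5, 3, 5, 6],
    [3, 5, 6, 6],
    [4, 3, 5],
]
vocab, matrix = ngram_matrix(traces, n=2, top_k=2)
print(vocab)   # retained bigrams (columns)
print(matrix)  # occurrence counts (rows are traces)
```

With real data, the number of distinct grams grows quickly with `n`, which is why a frequency cut-off of this kind is commonly applied.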

Focusing on frequency-based approaches, the Subsequence Vector method [5] transforms a trace into a
vector, where entries are calculated as the product between the system call and its frequency in the trace.
The limitations of this approach are that it generates sparse vector representations and treats
each system call independently. A similar method is known as Bag of System Calls [22],
which enumerates all system calls and transforms system traces into fixed-length vectors that contain the
frequencies of each system call. One alternative for exploiting frequency vectors is to apply weighting schemes
to the observed frequencies. This type of approach is followed in the study by Xie et al. [5], which proposes
the application of Term Frequency-Inverse Document Frequency (TF-IDF) to extract normalized
frequency vectors. Another alternative is to perform dimensionality reduction to obtain a more
compact vector representation that does not suffer from sparsity. One example of this type of approach
can be found in the study by Xie et al. [6], which proposes the application of principal component analysis
to frequency vectors.
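
To make the frequency-based representations concrete, the following sketch builds Bag of System Calls vectors and then reweights them with TF-IDF. It is an illustration under stated assumptions: the function names are hypothetical, the toy traces are made up, and the plain `ln(N / df)` form of the inverse document frequency is only one of several common variants, not necessarily the one used in the cited studies.

```python
import math
from collections import Counter


def bag_of_system_calls(traces):
    """Map each trace to a fixed-length vector of per-call frequencies."""
    # Enumerate every system call observed in any trace.
    vocab = sorted({call for t in traces for call in t})
    vectors = []
    for t in traces:
        counts = Counter(t)
        vectors.append([counts[call] for call in vocab])
    return vocab, vectors


def tfidf(vectors):
    """Reweight count vectors: tf = count / trace length, idf = ln(N / df)."""
    n_traces = len(vectors)
    n_calls = len(vectors[0])
    # Document frequency: number of traces in which each call occurs.
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(n_calls)]
    weighted = []
    for v in vectors:
        length = sum(v)
        weighted.append([
            (v[j] / length) * math.log(n_traces / df[j]) if v[j] else 0.0
            for j in range(n_calls)
        ])
    return weighted


# Toy traces of system-call identifiers (made-up values).
traces = [
    [3, 5, 3, 5, 6],
    [3, 5, 6, 6],
    [4, 3, 5],
]
vocab, vectors = bag_of_system_calls(traces)
weighted = tfidf(vectors)
```

In this toy example, calls 3 and 5 occur in every trace, so their TF-IDF weight is zero, while call 4, which occurs in a single trace, receives the largest inverse document frequency.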

However, one major challenge in feature extraction is to represent the contextual information of system
calls in traces effectively. Contextual information in sequential data with a complex structure is often
hidden and difficult to extract [23,24], especially for pattern-based and frequency-based approaches that do
not take into account the temporal dynamics of system calls in traces.

In this paper, we propose a new feature extraction method for sequential network traffic data in the form
of sequences of system calls. Following the success of state-of-the-art feature extraction methods inspired by
Natural Language Processing (NLP), our method leverages a word embedding-based approach to extract
contextual information that can be exploited in the subsequent classification step by any machine learning
algorithm. To the best of our knowledge, this is the first study that presents feature extraction based on
word embedding models and, in particular, presents a combination approach with TF-IDF and Word2Vec
models. Moreover, in our study, we also investigated feature extraction based on Doc2Vec. We performed
experiments to evaluate the effectiveness of different machine learning classifiers with our extracted
features, and compared them with different state-of-the-art feature extraction methods in a number of
different scenarios.
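
As a rough illustration of how TF-IDF weights and word embeddings can be combined, the sketch below computes a TF-IDF-weighted average of per-call embedding vectors for each trace. This is one common combination scheme and is an assumption made here; the formulation actually used in this paper is given in its Methods section and may differ. The embedding values are toy numbers standing in for the output of a trained Word2Vec model, the function names are hypothetical, and the `+ 1.0` smoothing of the inverse document frequency is an additional assumption that keeps ubiquitous calls from being discarded entirely.

```python
import math
from collections import Counter


def idf_weights(traces):
    """Smoothed inverse document frequency per call: ln(N / df) + 1."""
    n = len(traces)
    df = Counter(call for t in traces for call in set(t))
    return {call: math.log(n / df[call]) + 1.0 for call in df}


def trace_embedding(trace, embeddings, idf):
    """Return the TF-IDF-weighted average of the embeddings of a trace's calls."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for call, count in Counter(trace).items():
        if call not in embeddings or call not in idf:
            continue  # skip calls that were not seen when fitting
        w = (count / len(trace)) * idf[call]
        for k in range(dim):
            total[k] += w * embeddings[call][k]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known calls: fall back to the zero vector
    return [x / weight_sum for x in total]


# Toy traces and toy 2-dimensional embeddings (made-up values; in practice
# the vectors would come from a Word2Vec model trained on the traces).
traces = [
    [3, 5, 3, 5, 6],
    [3, 5, 6, 6],
    [4, 3, 5],
]
embeddings = {3: [1.0, 0.0], 4: [-1.0, 0.0], 5: [0.0, 1.0], 6: [1.0, 1.0]}

idf = idf_weights(traces)
features = [trace_embedding(t, embeddings, idf) for t in traces]
```

Each trace is mapped to a dense vector whose length equals the embedding dimension, regardless of trace length, so the result can be passed to any standard classifier.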

METHODS
In this section, we provide a brief overview of word embedding models and some examples of their
successful application. Subsequently, we describe our proposed feature extraction method for intrusion
detection in network traffic, based on word embedding models.
