Page 142 | Corizzo et al. J Surveill Secur Saf 2020;1:140-50 | http://dx.doi.org/10.20517/jsss.2020.15












               Figure 1. Analytical workflow for machine learning-based intrusion detection in network traffic. Network sessions in the form of
               sequences of system calls are fed to a feature extraction method, which returns vector data that can be exploited in the modeling step
               by machine learning and deep learning algorithms. The outcome is a predicted class for each session (normal or attack).


               One example of a pattern-based approach is the N-gram feature extraction method [7], which generates
               pattern data, converting each class into a two-dimensional array (i.e., a matrix) representation. In this
               representation, columns are grams, i.e., attributes, and rows are instances, i.e., traces. The entries in the
               matrix are the number of occurrences of each N-gram in the traces. Because the number of grams
               for any class is very high compared to the number of instances, it is common to reduce the number of
               attributes by retaining only the most frequent grams.
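
The matrix construction described above can be illustrated with a minimal sketch. The function names (`ngrams`, `ngram_matrix`), the integer encoding of system calls, and the `top_k` cut-off are assumptions made for illustration; they are not taken from the cited method.

```python
from collections import Counter


def ngrams(trace, n):
    """Return all contiguous n-grams (as tuples) of a system-call trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def ngram_matrix(traces, n=2, top_k=3):
    """Build a count matrix: one row per trace, one column per retained n-gram.

    Only the top_k most frequent n-grams over all traces are kept as columns
    (hypothetical cut-off; ties are broken by the gram itself for determinism).
    """
    per_trace = [Counter(ngrams(t, n)) for t in traces]
    total = Counter()
    for counts in per_trace:
        total.update(counts)
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [gram for gram, _ in ranked[:top_k]]
    # A Counter returns 0 for grams that do not occur in a trace.
    matrix = [[counts[gram] for gram in vocab] for counts in per_trace]
    return vocab, matrix


# Toy traces: each integer stands for a system-call identifier (made-up data).
traces = [[3, 5, 3, 5, 6], [3, 5, 6, 6], [6, 6, 6]]
vocab, matrix = ngram_matrix(traces, n=2, top_k=3)
print(vocab)
print(matrix)
```

With these toy traces, four distinct bigrams occur and the least frequent one is dropped, which mirrors the attribute reduction mentioned in the text.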

               Focusing on frequency-based approaches, the Subsequence Vector method [5] transforms a trace into a
               vector, where entries are calculated as the product between the system call and its frequency in the trace.
               The limitations of this approach lie in the generation of sparse vector representations and in the
               independent treatment of each system call. Another similar method is known as Bag of System Calls [22],
               which enumerates all system calls and transforms system traces into fixed-length vectors that contain the
               frequency of each system call. One alternative that exploits frequency vectors is to apply weighting schemes
               to the observed frequencies. This type of approach is followed in the study by Xie et al. [5], which proposes
               the application of Term Frequency-Inverse Document Frequency (TF-IDF) to extract normalized
               frequency vectors. Another alternative consists of performing dimensionality reduction to obtain a more
               compact vector representation that does not present sparsity issues. One example of this type of approach
               can be found in the study by Xie et al. [6], which proposes the application of principal component analysis
               to frequency vectors.
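
As a rough illustration of the frequency-based representations discussed above, the following sketch builds Bag of System Calls vectors and then applies a TF-IDF weighting. The helper names and the toy data are hypothetical, and the weighting shown (relative term frequency multiplied by log(N / document frequency)) is only one common TF-IDF variant, assumed here; the cited studies may use a different formulation.

```python
import math
from collections import Counter


def bag_of_system_calls(traces, vocabulary):
    """Map each trace to a fixed-length vector of system-call frequencies."""
    vectors = []
    for trace in traces:
        counts = Counter(trace)
        vectors.append([counts[call] for call in vocabulary])
    return vectors


def tf_idf(freq_vectors):
    """Weight frequency vectors with an assumed TF-IDF variant.

    tf  = count / trace length
    idf = log(N / df), set to 0 for calls that never occur
    """
    n_docs = len(freq_vectors)
    n_terms = len(freq_vectors[0])
    df = [sum(1 for v in freq_vectors if v[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    weighted = []
    for v in freq_vectors:
        length = sum(v) or 1  # guard against empty traces
        weighted.append([(x / length) * idf[j] for j, x in enumerate(v)])
    return weighted


# Toy data: integers stand for system-call identifiers (made-up values).
traces = [[3, 5, 3, 5, 6], [3, 5, 6, 6], [6, 6, 6]]
vocabulary = [3, 5, 6, 7]  # call 7 never occurs, giving an all-zero column

bosc = bag_of_system_calls(traces, vocabulary)
weighted = tf_idf(bosc)
print(bosc)
print(weighted)
```

The all-zero column for the unused call hints at the sparsity issue noted in the text: with a realistic vocabulary of several hundred system calls, most entries of each vector are zero. In this variant, a call that appears in every trace receives an IDF of zero and therefore contributes nothing to the weighted vectors.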

               However, one major challenge in feature extraction is to represent the contextual information of system
               calls in traces effectively. Contextual information in sequential data with a complex structure is often
               hidden and difficult to extract [23,24], especially for pattern-based and frequency-based approaches that do
               not take into account the temporal dynamics of system calls in traces.
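
A small example may make this limitation concrete. In the sketch below, the two traces and the helper names are invented for illustration: the traces execute their system calls in a different order, yet they receive identical frequency vectors and identical bigram counts, so neither representation can tell them apart.

```python
from collections import Counter

# Two hypothetical traces with the same calls in a different order.
trace_a = ["read", "write", "read", "close", "read"]
trace_b = ["read", "close", "read", "write", "read"]


def frequency_vector(trace, vocabulary):
    """Frequency-based view: count of each system call, order ignored."""
    counts = Counter(trace)
    return [counts[call] for call in vocabulary]


def bigram_counts(trace):
    """Pattern-based view: counts of adjacent call pairs (2-grams)."""
    return Counter(zip(trace, trace[1:]))


vocabulary = sorted(set(trace_a) | set(trace_b))

same_frequencies = (
    frequency_vector(trace_a, vocabulary) == frequency_vector(trace_b, vocabulary)
)
same_bigrams = bigram_counts(trace_a) == bigram_counts(trace_b)

print(trace_a != trace_b, same_frequencies, same_bigrams)
```

N-grams do capture local ordering, and longer grams would separate this particular pair, but the count matrix still discards where in the trace each pattern occurs.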


               In this paper, we propose a new feature extraction method for sequential network traffic data in the form
               of sequences of system calls. Following the success of state-of-the-art feature extraction methods inspired by
               Natural Language Processing (NLP), our method leverages a word embedding-based approach to extract
               contextual information that can be exploited in the subsequent classification step by any machine learning
               algorithm. To the best of our knowledge, this is the first study that presents feature extraction based on
               word embedding models and, in particular, presents a combination approach with TF-IDF and Word2Vec
               models. Moreover, in our study, we also investigated feature extraction based on Doc2Vec. We performed
               experiments to evaluate the effectiveness of different machine learning classifiers with our extracted
               features, and compared them with state-of-the-art feature extraction methods in several
               scenarios.


               METHODS
               In this section, we provide a brief overview of word embedding models and some examples of their
               successful application. Subsequently, we describe our proposed feature extraction method for intrusion
               detection in network traffic, based on word embedding models.