Page 43 - Read Online
P. 43

Page 144                                                  Corizzo et al. J Surveill Secur Saf 2020;1:140-50  I  http://dx.doi.org/10.20517/jsss.2020.15






























               Figure 2. Graphical representation of feature extraction based on a Word2Vec model

                                      ,

               where      is the frequency of the system call   in the trace  , and    is the frequency of the system
               call   in the entire collection. To extract a single vector representation for each trace, we exploit the “additive
               compositionality” property of word embeddings. This property guarantees that similar words appear close
               to each other in the feature space, and that the sum of their embedding vector representation resembles an
               AND concatenation. By analogy, in our domain, if two traces (   appear in the same context, their sum
               vectors obtained as the sum of the embedding vectors of the corresponding system calls will still be close to
               each other. Therefore, the final vector representation   of a trace   is computed as:


                                     .

               Following this process, we obtain a new dataset   , consisting of the semantic vector representation
                  for each labeled trace  in  . This dataset can be used to train any machine learning algorithm.

               Consequently, during the prediction phase, the previously trained Word2Vec and TF-IDF models are
               exploited to extract features for a new collection of unlabeled traces  . The machine learning algorithm of
               choice can exploit the extracted representation to predict the class attribute of each trace   . The overall
               feature extraction process with Word2Vec and TF-IDF is shown in Figure 3.


               In summary, the Word2Vec and TF-IDF models are trained with a collection of labeled traces  ,
               represented as a bag of system calls (Step 1). The outputs of these models are combined to extract a
               new representation (    from both labeled and unlabeled traces   . The representation extracted
               by Word2Vec is a vector for each system call. Simultaneously, the TF-IDF model extracts the weight
               corresponding to each system call (Step 2). The multiple vectors that represent the different system calls in
               a trace are subsequently calculated as the weighted sum of the system calls vector representations extracted
               by Word2Vec and the TF-IDF weights (Step 3). A machine learning model is trained on labeled traces after
               feature extraction   and predicts the class of unlabeled traces  after the feature extraction process.


               In the following section, we present our experiments aimed at comparing the classification accuracy with
               our proposed feature extraction technique in comparison with state-of-the-art feature extraction methods.
   38   39   40   41   42   43   44   45   46   47   48