Word embedding models are commonly adopted techniques for language modeling and feature learning in NLP. These techniques map words and sentences into low-dimensional feature vectors that can be exploited by automated analytical tools. Examples of word embedding techniques include neural networks [25], probabilistic models [26], and approaches based on dimensionality reduction applied to a word co-occurrence matrix [27].
Some word embedding techniques aim at extracting a vector representation for a word in terms of co-occurring words, whereas others express a word in terms of a vector of linguistic contexts [28]. Recently, particular interest has been devoted to the latter, since such techniques attempt to characterize the semantics of words and sentences on the basis of the intuition that a word is characterized by the company it keeps [29,30].

One example of a groundbreaking technique in this field is represented by Word2Vec [25]. Its ability to represent implicit relationships between words has resulted in substantial machine learning improvements in domains characterized by contextual information. Some examples include the classification of news articles and tweets [31], the analysis of biological data for the prediction of therapeutic peptides [32], the detection of malware activity on Android devices [33], and the recommendation of contents in social networks [34].
Similarly to these studies, the method proposed in this paper leverages Word2Vec to extract word embeddings. However, none of these approaches applies Word2Vec to network traffic sessions in the form of sequences of system calls. Our aim was to propose a pipeline that makes Word2Vec applicable to data in this domain. In addition, we proposed an approach to weight each extracted feature according to its importance.


The common result obtained in [31,33,34] is that performing the learning task on top of the newly extracted data representation, obtained by means of word embedding models, leads to improved accuracy. The motivation is that the newly extracted representation presents useful semantic features that were hidden in the initial raw data representation, thus facilitating classification by machine learning tools. Following the same intuition, and motivated by the success in different domains, our proposed method leverages a Word2Vec word embedding model to extract contextual information that can be exploited in the subsequent classification step by any machine learning algorithm. In particular, we exploit Word2Vec to obtain a d-dimensional numerical embedding vector that entails the semantic representation of a system call. Given a set of labeled traces T, for which the class attribute is known (normal or attack), we train a Word2Vec model to generate semantic vectors for all traces t ∈ T. The feature extraction process from network traces exploiting a Word2Vec model is shown in Figure 2. One alternative to Word2Vec is represented by Doc2Vec, which extracts a unique representation for each document.
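
As a concrete illustration of this extraction step, the following minimal sketch trains a Word2Vec model on toy system-call traces using the gensim library. The trace contents, hyperparameters, and variable names are illustrative assumptions, not the configuration used in the paper.

    # Sketch: learning d-dimensional system-call embeddings with Word2Vec
    # (gensim assumed; traces and hyperparameters are toy examples).
    from gensim.models import Word2Vec

    # Each trace is treated as a "sentence" whose tokens are system call names.
    traces = [
        ["open", "read", "mmap", "close"],
        ["open", "write", "close"],
        ["socket", "connect", "send", "recv", "close"],
    ]

    # Train a skip-gram model; vector_size is the embedding dimension d.
    model = Word2Vec(sentences=traces, vector_size=16, window=3,
                     min_count=1, sg=1, epochs=50, seed=42)

    # Each system call now maps to a d-dimensional semantic vector.
    print(model.wv["open"].shape)  # (16,)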

The novelty in this paper is to exploit Word2Vec in combination with a TF-IDF model [35]. More specifically, a TF-IDF model is trained to subsequently perform a weighted transformation of the semantic representation of a system call extracted by Word2Vec. The rationale for the adoption of such a model is that the representation vector of a trace should be weighted according to the saliency of the system calls it contains. More precisely, system calls that appear in several traces are less indicative of the content of a trace, whereas system calls that appear rarely should be more discriminative. The TF-IDF weighting allows us to capture these properties and give more weight to system calls that are frequent in a trace but rare in the overall collection of traces.
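
A minimal sketch of this weighting scheme, assuming gensim for the embeddings and scikit-learn for the TF-IDF scores; the data, variable names, and the mean aggregation of the weighted vectors are illustrative assumptions rather than the paper's exact procedure.

    # Sketch: TF-IDF-weighted aggregation of system-call embeddings
    # (gensim + scikit-learn assumed; everything below is a toy example).
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.feature_extraction.text import TfidfVectorizer

    traces = [
        ["open", "read", "mmap", "close"],
        ["open", "write", "close"],
        ["socket", "connect", "send", "recv", "close"],
    ]

    # d-dimensional semantic vectors for each system call (as in the previous sketch).
    model = Word2Vec(sentences=traces, vector_size=16, window=3,
                     min_count=1, sg=1, epochs=50, seed=42)

    # Traces act as documents and system calls as terms, so calls that are
    # frequent in one trace but rare across the collection get high weights.
    tfidf = TfidfVectorizer(analyzer=lambda trace: trace)  # input is pre-tokenized
    scores = tfidf.fit_transform(traces)                   # (n_traces, vocabulary size)
    vocab = tfidf.vocabulary_                              # call name -> column index

    def trace_vector(i, trace):
        # Scale each call's embedding by its TF-IDF score, then average.
        # Mean aggregation is an assumption; the paper may combine vectors differently.
        weighted = [scores[i, vocab[call]] * model.wv[call] for call in trace]
        return np.mean(weighted, axis=0)

    X = np.vstack([trace_vector(i, t) for i, t in enumerate(traces)])
    print(X.shape)  # (3, 16): one weighted d-dimensional vector per trace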

Each trace t ∈ T is represented as a bag of system calls {s_1, s_2, ..., s_n} of arbitrary length. Next, the Word2Vec model converts a system call s into a semantic vector v(s) that is multiplied by the TF-IDF score tfidf(s, t, T), calculated as follows:
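
A conventional TF-IDF definition consistent with this description is given below; the exact variant adopted in the paper may differ.

$$\mathrm{tfidf}(s, t, T) = \mathrm{tf}(s, t) \cdot \log \frac{|T|}{|\{t' \in T : s \in t'\}|}$$

where tf(s, t) is the number of occurrences of system call s in trace t, and the denominator of the logarithm counts the traces in the collection T that contain s.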