Word embedding models are commonly adopted techniques for language modeling and feature learning in NLP. These techniques map words and sentences into low-dimensional feature vectors that can be exploited by automated analytical tools. Examples of word embedding techniques include neural networks [25], probabilistic models [26], and approaches based on dimensionality reduction applied to a word co-occurrence matrix [27].
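To make the idea concrete, the sketch below (a toy example, not from the paper) trains an embedding model with the gensim library and looks up the dense vector learned for a word; the corpus and all parameters are illustrative only.

```python
# Toy illustration (not from the paper): a word embedding model maps
# tokens to dense, low-dimensional vectors usable by downstream tools.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# vector_size sets the embedding dimension; all values here are toy choices.
model = Word2Vec(corpus, vector_size=8, window=2, min_count=1, epochs=50)

print(model.wv["cat"])                     # an 8-dimensional feature vector
print(model.wv.similarity("cat", "dog"))   # cosine similarity of embeddings
```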
Some word embedding techniques aim at extracting a vector representation for a word in terms of co-occurring words, whereas others express a word in terms of a vector of linguistic contexts [28]. Recently, particular interest has been devoted to the latter, since they attempt to characterize the semantics of words and sentences based on the intuition that a word is characterized by the company it keeps [29,30].
One example of a groundbreaking technique in this field is Word2Vec [25]. Its ability to represent implicit relationships between words has resulted in substantial machine learning improvements in domains characterized by contextual information. Some examples include the classification of news articles and tweets [31], the analysis of biological data for the prediction of therapeutic peptides [32], the detection of malware activity on Android devices [33], and the recommendation of contents in social networks [34].
Similar to these studies, the method proposed in this paper leverages Word2Vec to extract word embeddings. However, none of these approaches applies Word2Vec to network traffic sessions in the form of sequences of system calls. Our aim was to propose a pipeline that makes Word2Vec applicable to data in this domain. In addition, we proposed an approach to weight each extracted feature according to its importance.
The common result obtained in [31,33,34] is that performing the learning task on top of the new data representation extracted by word embedding models leads to improved accuracy. The motivation is that the extracted representation exposes useful semantic features that were hidden in the initial raw data representation, thus facilitating the subsequent classification task. Following the same intuition, and motivated by this success in different domains, our proposed method leverages a Word2Vec word embedding model to extract contextual information that can be exploited in the subsequent classification step by any machine learning algorithm. In particular, we exploit Word2Vec to obtain a low-dimensional numerical embedding vector that entails the semantic representation of a system call. Given a set of labeled traces, for which the class attribute is known (normal or attack), we train a Word2Vec model to generate semantic vectors for all traces. The feature extraction process from network traces exploiting a Word2Vec model is shown in Figure 2. One alternative to Word2Vec is Doc2Vec, which extracts a single representation for each document.
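As a minimal sketch of this step, assuming gensim's Word2Vec implementation, each trace can be fed to the model as a "sentence" whose "words" are system calls; the traces, call names, and hyperparameters below are illustrative, not the paper's actual configuration.

```python
# Hypothetical sketch: treat each trace as a sentence and each system
# call as a word, so Word2Vec learns a semantic vector per system call.
from gensim.models import Word2Vec

# Toy labeled traces; the labels themselves are not needed here, since
# Word2Vec training is unsupervised.
traces = [
    ["open", "read", "read", "write", "close"],    # e.g., labeled "normal"
    ["open", "mmap", "execve", "socket", "close"], # e.g., labeled "attack"
]

# sg=1 selects the skip-gram variant; all hyperparameters are toy values.
w2v = Word2Vec(traces, vector_size=32, window=3, min_count=1, sg=1, epochs=100)

call_vector = w2v.wv["read"]   # semantic embedding of the system call "read"
print(call_vector.shape)       # (32,)
```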
The novelty in this paper is to exploit Word2Vec in combination with a TF-IDF model [35]. More
specifically, a TF-IDF model is trained to subsequently perform a weighted transformation of the semantic
representation of a system call extracted by Word2Vec. The rationale for the adoption of such a model is
that the representation vector of a trace should be weighted according to the saliency of the system calls
it contains. More precisely, system calls that appear in several traces are less indicative of the content of a
trace, whereas system calls that appear rarely should be more discriminative. The TF-IDF weighting allows
us to capture these properties and give more weight to system calls that are frequent in a trace but rare in
the overall collection of traces.
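A hedged sketch of how this weighting might be combined with the embeddings is given below: each system call's Word2Vec vector is scaled by its TF-IDF score in the trace, and the scaled vectors are aggregated into one trace-level vector. The use of scikit-learn, the averaging aggregation, and all parameters are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch: TF-IDF-weighted aggregation of per-call embeddings.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

traces = [
    ["open", "read", "read", "write", "close"],
    ["open", "mmap", "execve", "socket", "close"],
]

# Per-call embeddings (toy hyperparameters, as in the previous sketch).
w2v = Word2Vec(traces, vector_size=32, window=3, min_count=1, sg=1, epochs=100)

# A callable analyzer lets TF-IDF operate directly on pre-tokenized traces.
tfidf = TfidfVectorizer(analyzer=lambda trace: trace)
scores = tfidf.fit_transform(traces)   # one row of TF-IDF scores per trace
vocab = tfidf.vocabulary_              # system call -> column index

def trace_vector(trace_idx, trace):
    """Scale each call's embedding by its TF-IDF score, then average."""
    weighted = [scores[trace_idx, vocab[call]] * w2v.wv[call] for call in trace]
    return np.mean(weighted, axis=0)

# One fixed-length vector per trace, ready for any downstream classifier.
X = np.vstack([trace_vector(i, t) for i, t in enumerate(traces)])
print(X.shape)  # (2, 32)
```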
Each trace is represented as a bag of system calls of arbitrary length. Next, the Word2Vec model converts each system call into a semantic vector that is multiplied by the TF-IDF score, calculated as follows: