Page 146 Corizzo et al. J Surveill Secur Saf 2020;1:140-50 | http://dx.doi.org/10.20517/jsss.2020.15

Table 1. Descriptive statistics for all datasets considered in this study

Dataset         Number of traces   Normal traces   Attack traces   Imbalance ratio
ADFA-LD [16]    5,951              5,205           746             6.98
NGIDS-DS [17]   37,377             19,256          18,121          1.06
WWW2019 [18]    152,630            43,725          108,905         0.40

The reported imbalance ratio represents the proportion between the number of samples of the majority class and the number of samples of the minority class.
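As a quick arithmetic check on Table 1, the sketch below recomputes the ratio from the trace counts. The function names are illustrative and not taken from the paper. The reported values match the normal-to-attack ratio for all three datasets; for WWW2019, where attack traces are the majority, a strict majority-to-minority ratio would instead be about 2.49.

```python
# Trace counts copied from Table 1: (normal traces, attack traces).
TRACE_COUNTS = {
    "ADFA-LD": (5205, 746),
    "NGIDS-DS": (19256, 18121),
    "WWW2019": (43725, 108905),
}


def normal_to_attack_ratio(normal, attack):
    """Ratio of normal traces to attack traces."""
    return normal / attack


def majority_to_minority_ratio(normal, attack):
    """Ratio of the larger class to the smaller class (always >= 1)."""
    return max(normal, attack) / min(normal, attack)


for name, (normal, attack) in TRACE_COUNTS.items():
    print(
        f"{name}: total={normal + attack}, "
        f"normal/attack={normal_to_attack_ratio(normal, attack):.2f}, "
        f"majority/minority={majority_to_minority_ratio(normal, attack):.2f}"
    )
```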

Doc2Vec
The goal of Doc2Vec is to create a numeric representation of a document, regardless of its length. While
word vectors represent the concept of a word, the document vector is intended to represent the concept of
an entire document. We propose this model as an alternative to Word2Vec for feature extraction applied
directly to network traces.

Experimental setup
In our experiments, we assessed 5 feature extraction methods on 3 intrusion detection datasets. Descriptive
statistics for all datasets considered in this study are reported in Table 1. For evaluation, we adopted a
stratified 5-fold cross-validation scheme. The classification algorithm considered in our experiments
was Extremely Randomized Trees (ERT), a state-of-the-art ensemble learning method based on decision
trees. We emphasize that identifying the best machine learning algorithm is beyond the scope of this paper.
The features extracted with our method are general and, in principle, any machine learning
algorithm can be used for classification. Our aim was to show the potential of the extracted features
when paired with a conventional machine learning algorithm for classification.
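The stratified 5-fold scheme can be illustrated with a minimal, standard-library-only sketch. The paper does not state which implementation was used; in practice a library routine such as scikit-learn's `StratifiedKFold` would normally be paired with an Extremely Randomized Trees classifier. The function name, the seed and the toy labels below are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, n_splits=5, seed=0):
    """Yield (train_indices, test_indices) pairs that preserve class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Shuffle each class separately and deal its samples round-robin into folds,
    # so every fold receives roughly the same share of each class.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)

    # Each fold serves once as the test set; the remaining folds form the training set.
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Toy labels (hypothetical): 20 normal traces (0) and 10 attack traces (1).
labels = [0] * 20 + [1] * 10
splits = list(stratified_k_fold(labels, n_splits=5))
for train, test in splits:
    attacks_in_test = sum(labels[i] for i in test)
    print(f"train={len(train)} test={len(test)} attack traces in test={attacks_in_test}")
```

Stratifying matters most for ADFA-LD, where attack traces are the minority class and an unstratified split could leave a fold with very few of them.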

For Word2Vec and Doc2Vec, we used a standard value for the embedding size ( ). For ERT, we
used a standard configuration for the number of trees parameter ( ). Since the datasets considered
were imbalanced, we report macro-averaged precision, recall and F-score, which give the same
importance to both classes in the average scores. We also report results in terms of area under the ROC
curve (AUC). All the experimental results are reported in Table 2.
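To make the metric choice concrete, the following sketch computes macro-averaged precision, recall and F-score, together with AUC, on a small hypothetical prediction set. It assumes the common convention in which the macro F-score is the unweighted mean of per-class F-scores; the paper does not spell out its exact formula. The labels and scores are invented for illustration and are not experimental results.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F-score) over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f_scores = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = len(classes)
    # Unweighted means: every class counts equally, whatever its size.
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 8 normal traces (0) and 2 attack traces (1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one attack trace is missed
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.5, 0.9]  # attack scores

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_p, macro_r, macro_f = macro_scores(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy={accuracy:.3f} macro P={macro_p:.3f} R={macro_r:.3f} "
      f"F={macro_f:.3f} AUC={auc:.4f}")
```

In this toy case accuracy is 0.9 even though half of the attack traces are missed, whereas macro recall drops to 0.75, which is why macro averaging is the more informative choice on imbalanced data.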

DISCUSSION
The results showed that word embedding-based feature extraction methods outperformed all competitors
by a good margin on the NGIDS-DS and WWW2019 datasets. In these cases, the
proposed variant of Word2Vec with TF-IDF weighting appeared to obtain the best results. This behavior
was not observed with the ADFA-LD dataset, where word embedding-based methods appeared sub-optimal.

One possible explanation is that, when most of the system calls appearing in network traces are sparsely
correlated, the semantic representation extracted by language models does not provide any advantage
over simpler frequency-based and pattern-based methods. On the contrary, the high dimensionality
of the new representation makes the classification task more difficult for the subsequent machine learning
algorithm.

Another aspect that could disadvantage word embedding representations is the imbalance ratio
between normal and attack traces. In fact, in the ADFA-LD dataset the imbalance ratio was 6.98, whereas
the NGIDS-DS and WWW2019 datasets were more balanced, having imbalance ratios of 1.06 and 0.40,
respectively [Table 1]. Class imbalance is known to lead to increased challenges in classification tasks [36].

It is noteworthy that, among the word embedding-based methods, Doc2Vec performs poorly in all cases.
This unexpected result shows that the preferred data granularity for traces in the context of intrusion
