Corizzo et al. J Surveill Secur Saf 2020;1:140-50 I http://dx.doi.org/10.20517/jsss.2020.15 Page 141
Keywords: Feature extraction, intrusion detection, network traffic, anomaly detection, word embeddings, language
models
INTRODUCTION
Intrusion detection systems (IDS) play a fundamental role in modern organizations, providing defense
mechanisms against cyberattacks. IDS monitor and analyze traffic using different sources of
information, with the purpose of identifying intrusions and other security breaches. Unlike
firewalls, which limit access between networks to prevent intrusions, IDS evaluate a potential intrusion
when it takes place, signal an alarm, and may terminate the connection. The most popular categories of IDS
include network-based IDS and host-based IDS (HIDS) [1]. The former analyze network packets on an entire
subnet [2,3], whereas the latter consist of an agent on a host that analyzes system calls, file system changes, and
logs [4-7]. In this study, we focused on HIDS and, more specifically, on machine learning-based tools to support
them. One opportunity in this domain lies in monitoring and analyzing network traffic represented in the
form of network sessions, also known as traces [8]. One of the most popular data representations for traces is
the sequence of system calls [9], i.e., the sequence of requests that programs submit to the operating
system kernel to perform any action. The ordering, type, length, and other attributes of the system calls made
by an application process can provide a unique signature or trace. Such a signature is highly informative,
and current IDS exploit it to help distinguish between normal and abnormal behaviors in a network
session [10].
Relevant benchmark datasets such as the Defense Advanced Research Projects Agency dataset [11] and
the Knowledge Discovery and Data Mining Tools Competition (KDD’99) dataset [12] have been analyzed
in a large number of studies over the past two decades [2-4,13-16]. However, these datasets do not cover
up-to-date attack scenarios and are therefore no longer considered challenging. More recently,
the Australian Defence Force Academy Linux Dataset (ADFA-LD) [5,17,18], as well as the Next-Generation
Intrusion Detection System Dataset (NGIDS-DS) [18,19] and the Web Conference 2019 (WWW2019) [20]
datasets, succeeded in filling this gap, presenting new and relevant types of attacks conceived to assess the
accuracy of modern intrusion detection tools. These datasets contain thousands of system call traces collected
from a local Linux server, covering both normal and attack behaviors.
Traditional machine learning algorithms can be fruitfully exploited to identify malicious patterns in
network sessions, which can be subsequently filtered. Examples of approaches in the literature include
Support Vector Machines [2], Artificial Neural Networks [13], classification of association rules [14,15], decision
trees [3], random forests [16], and ensembles of classifiers [4].
However, machine learning algorithms cannot be directly applied to a raw data representation of network
traffic, such as sequences of system calls. For this reason, an active thread of recent research [5-7] focuses on
the design and implementation of feature extraction techniques that aim at mapping sequences of system
calls to a new representation that can be processed by machine learning algorithms. Figure 1 shows the
typical analytical workflow that is carried out to perform machine learning-based intrusion detection.
Among feature extraction approaches in the literature, pattern-based and frequency-based methods
are the most popular classes. Pattern-based approaches identify patterns in sessions, consisting of
multiple co-occurring system calls in a trace, whereas frequency-based approaches [5,21,22] extract feature
vectors in which each entry represents the frequency of a system call in a trace. Although the former generally
lead to a more accurate profile of the normal class, they are computationally more expensive. The latter,
on the other hand, are more computationally efficient, but the resulting representation does not take into
account the position of system calls in the trace [6].
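
The trade-off between the two classes can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the function names, the toy traces, and the system call names are hypothetical, and contiguous n-grams are used only as one simple stand-in for pattern-based features. The sketch shows that two traces containing the same system calls in a different order receive identical frequency vectors, whereas an order-sensitive representation separates them.

```python
from collections import Counter


def frequency_vector(trace, vocabulary):
    """Frequency-based features: one count per system call in `vocabulary`."""
    counts = Counter(trace)
    return [counts[call] for call in vocabulary]


def ngram_counts(trace, n=2):
    """Order-sensitive features: counts of contiguous length-n call windows.

    Assumption: n-grams stand in for pattern-based features here; the
    approaches in the literature may define patterns differently.
    """
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


# Hypothetical traces: same system calls, different ordering.
vocabulary = ["open", "read", "write", "close"]
trace_a = ["open", "read", "write", "close"]
trace_b = ["open", "write", "read", "close"]

freq_a = frequency_vector(trace_a, vocabulary)
freq_b = frequency_vector(trace_b, vocabulary)

# The frequency vectors coincide because position is discarded.
print(freq_a == freq_b)
# The bigram counts differ because ordering is preserved.
print(ngram_counts(trace_a) == ngram_counts(trace_b))
```

In this sketch the frequency vector has a fixed length equal to the vocabulary size, while the number of distinct n-grams can grow rapidly with n, which is one way to see why order-aware representations tend to be more expensive to compute and store.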