
Corizzo et al. J Surveill Secur Saf 2020;1:140-50  I  http://dx.doi.org/10.20517/jsss.2020.15                                                 Page 141

               Keywords: Feature extraction, intrusion detection, network traffic, anomaly detection, word embeddings, language
               models





               INTRODUCTION
               Intrusion detection systems (IDS) play a fundamental role in modern organizations, providing defense
               mechanisms against cyberattacks. IDS monitor and analyze the traffic using different sources of
               information, with the purpose of identifying intrusions and other security breaches. Unlike
               firewalls, which limit access between networks to prevent intrusions, IDS evaluate a potential intrusion
               when it takes place, signal an alarm, and may terminate the connection. The most popular categories of IDS
               include network-based IDS and host-based IDS (HIDS) [1]. The former analyze network packets on an entire
               subnet [2,3], whereas the latter consist of an agent on a host that analyzes system calls, file system changes, and
               logs [4-7]. In this study, we focused on HIDS and, more specifically, machine learning-based tools to support
               it. One opportunity in this domain lies in monitoring and analyzing network traffic represented in the
               form of network sessions, also known as traces [8]. One of the most popular data representations for traces is
               the sequence of system calls [9], i.e., the sequence of requests that a program submits to the operating
               system kernel to perform any action. The ordering, type, length and other attributes of system calls made
               by an application process can provide a unique signature or trace. Such information is highly informative,
               and it is exploited in current IDS to help distinguish between normal and abnormal behaviors in a network
               session [10].

               Relevant benchmark datasets such as the Defense Advanced Research Projects Agency dataset [11] and
               the Knowledge Discovery and Data Mining Tools Competition (KDD’99) dataset [12] have been analyzed
               in a large number of studies for the past two decades [2-4,13-16]. However, such datasets do not cover
               up-to-date attack scenarios, and therefore, they are not considered to be challenging at present. More recently,
               the Australian Defence Force Academy Linux Dataset (ADFA-LD) [5,17,18], as well as the Next-Generation
               Intrusion Detection System Dataset (NGIDS-DS) [18,19] and the Web Conference 2019 (WWW2019) [20]
               datasets, succeeded in filling this gap, presenting new and relevant types of attacks conceived to assess the
               accuracy of modern intrusion detection tools. The datasets present thousands of system call traces collected
               from a Linux local server, with normal and attack behaviors.


               Traditional machine learning algorithms can be fruitfully exploited to identify malicious patterns in
               network sessions, which can be subsequently filtered. Examples of approaches in the literature include
               Support Vector Machines [13], Artificial Neural Networks [2], classification of association rules [14,15], decision
               trees [4], random forests [3], and ensembles of classifiers [16].
               However, machine learning algorithms cannot be directly applied to a raw data representation of network
               traffic, such as sequences of system calls. For this reason, an active thread of recent research [5-7] focuses on
               the design and implementation of feature extraction techniques that aim at mapping sequences of system
               calls to a new representation that can be processed by machine learning algorithms. Figure 1 shows the
               typical analytical workflow that is carried out to perform machine learning-based intrusion detection.
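
               The workflow described above (raw traces, then feature extraction, then a learned classifier) can be
               sketched in a few lines. This is only an illustrative sketch, not the pipeline used in the study: the toy
               traces, the fixed vocabulary VOCAB, the relative-frequency extractor, and the nearest-centroid classifier
               are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter
from math import dist

# Hypothetical toy data: each trace is a variable-length list of system call IDs.
TRAIN = [
    ([3, 4, 3, 4, 5], "normal"),
    ([3, 4, 5, 3, 4, 5, 3], "normal"),
    ([11, 11, 45, 11], "attack"),
    ([45, 11, 11, 11, 45, 11], "attack"),
]
VOCAB = [3, 4, 5, 11, 45]  # assumed fixed vocabulary of call IDs


def extract_features(trace):
    """Map a variable-length trace to a fixed-length vector (placeholder extractor)."""
    counts = Counter(trace)
    return [counts[call] / len(trace) for call in VOCAB]


def fit_centroids(samples):
    """Average the feature vectors of each class (nearest-centroid classifier)."""
    by_label = {}
    for trace, label in samples:
        by_label.setdefault(label, []).append(extract_features(trace))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(centroids, trace):
    """Assign the label whose centroid is closest in Euclidean distance."""
    vector = extract_features(trace)
    return min(centroids, key=lambda label: dist(vector, centroids[label]))


centroids = fit_centroids(TRAIN)
prediction_a = predict(centroids, [3, 4, 5, 4])
prediction_b = predict(centroids, [11, 45, 11])
```

               The classifier only ever sees the output of extract_features, so traces of any length are reduced to
               vectors of the same size before learning takes place.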

               Among feature extraction approaches in the literature, pattern-based and frequency-based methods
               are the most popular classes. Pattern-based approaches identify patterns in sessions, consisting of
               multiple co-occurring system calls in a trace, whereas frequency-based approaches [5,21,22] extract feature
               vectors in which each entry represents the frequency of a system call in a trace. Although the former generally
               lead to a more accurate profile of the normal class, they are computationally more expensive. On the other
               hand, the latter are more computationally efficient, but the resulting representation does not take into
               account the position of system calls in the trace [6].
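
               The loss of positional information in frequency-based representations can be illustrated with a small
               sketch. The two traces and the vocabulary below are hypothetical, and contiguous n-grams are used here
               only as a simple stand-in for pattern-based extraction; the cited methods may mine patterns differently.

```python
from collections import Counter


def frequency_vector(trace, vocab):
    """Count how often each system call in vocab occurs in the trace."""
    counts = Counter(trace)
    return [counts[call] for call in vocab]


def ngram_patterns(trace, n=2):
    """Collect the set of contiguous n-grams (an assumed, simple notion of pattern)."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


# Hypothetical traces: same system calls, opposite order.
vocab = ["open", "read", "write", "close"]
trace_a = ["open", "read", "write", "close"]
trace_b = ["close", "write", "read", "open"]

freq_a = frequency_vector(trace_a, vocab)
freq_b = frequency_vector(trace_b, vocab)
same_frequencies = freq_a == freq_b          # order is invisible to the counts

patterns_a = ngram_patterns(trace_a)
patterns_b = ngram_patterns(trace_b)
shared_patterns = patterns_a & patterns_b    # order changes every bigram
```

               The two traces produce identical frequency vectors but share no bigram, which is the trade-off
               described above: counting is cheap, while pattern extraction keeps ordering at a higher cost.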