﻿<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">Intell. Robot.</journal-id>
      <journal-id journal-id-type="publisher-id">IR</journal-id>
      <journal-title-group>
        <journal-title>Intelligence &amp; Robotics</journal-title>
      </journal-title-group>
      <issn pub-type="epub">2770-3541</issn>
      <publisher>
        <publisher-name>OAE Publishing Inc.</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.20517/ir.2026.13</article-id>
      <article-categories>
        <subj-group>
          <subject>Research Article</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Leveraging intelligent multimodal fusion for few-shot malware classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes">
          <name>
            <surname>Ren</surname>
            <given-names>Ying</given-names>
          </name>
          <xref ref-type="aff" rid="I1">
            <sup>1</sup>
          </xref>
          <xref ref-type="corresp" rid="cor1" />
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Liu</surname>
            <given-names>Ziyu</given-names>
          </name>
          <xref ref-type="aff" rid="I2">
            <sup>2</sup>
          </xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Wang</surname>
            <given-names>Junbo</given-names>
          </name>
          <xref ref-type="aff" rid="I3">
            <sup>3</sup>
          </xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Wang</surname>
            <given-names>Peng</given-names>
          </name>
          <xref ref-type="aff" rid="I4">
            <sup>4</sup>
          </xref>
        </contrib>
      </contrib-group>
      <aff id="I1">
        <sup>1</sup>Department of Outpatient, West China Hospital, Sichuan University, Chengdu 610065, Sichuan, China.</aff>
      <aff id="I2">
        <sup>2</sup>The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310058, Zhejiang, China.</aff>
      <aff id="I3">
        <sup>3</sup>College of Software Engineering, Sichuan University, Chengdu 610065, Sichuan, China.</aff>
      <aff id="I4">
        <sup>4</sup>College of Computer Science, Sichuan University, Chengdu 610065, Sichuan, China.</aff>
      <author-notes>
        <corresp id="cor1">Correspondence to: Prof. Ying Ren, Department of Outpatient, West China Hospital, Sichuan University, Chengdu 610065, Sichuan, China. E-mail: <email>zq156157@163.com</email></corresp>
        <fn fn-type="other">
          <p>
            <bold>Received:</bold> 9 Dec 2025 | <bold>First Decision:</bold> 13 Feb 2026 | <bold>Revised:</bold> 27 Mar 2026 | <bold>Accepted:</bold> 22 May 2026 | <bold>Published:</bold> 12 Jun 2026</p>
        </fn>
        <fn fn-type="other">
          <p>
            <bold>Academic Editor:</bold> Simon Yang | <bold>Copy Editor:</bold> Pei-Yun Wang | <bold>Production Editor:</bold> Pei-Yun Wang</p>
        </fn>
      </author-notes>
      <pub-date pub-type="ppub">
        <year>2026</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>12</day>
        <month>6</month>
        <year>2026</year>
      </pub-date>
      <volume>6</volume>
      <issue>2</issue>
      <fpage>253</fpage>
      <lpage>274</lpage>
      <permissions>
        <copyright-statement>© The Author(s) 2026.</copyright-statement>
        <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>© The Author(s) 2026. <bold>Open Access</bold> This article is licensed under a Creative Commons Attribution 4.0 International License (<uri xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</uri>), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.</license-p>
        </license>
      </permissions>
      <abstract>
        <p>Traditional malware classification methods rely heavily on extensive labeled data and single-modality features, which limits their adaptability to evolving threats. In this paper, we propose an intelligent multimodal fusion framework that leverages complementary information from static and dynamic analysis for few-shot malware classification. Specifically, we convert malware binaries into grayscale images to capture static characteristics and extract application programming interface (API) call sequences to represent dynamic behaviors. To effectively integrate these heterogeneous modalities under limited data conditions, we introduce a lightweight graph neural network-based intelligent feature fusion module. This module segments modality-specific features, constructs a bipartite graph between segments, and performs cross-modal message passing to learn fine-grained correlations. The fused representations are then used in a prototypical network for few-shot classification. We construct two malware datasets augmented with multimodal features and conduct extensive experiments under few-shot settings. Results demonstrate that our approach significantly outperforms both unimodal baselines and naive fusion methods, achieving up to 95.73% accuracy in 5-way 5-shot classification. Ablation studies and efficiency analysis confirm that our fusion module adds minimal computational overhead while enhancing both accuracy and interpretability. This work highlights the potential of intelligent multimodal integration for robust malware classification with limited labeled data.</p>
      </abstract>
      <kwd-group>
        <kwd>Intelligent multimodal learning</kwd>
        <kwd>multimodal feature fusion</kwd>
        <kwd>malware classification</kwd>
        <kwd>few-shot learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec1">
      <title>1. INTRODUCTION</title>
      <p>Cybersecurity has emerged as a critical concern in the face of increasing Internet usage. Among the numerous cyber threats, malware stands out as a persistent and constantly evolving menace that poses continuous challenges to the security of computer systems and networks. The term “malware” encompasses a wide range of malicious software, including viruses, worms, Trojans, ransomware, and spyware, among others<sup>[<xref ref-type="bibr" rid="B1">1</xref>]</sup>. These malicious programs are specifically designed to infiltrate, disrupt, or compromise the integrity of computer systems, often resulting in data breaches, financial losses, and even the compromise of critical infrastructure. Thus, the development of robust and adaptive malware classification systems is imperative in safeguarding digital ecosystems.</p>
      <p>Traditionally, malware classification has relied on signature-based detection methods, where malware samples are matched against predefined signatures or patterns. However, this approach is limited in effectiveness, as it fails to detect new or emerging malware variants, particularly those with limited labeled samples, a situation that poses a significant threat to cybersecurity. It is important to clarify that few-shot learning differs fundamentally from zero-day detection. While zero-day detection aims to identify completely unseen threats without any prior labeled samples, few-shot learning assumes the availability of a small number of labeled examples for new classes. In this work, we focus on the few-shot setting, where the goal is to effectively leverage limited labeled data to classify new or emerging malware families. This scenario is practically relevant in cybersecurity operations, where security analysts can often obtain a few samples of a new threat through initial incident response or threat intelligence sharing. In response, the cybersecurity community has been actively exploring advanced techniques such as machine learning and deep learning to enhance the accuracy and agility of malware classification. These approaches primarily utilize various neural networks to extract features for classification<sup>[<xref ref-type="bibr" rid="B2">2</xref>-<xref ref-type="bibr" rid="B5">5</xref>]</sup>. Nonetheless, traditional machine learning models often struggle to keep pace with the ever-changing landscape of malicious software due to the need for frequent retraining on large-scale datasets.</p>
      <p>One promising avenue of research that has gained considerable attention is the application of few-shot learning techniques to malware classification<sup>[<xref ref-type="bibr" rid="B6">6</xref>,<xref ref-type="bibr" rid="B7">7</xref>]</sup>. Few-shot learning refers to the ability of a model to learn and generalize from a limited number of examples, making it particularly suitable for detecting new malware samples when only a few labeled instances are available. This approach is crucial in the continually evolving cybersecurity landscape, where malware authors consistently create new variants to evade detection. Despite the impressive performance achieved by previous research, it has consistently relied on a single modality, such as visualized malware images or application programming interface (API) invocation sequences, which may provide only partial insights into malware behavior and structure.</p>
      <p>Recent advances in few-shot learning for malware classification have explored various meta-learning and metric-learning paradigms. Wang <italic>et al.</italic> introduced a multi-prototype modeling approach to capture intra-class diversity in malware families<sup>[<xref ref-type="bibr" rid="B6">6</xref>]</sup>. More recently, Wang <italic>et al.</italic> developed AGProto, an adaptive graph prototypical network that adjusts prototypes based on sample relationships<sup>[<xref ref-type="bibr" rid="B7">7</xref>]</sup>. Beyond prototypical networks, transductive few-shot learning methods and data augmentation techniques specifically designed for malware have also been explored to address data scarcity. However, these approaches predominantly rely on unimodal features, limiting their ability to capture the full spectrum of malware characteristics.</p>
      <p>Inspired by the observation that modern malware exhibits characteristics that extend beyond the confines of traditional data sources, this paper presents an intelligent multimodal fusion approach to malware classification that leverages multimodal information using few-shot learning. Our key insight is that static structural patterns (e.g., binary code converted to images) and dynamic behavioral traces (e.g., API call sequences) provide complementary views of malware, and intelligently fusing these modalities can yield more robust representations, especially under limited data conditions. Specifically, our proposed model combines both static features, such as image-based representations of malware binaries as visual clues, and dynamic features, such as API call sequences as textual information. This combination enables our model to capture diverse aspects of malware behavior and structure. To achieve this, we construct two datasets specifically tailored for few-shot malware classification. Furthermore, we devise a novel lightweight graph neural network (GNN)-based feature fusion module that operates on segmented features from both modalities, constructing a bipartite graph to enable cross-modal message passing. This design differs from prior multimodal fusion approaches that rely on simple concatenation or bilinear pooling, as it explicitly models inter-modal relationships at the segment level. The resulting fused multimodal features can be readily employed in various few-shot learning paradigms, particularly metric-based learning methods such as prototypical networks<sup>[<xref ref-type="bibr" rid="B8">8</xref>]</sup>.</p>
      <p>The major contributions of this work are as follows:<break/>• We propose a novel lightweight GNN-based feature fusion module that segments modality-specific features, constructs a bipartite graph between segments, and performs cross-modal message passing to learn fine-grained correlations between static and dynamic malware characteristics.<break/>• We introduce a modality-specific normalization strategy tailored for few-shot learning, which standardizes features from different modalities to a common scale using support set statistics, preventing modality dominance during fusion.<break/>• We construct two few-shot malware classification datasets augmented with multimodal features, encompassing grayscale image features of static characteristics and dynamic API call sequence features, providing a benchmark for future research in this direction.<break/>• We conduct extensive experimentation on two datasets, augmented with ablation studies and analyses, demonstrating that our approach outperforms both unimodal baselines and existing fusion methods while adding minimal computational overhead.</p>
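      <p>To make the second contribution and the prototypical classifier more concrete, the following minimal sketch normalizes each modality with statistics computed from the support set only and then classifies a query by its nearest class prototype. It rests on several assumptions that are not stated in the text: the normalization is taken to be a per-dimension z-score, the two modalities are simply concatenated as a stand-in for the GNN-based fusion module, and all function names and toy feature values are hypothetical.</p>
      <preformat><![CDATA[
```python
from statistics import mean, pstdev


def fit_scaler(support_vectors, eps=1e-8):
    """Per-dimension mean and standard deviation, from the support set only."""
    columns = list(zip(*support_vectors))
    return [mean(c) for c in columns], [pstdev(c) + eps for c in columns]


def normalize(vector, scaler):
    """Z-score a feature vector (assumed form of the normalization)."""
    means, stds = scaler
    return [(v - m) / s for v, m, s in zip(vector, means, stds)]


def prototypes(support, labels):
    """Class prototype = mean of the support embeddings of that class."""
    by_class = {}
    for vec, label in zip(support, labels):
        by_class.setdefault(label, []).append(vec)
    return {label: [mean(col) for col in zip(*vecs)]
            for label, vecs in by_class.items()}


def classify(query, protos):
    """Assign the query to the nearest prototype (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: dist(query, protos[label]))


# Toy 2-way 2-shot episode; the two modalities live on very different scales.
support_img = [[100.0, 200.0], [110.0, 210.0], [300.0, 50.0], [310.0, 60.0]]
support_api = [[0.1], [0.2], [0.9], [0.8]]
labels = ["famA", "famA", "famB", "famB"]

img_scaler = fit_scaler(support_img)
api_scaler = fit_scaler(support_api)


def embed(img, api):
    # Concatenation is only a placeholder for the fusion module.
    return normalize(img, img_scaler) + normalize(api, api_scaler)


support = [embed(i, a) for i, a in zip(support_img, support_api)]
protos = prototypes(support, labels)
prediction = classify(embed([105.0, 205.0], [0.15]), protos)
```
]]></preformat>
      <p>Because both scalers are fitted on the support set of the episode, no statistics from query samples enter the normalization, and neither modality dominates the distance computation merely because of its numeric range.</p>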
    </sec>
    <sec id="sec2">
      <title>2. RELATED WORK</title>
      <sec id="sec2-1">
        <title>2.1. Malware classification</title>
        <sec id="sec2-1-1">
          <title>2.1.1. Static analysis</title>
          <p>Static analysis is a primary technique in malware analysis, involving the extraction and selection of features from binary sequences, opcodes, function calls, printable strings, and other data found within executable malware files. These features are subsequently used for detection through machine learning or deep learning algorithms. Static analysis offers the benefits of effectiveness and efficiency but is also vulnerable to obfuscation and distortion techniques<sup>[<xref ref-type="bibr" rid="B9">9</xref>]</sup>.</p>
          <p>MalConv<sup>[<xref ref-type="bibr" rid="B10">10</xref>]</sup> initially processes the whole binary raw bytes into a deep learning network, utilizing convolutional neural networks (CNN) and recurrent neural networks (RNN) to extract features and perform classification. Gibert <italic>et al.</italic> examine the hierarchical nature of programs<sup>[<xref ref-type="bibr" rid="B11">11</xref>]</sup>, and introduce a hierarchical convolutional network to analyze byte sequences as n-gram-like features<sup>[<xref ref-type="bibr" rid="B12">12</xref>]</sup>.</p>
          <p>Inspired by rapid progress in computer vision, researchers have designed methods that convert malware into images so that deep learning models can recognize them<sup>[<xref ref-type="bibr" rid="B13">13</xref>,<xref ref-type="bibr" rid="B14">14</xref>]</sup>. Cui <italic>et al.</italic> translate the bytes of executable files into a grayscale image whose pixel values lie in the range of 0 to 255<sup>[<xref ref-type="bibr" rid="B13">13</xref>]</sup>. Yuan <italic>et al.</italic> introduce an approach that encodes the transition relationships within byte sequences as Markov images, capturing the interplay between malicious software bytes<sup>[<xref ref-type="bibr" rid="B15">15</xref>]</sup>. Sharma <italic>et al.</italic> compare grayscale, color, and Markov images for analyzing malicious software<sup>[<xref ref-type="bibr" rid="B16">16</xref>]</sup>. They propose the use of Gabor filtering to extract textures, identify focal areas, and recognize distinctive features. Moreover, the opcodes<sup>[<xref ref-type="bibr" rid="B17">17</xref>]</sup>, function calls<sup>[<xref ref-type="bibr" rid="B18">18</xref>,<xref ref-type="bibr" rid="B19">19</xref>]</sup>, and printable strings<sup>[<xref ref-type="bibr" rid="B20">20</xref>]</sup> obtained after decompilation of malicious software are also used for analysis.</p>
        </sec>
        <sec id="sec2-1-2">
          <title>2.1.2. Dynamic and hybrid analysis</title>
          <p>Dynamic analysis gathers data on the runtime behavior of samples by actively running and monitoring programs. The monitored data include system call APIs<sup>[<xref ref-type="bibr" rid="B5">5</xref>,<xref ref-type="bibr" rid="B21">21</xref>-<xref ref-type="bibr" rid="B25">25</xref>]</sup>, network traffic<sup>[<xref ref-type="bibr" rid="B26">26</xref>,<xref ref-type="bibr" rid="B27">27</xref>]</sup>, and file interactions<sup>[<xref ref-type="bibr" rid="B28">28</xref>]</sup>, which together capture the program’s operational intent and facilitate the identification and classification of its malicious characteristics<sup>[<xref ref-type="bibr" rid="B29">29</xref>]</sup>. Compared with static analysis, dynamic analysis provides more comprehensive information, but it incurs a higher analysis cost and can be evaded by malware that uses anti-virtualization, anti-sandbox, or time-triggered techniques<sup>[<xref ref-type="bibr" rid="B30">30</xref>]</sup>.</p>
          <p>To counter malware evasion techniques, many works combine static and dynamic features to represent malware more comprehensively<sup>[<xref ref-type="bibr" rid="B31">31</xref>]</sup>. Specifically, Yoo <italic>et al.</italic> extract various static and dynamic features, such as file size, header size, number of APIs, information entropy of different segments, API sequence, file information, locks, registry, and more, to make decisions based on machine learning methods<sup>[<xref ref-type="bibr" rid="B32">32</xref>]</sup>. Nguyen <italic>et al.</italic> conduct botnet analysis with printable string information<sup>[<xref ref-type="bibr" rid="B33">33</xref>]</sup>. O’Shaughnessy <italic>et al.</italic> combine static malware executable files and dynamic process memory dump information for malware analysis with the KNN-HOG model<sup>[<xref ref-type="bibr" rid="B34">34</xref>]</sup>.</p>
        </sec>
        <sec id="sec2-1-3">
          <title>2.1.3. Multimodal analysis</title>
          <p>Static and dynamic analyses have demonstrated strong potential in malware research. However, these methods are susceptible to failure due to the fast evolution of malware characterized by techniques like obfuscation, evasion, and fuzzing. Consequently, researchers have turned their attention towards multimodal approaches, seeking to integrate information across diverse modalities for enhanced efficacy. Kim <italic>et al.</italic> introduce multimodal deep learning for Android malware analysis, utilizing Android manifest, dex, and .so files to extract distinct features<sup>[<xref ref-type="bibr" rid="B35">35</xref>]</sup>.</p>
          <p>Similarly, Gibert <italic>et al.</italic> treat the API sequence, bytecode, and opcode of malicious software as three modal data types, designing dedicated deep neural network modules to achieve more effective enhancements<sup>[<xref ref-type="bibr" rid="B36">36</xref>]</sup>. Dib <italic>et al.</italic> utilize two modalities, grayscale image representations from binary files and readable characters as a text representation from disassembled code, for malware classification<sup>[<xref ref-type="bibr" rid="B37">37</xref>]</sup>.</p>
          <p>Recent advances have also explored more sophisticated multimodal frameworks to address the evolving nature of malware. He <italic>et al.</italic> proposed DREAM, a system that combines classifier and expert knowledge within a unified model to combat concept drift in Android malware classification<sup>[<xref ref-type="bibr" rid="B38">38</xref>]</sup>. Their approach embeds malware behavioral concepts within the latent space of a contrastive autoencoder while constraining sample reconstruction based on classifier predictions, enabling more effective drift detection and adaptation. DREAM integrates both static and dynamic behavioral concepts, demonstrating the value of incorporating expert knowledge into multimodal learning.</p>
          <p>Chai <italic>et al.</italic> introduced MalFSCIL, a few-shot class-incremental learning framework for malware detection that addresses the challenges of catastrophic forgetting and decision boundary confusion<sup>[<xref ref-type="bibr" rid="B39">39</xref>]</sup>. Their method employs a decoupled training strategy combining a Variational Autoencoder (VAE) for feature enhancement with graph attention networks for dynamic boundary delineation based on class prototypes. MalFSCIL demonstrates the effectiveness of integrating generative models with incremental learning in adapting to new malware families with limited samples.</p>
          <p>Different from previous works, in this paper we utilize both static information (grayscale image converted from binary files) and dynamic analysis (API invocation sequences) as visual and textual modalities to obtain comprehensive representations. While DREAM focuses on concept drift detection and MalFSCIL addresses class-incremental learning, we propose a novel GNN-based fusion network specifically designed for few-shot malware classification. Unlike these approaches that primarily focus on single-modality enhancement or detection paradigms, our method explicitly models cross-modal correlations at the segment level, enabling fine-grained integration of complementary information from static and dynamic analysis.</p>
        </sec>
      </sec>
      <sec id="sec2-2">
        <title>2.2. Multimodal feature fusion</title>
        <p>Generally, multimodal feature fusion can benefit both unimodal tasks, e.g., image classification assisted by text<sup>[<xref ref-type="bibr" rid="B40">40</xref>,<xref ref-type="bibr" rid="B41">41</xref>]</sup>, and multimodal tasks such as visual question answering (VQA)<sup>[<xref ref-type="bibr" rid="B42">42</xref>]</sup> and image caption generation<sup>[<xref ref-type="bibr" rid="B43">43</xref>,<xref ref-type="bibr" rid="B44">44</xref>]</sup>. In early studies, Antol <italic>et al.</italic><sup>[<xref ref-type="bibr" rid="B42">42</xref>]</sup> utilized VGGNet<sup>[<xref ref-type="bibr" rid="B45">45</xref>]</sup> and LSTM<sup>[<xref ref-type="bibr" rid="B46">46</xref>]</sup> to extract visual and textual features, respectively, and fused them with simple mechanisms such as addition, concatenation, and multiplication. The stacked attention network (SAN)<sup>[<xref ref-type="bibr" rid="B47">47</xref>]</sup>, designed for VQA, progressively searches for relevant image regions using semantic representations of the question. Based on low-rank bilinear pooling, the bilinear attention network (BAN)<sup>[<xref ref-type="bibr" rid="B48">48</xref>]</sup> generates bilinear attention maps to fuse multimodal features. These two fusion methods are also employed in this work. Recently, Transformer<sup>[<xref ref-type="bibr" rid="B49">49</xref>]</sup>-based vision-and-language fusion has become a popular paradigm. A typical process is to extract text features with BERT<sup>[<xref ref-type="bibr" rid="B50">50</xref>]</sup> and fuse them with visual features via the self-attention mechanism. In this paper, we tailor a Transformer-based architecture for fusing multimodal representations of malware.</p>
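        <p>For reference, the three simple fusion mechanisms mentioned above (addition, concatenation, and multiplication) can be written in a few lines. This is only an illustrative sketch on toy vectors: the function names are hypothetical, and the element-wise variants assume that both modalities have already been projected to the same dimension.</p>
        <preformat><![CDATA[
```python
def _check_same_length(visual, textual):
    if len(visual) != len(textual):
        raise ValueError("element-wise fusion needs equal-length vectors")


def fuse_add(visual, textual):
    """Element-wise addition of two equally sized feature vectors."""
    _check_same_length(visual, textual)
    return [v + t for v, t in zip(visual, textual)]


def fuse_multiply(visual, textual):
    """Element-wise (Hadamard) product of two equally sized feature vectors."""
    _check_same_length(visual, textual)
    return [v * t for v, t in zip(visual, textual)]


def fuse_concat(visual, textual):
    """Concatenation; the two vectors may have different lengths."""
    return list(visual) + list(textual)


visual_feature = [0.5, 1.0, -2.0]   # toy values, not real model outputs
textual_feature = [2.0, 0.0, 1.0]

added = fuse_add(visual_feature, textual_feature)
multiplied = fuse_multiply(visual_feature, textual_feature)
concatenated = fuse_concat(visual_feature, textual_feature)
```
]]></preformat>
        <p>None of these operators models which part of one modality relates to which part of the other, which is the gap that attention-based and graph-based fusion methods aim to close.</p>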
      </sec>
    </sec>
    <sec id="sec3">
      <title>3. DATASET CONSTRUCTION</title>
      <sec id="sec3-1">
        <title>3.1. Dataset description</title>
        <p>We construct two datasets for few-shot malware classification: VirusShare-M and LargePE-M. Both contain Windows PE malware samples with associated grayscale images and API call sequences.</p>
        <p>
          <bold>VirusShare-M:</bold> Malware samples are downloaded from VirusShare and labeled using AVClass based on VirusTotal reports. We select multiple families and split them into meta-training, meta-validation, and meta-test sets with disjoint family labels. Detailed family names are provided in the <inline-supplementary-material content-type="local-data" mimetype="application/pdf" xlink:href="ir6013-SupplementaryMaterials.pdf">Supplementary Materials</inline-supplementary-material>.</p>
        <p>
          <bold>LargePE-M:</bold> Samples are collected from a large in-house repository of Windows PE malware. Families with sufficient samples that executed successfully (i.e., those that generated API sequences in the Cuckoo sandbox) are selected and split into disjoint training, validation, and test sets by family. Detailed family names are provided in the <inline-supplementary-material content-type="local-data" mimetype="application/pdf" xlink:href="ir6013-SupplementaryMaterials.pdf">Supplementary Materials</inline-supplementary-material>.</p>
        <p>
          <bold>Data Leakage Prevention:</bold> To ensure valid few-shot evaluation, we take three precautions: (1) training, validation, and test sets contain completely disjoint malware families; (2) all preprocessing steps [term frequency-inverse document frequency (TF-IDF) calculation, word2vec training] are performed solely on the training set; (3) N-gram vocabulary and TF-IDF weights are derived exclusively from training samples. These measures prevent information leakage and ensure that evaluation reflects generalization to unseen families.</p>
        <p>Detailed family names and class distributions for both datasets are provided in the <inline-supplementary-material content-type="local-data" mimetype="application/pdf" xlink:href="ir6013-SupplementaryMaterials.pdf">Supplementary Tables 1 and 2</inline-supplementary-material>.</p>
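        <p>The family-disjoint split behind precaution (1), and the restriction of preprocessing to training samples behind precautions (2) and (3), can be sketched as follows. The family names, the split sizes, the sample records, and both function names are hypothetical placeholders; the actual families and counts are those listed in the Supplementary Materials.</p>
        <preformat><![CDATA[
```python
import random


def split_families(families, n_val, n_test, seed=0):
    """Shuffle the family names and cut them into three disjoint groups."""
    names = sorted(families)
    random.Random(seed).shuffle(names)
    test = set(names[:n_test])
    val = set(names[n_test:n_test + n_val])
    train = set(names[n_test + n_val:])
    return train, val, test


def training_samples(samples, train_families):
    """Only these samples may be used to fit TF-IDF weights or word2vec."""
    return [s for s in samples if s["family"] in train_families]


# Hypothetical family names and sample records, for illustration only.
families = [f"family_{i:02d}" for i in range(10)]
train, val, test = split_families(families, n_val=2, n_test=3)

samples = [{"id": f"sample_{i}", "family": families[i % 10]} for i in range(30)]
fit_pool = training_samples(samples, train)
```
]]></preformat>
        <p>Splitting by family rather than by sample is what makes the evaluation few-shot in the intended sense: every class seen at test time is new to the model, and no statistic computed during preprocessing has seen a validation or test family.</p>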
      </sec>
      <sec id="sec3-2">
        <title>3.2. Visual malware image conversion</title>
        <p>Static grayscale image features of malware have been widely used in research on malware detection and classification<sup>[<xref ref-type="bibr" rid="B2">2</xref>,<xref ref-type="bibr" rid="B51">51</xref>,<xref ref-type="bibr" rid="B52">52</xref>]</sup>. We follow the approach proposed by Nataraj <italic>et al.</italic>, in which each byte (8 bits) of the malware binary is mapped to an integer in the range of 0-255<sup>[<xref ref-type="bibr" rid="B51">51</xref>]</sup>. The width of the image is then determined by the size of the malware sample (as outlined in <xref ref-type="table" rid="t1">Table 1</xref>). To avoid discarding trailing bytes, we pad the last row of pixels with zeros when the number of bytes is not a multiple of the image width. Finally, the images are resized to a standardized dimension of 256 × 256 pixels, and the pixel values are normalized to obtain the final grayscale images (as depicted in <xref ref-type="fig" rid="fig1">Figure 1</xref>, left).</p>
        <fig id="fig1" position="float" width="500">
          <label>Figure 1</label>
          <caption>
            <p>Conversion of a malware executable into two information sources, one for each modality.</p>
          </caption>
          <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="ir6013.fig.1.jpg" />
        </fig>
        <table-wrap id="t1">
          <label>Table 1</label>
          <caption>
            <p>Image width corresponding to converted malware of different sizes</p>
          </caption>
          <table frame="hsides" rules="groups">
            <thead>
              <tr>
                <td style="border-bottom:1;">
                  <bold>File size range</bold>
                </td>
                <td style="border-bottom:1;">
                  <bold>Image width</bold>
                </td>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>&lt; 10 kB</td>
                <td>32</td>
              </tr>
              <tr>
                <td>10-30 kB</td>
                <td>64</td>
              </tr>
              <tr>
                <td>30-60 kB</td>
                <td>128</td>
              </tr>
              <tr>
                <td>60-100 kB</td>
                <td>256</td>
              </tr>
              <tr>
                <td>100-200 kB</td>
                <td>384</td>
              </tr>
              <tr>
                <td>200-500 kB</td>
                <td>512</td>
              </tr>
              <tr>
                <td>500-1,000 kB</td>
                <td>768</td>
              </tr>
              <tr>
                <td>&gt; 1,000 kB</td>
                <td>1,024</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
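        <p>The conversion described in this subsection can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the exact pipeline: 1 kB is taken as 1,024 bytes, a file size that falls exactly on a boundary of <xref ref-type="table" rid="t1">Table 1</xref> is assigned to the larger width, nearest-neighbour sampling stands in for the unspecified resizing method, pixel values are normalized by dividing by 255, and all function names are hypothetical. The input is assumed to be non-empty.</p>
        <preformat><![CDATA[
```python
import math

# (upper bound in kB, image width), following Table 1.
WIDTH_TABLE = [(10, 32), (30, 64), (60, 128), (100, 256),
               (200, 384), (500, 512), (1000, 768)]
MAX_WIDTH = 1024


def image_width(num_bytes):
    """Pick the image width for a file of num_bytes bytes."""
    size_kb = num_bytes / 1024          # assumption: 1 kB = 1,024 bytes
    for upper_kb, width in WIDTH_TABLE:
        if size_kb < upper_kb:
            return width
    return MAX_WIDTH


def bytes_to_rows(data):
    """Map each byte to one pixel (0-255) and zero-pad the last row."""
    width = image_width(len(data))
    height = math.ceil(len(data) / width)
    padded = data + bytes(width * height - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def resize_nearest(rows, size=256):
    """Nearest-neighbour resize to size x size, scaled to [0, 1]."""
    height, width = len(rows), len(rows[0])
    return [[rows[y * height // size][x * width // size] / 255.0
             for x in range(size)]
            for y in range(size)]


# Synthetic bytes, not a real malware sample.
sample = bytes(range(256)) * 20 + b"\x01\x02\x03"
rows = bytes_to_rows(sample)
image = resize_nearest(rows)
```
]]></preformat>
        <p>In practice, an image library would normally perform the resizing with a proper interpolation filter; the pure-Python version is used here only to keep the sketch self-contained.</p>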
      </sec>
      <sec id="sec3-3">
        <title>3.3. Textual API sequence generation</title>
        <p>To capture the behavioral characteristics of malicious software, we extract API invocation sequences from malware samples using a feature extraction process similar to the one described by Wang <italic>et al.</italic><sup>[<xref ref-type="bibr" rid="B6">6</xref>]</sup>. Specifically, we execute malware samples within a virtual environment using the Cuckoo sandbox (<uri xlink:href="https://cuckoosandbox.org/">https://cuckoosandbox.org/</uri>). Each execution generates a detailed report that includes the API invocation sequences, as illustrated in <xref ref-type="fig" rid="fig1">Figure 1</xref> (right). From these Cuckoo-generated reports, we retain only the API names (e.g., “LdrLoadDll”, “LdrGetProcedureAddress”, “LdrGetDllHandle”) and disregard both the parameters and the return values of each API invocation. Given that malware may initiate multiple processes concurrently, we concatenate the APIs invoked by different processes to linearize the API sequence. Samples whose sequences contain fewer than 10 API calls are excluded, since such short sequences often indicate that the malware failed to run properly within the virtual environment.</p>
        <p>Following the acquisition of malware API invocation sequences, refinement is necessary to eliminate redundant features resulting from repeated executions, such as file loading or execution loops. Consistent with the preprocessing methodology outlined in Wang <italic>et al.</italic>, we omit redundant API subsequences wherein the same API appears more than twice consecutively<sup>[<xref ref-type="bibr" rid="B6">6</xref>]</sup>.</p>
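        <p>A minimal sketch of the linearization, length filter, and redundancy removal is given below. Several details are assumptions made for illustration: the input is a simplified list of per-process API name lists rather than the actual Cuckoo report format, the length filter is applied before redundancy removal, and a run of identical consecutive APIs is shortened to two entries rather than removed entirely. The function names are hypothetical.</p>
        <preformat><![CDATA[
```python
from itertools import groupby

MIN_LENGTH = 10   # shorter sequences are treated as failed executions
MAX_RUN = 2       # assumption: keep at most two consecutive copies of an API


def linearize(process_calls):
    """Concatenate the API names of each process, in the given process order."""
    return [api for calls in process_calls for api in calls]


def collapse_runs(sequence, max_run=MAX_RUN):
    """Shorten every run of one repeated API name to at most max_run items."""
    cleaned = []
    for api, run in groupby(sequence):
        cleaned.extend([api] * min(len(list(run)), max_run))
    return cleaned


def preprocess(process_calls):
    """Return the cleaned sequence, or None if the sample is excluded."""
    sequence = linearize(process_calls)
    if len(sequence) < MIN_LENGTH:
        return None
    return collapse_runs(sequence)


# Two processes of one hypothetical sample.
report = [
    ["LdrLoadDll"] * 5 + ["LdrGetProcedureAddress"] * 3,
    ["LdrGetDllHandle", "LdrLoadDll", "LdrLoadDll"],
]
cleaned = preprocess(report)
```
]]></preformat>
        <p>In this toy example, the 11 raw calls pass the length filter and are reduced to 7, with the loop of five consecutive “LdrLoadDll” calls shortened to two.</p>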
        <p>To tokenize API sequences for textual feature extraction, we adopt an N-gram approach, which slides a window of size <italic>N</italic> over the API sequence. Specifically, we divide the API sequence of length <italic>a</italic> into (<italic>a</italic> - <italic>N</italic> + 1) N-gram items, treating the resultant N-gram items as new sequence features. This expansion increases the size of the sequence’s word dictionary from <italic>α</italic> to at most <italic>α<sup>N</sup></italic> if there are <italic>α</italic> unique APIs in the original sequences. Subsequently, we compute the importance of each N-gram using the TF-IDF method [Equation (1)].</p>
        <p><disp-formula> <label>(1)</label> <tex-math id="E1"> $$  \mathrm{TF\text{-}IDF}(\alpha_i)=\mathrm{TF}(\alpha_i)\cdot \mathrm{IDF}(\alpha_i)=\frac{n_i}{\sum_j n_j}\cdot\log\left(\frac{n}{f(\alpha_i)+1}\right) $$ </tex-math></disp-formula></p>
        <p>Here, <italic>α<sub>i</sub></italic> represents the <italic>i</italic>-th N-gram item, <italic>n<sub>i</sub></italic> denotes the frequency of occurrence of the <italic>i</italic>-th N-gram, <italic>n</italic> is the total number of malware samples, and <italic>f</italic>(<italic>α<sub>i</sub></italic>) signifies the count of malware samples containing the <italic>i</italic>-th N-gram item. After computing the importance of each N-gram item, we retain only the top-<italic>k</italic> N-grams ranked by weight and discard the rest. Subsequently, for each sample’s N-gram sequence, we keep only the first <italic>l</italic> N-gram items as the sample’s features. Finally, we employ the widely used word2vec technique from the field of natural language processing (NLP) to embed these N-gram items. Specifically, we utilize a skip-gram model for pretraining, and the parameters of the word2vec model, along with the hyperparameters of the aforementioned feature extraction process, are detailed in <xref ref-type="table" rid="t2">Table 2</xref>.</p>
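        <p>The N-gram tokenization, the weighting of Equation (1), the top-<italic>k</italic> selection, and the truncation to <italic>l</italic> items can be sketched as follows. The sketch assumes that <italic>n<sub>i</sub></italic> is counted over the whole training corpus rather than per sample, which the text leaves open, and it uses single-letter tokens in place of real API names. The function names are hypothetical, and the word2vec embedding step is omitted.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def ngrams(sequence, n=3):
    """Slide a window of size n over the sequence, giving a - n + 1 items."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def tfidf_weights(corpus):
    """Equation (1), with n_i counted over the whole corpus (assumption)."""
    term_count = Counter(g for sample in corpus for g in sample)
    doc_count = Counter(g for sample in corpus for g in set(sample))
    total = sum(term_count.values())
    n_samples = len(corpus)
    return {g: (term_count[g] / total)
               * math.log(n_samples / (doc_count[g] + 1))
            for g in term_count}


def build_vocabulary(corpus, k=2000):
    """Keep the k N-grams with the largest TF-IDF weight."""
    weights = tfidf_weights(corpus)
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:k])


def select_features(sample, vocabulary, max_len=300):
    """Drop N-grams outside the vocabulary, then keep the first max_len."""
    return [g for g in sample if g in vocabulary][:max_len]


# Toy training sequences; single letters stand in for API names.
train_sequences = [list("ABCD"), list("ABCA"), list("BCDE"), list("XYZX")]
corpus = [ngrams(seq) for seq in train_sequences]
weights = tfidf_weights(corpus)
vocabulary = build_vocabulary(corpus, k=4)
features = [select_features(sample, vocabulary) for sample in corpus]
```
]]></preformat>
        <p>The defaults <monospace>k=2000</monospace>, <monospace>max_len=300</monospace>, and <monospace>n=3</monospace> mirror the values in <xref ref-type="table" rid="t2">Table 2</xref>. Note that, with the +1 in the denominator of Equation (1), an N-gram that occurs in every sample receives a negative weight and is therefore ranked last.</p>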
        <table-wrap id="t2">
          <label>Table 2</label>
          <caption>
            <p>Hyperparameters for API sequence processing</p>
          </caption>
          <table frame="hsides" rules="groups">
            <thead>
              <tr>
                <td style="border-bottom:1;">
                  <bold>Type</bold>
                </td>
                <td style="border-bottom:1;">
                  <bold>Hyperparameter</bold>
                </td>
                <td style="border-bottom:1;">
                  <bold>Value</bold>
                </td>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td rowspan="3">Preprocessing</td>
                <td>Number of TF-IDF top items <italic>k</italic></td>
                <td>2,000</td>
              </tr>
              <tr>
                <td>Maximum sequence length <italic>l</italic></td>
                <td>300</td>
              </tr>
              <tr>
                <td>N-gram window size</td>
                <td>3</td>
              </tr>
              <tr>
                <td rowspan="3">word2vec</td>
                <td>Embedding-dim</td>
                <td>300</td>
              </tr>
              <tr>
                <td>Learning rate</td>
                <td>0.025 → 0.001</td>
              </tr>
              <tr>
                <td>Training epoch</td>
                <td>5</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn>
              <p>API: Application programming interface; TF-IDF: term frequency-inverse document frequency.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
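        <p>As an illustration only, the preprocessing steps above can be sketched in plain Python. The function names (<italic>collapse_runs</italic>, <italic>ngrams</italic>, <italic>tfidf_weights</italic>, <italic>extract_features</italic>) and the toy API names are hypothetical. The sketch assumes that redundancy removal caps each run of identical consecutive APIs at two calls, that <italic>n</italic> in Equation (1) is the number of samples, and that N-grams outside the top-<italic>k</italic> are dropped before truncation to length <italic>l</italic>; the embedding step with word2vec is omitted.</p>
        <preformat>
```python
import math
from collections import Counter


def collapse_runs(apis, max_run=2):
    """Keep at most max_run consecutive copies of the same API call."""
    out, run = [], 0
    for api in apis:
        run = run + 1 if out and out[-1] == api else 1
        if run > max_run:
            continue  # third and later repeats are treated as redundant
        out.append(api)
    return out


def ngrams(apis, n=3):
    """Slide a window of size n over the sequence: len(apis) - n + 1 items."""
    return [tuple(apis[i:i + n]) for i in range(len(apis) - n + 1)]


def tfidf_weights(samples):
    """Equation (1), with n taken to be the number of samples (assumption)."""
    counts = Counter(g for s in samples for g in s)          # n_i
    doc_freq = Counter(g for s in samples for g in set(s))   # f(alpha_i)
    total = sum(counts.values())                             # sum_j n_j
    n = len(samples)
    return {g: c / total * math.log(n / (doc_freq[g] + 1))
            for g, c in counts.items()}


def extract_features(samples, k=2000, l=300):
    """Keep the k highest-weighted N-grams, then the first l items per sample."""
    weights = tfidf_weights(samples)
    top = set(sorted(weights, key=weights.get, reverse=True)[:k])
    return [[g for g in s if g in top][:l] for s in samples]


# Toy API traces (hypothetical call names).
raw = [
    ["open", "read", "read", "read", "read", "close"],
    ["open", "write", "close", "open", "write", "close"],
    ["open", "write", "close"],
]
grams = [ngrams(collapse_runs(seq)) for seq in raw]
features = extract_features(grams, k=3, l=2)
```
</preformat>
        <p>In this toy corpus, the N-gram that occurs in two of the three samples receives a weight of zero under the smoothed IDF term and is therefore excluded from the retained vocabulary, which illustrates how TF-IDF down-weights N-grams shared across many samples.</p>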
        <p>While our datasets are specifically designed for few-shot evaluation - where each class contains a limited number of samples - we acknowledge that real-world malware ecosystems are more complex. In practice, malware families exhibit long-tailed distributions, temporal evolution, and concept drift. Our current datasets, derived from curated families with balanced samples, do not fully capture these dynamics. However, they provide a controlled benchmark for evaluating few-shot learning algorithms under class-disjoint settings.</p>
      </sec>
    </sec>
    <sec id="sec4">
      <title>4. METHODS</title>
      <p>In this section, we first present some preliminaries about few-shot malware classification and an overview of the proposed multimodal framework for this task. Then, we introduce the unimodal feature extraction module for each modality. Finally, we introduce our proposed multimodal feature fusion module based on GNN. The flow of the overall framework is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>.</p>
      <fig id="fig2" position="float">
        <label>Figure 2</label>
        <caption>
          <p>Framework of leveraging multimodal features for few-shot malware classification. Initially, malware images and API invocation sequences are derived from malware binary files and then fed into two distinct encoders, yielding respective unimodal features. Within the feature fusion module, features from both modalities undergo pooling and normalization separately before being segmented. The segmented features are then integrated through a GNN for feature fusion. Classification is then performed using a prototypical network. API: Application programming interface; GNN: graph neural network.</p>
        </caption>
        <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="ir6013.fig.2.jpg" />
      </fig>
      <sec id="sec4-1">
        <title>4.1. Preliminaries</title>
        <p>The goal of few-shot learning is to make accurate predictions when provided with only a limited amount of labeled data. This is particularly suitable for malware classification, since large numbers of malware samples are difficult to obtain in most cases. To achieve this, we adopt the widely used meta-learning paradigm<sup>[<xref ref-type="bibr" rid="B53">53</xref>,<xref ref-type="bibr" rid="B54">54</xref>]</sup>. Given a malware dataset <inline-formula><tex-math id="M1">$$ \mathcal{D} $$</tex-math></inline-formula>, we split it into three parts with disjoint classes: meta-training <inline-formula><tex-math id="M1">$$ \mathcal{D} $$</tex-math></inline-formula><italic><sup>train</sup></italic>, meta-validation <inline-formula><tex-math id="M1">$$ \mathcal{D} $$</tex-math></inline-formula><italic><sup>val</sup></italic>, and meta-test <inline-formula><tex-math id="M1">$$ \mathcal{D} $$</tex-math></inline-formula><italic><sup>test</sup></italic>. Each task in each part comprises a support set <italic>S</italic> and a query set <italic>Q</italic> that share the same label space.</p>
        <p>The objective of meta-learning is to correctly classify the samples in the query set <italic>Q</italic> into <italic>N</italic> classes based on the labeled samples in the support set <italic>S</italic>. When the support set <italic>S</italic> contains <italic>N</italic> families and each family has <italic>K</italic> samples, the few-shot classification task is called an <italic>N</italic>-way <italic>K</italic>-shot task. We use meta-training <inline-formula><tex-math id="M1">$$ \mathcal{D} $$</tex-math></inline-formula><italic><sup>train</sup></italic> to train the model, meta-validation <inline-formula><tex-math id="M1">$$ \mathcal{D} $$</tex-math></inline-formula><italic><sup>val</sup></italic> to tune hyperparameters and assess the model’s generalization performance, and meta-test <inline-formula><tex-math id="M1">$$ \mathcal{D} $$</tex-math></inline-formula><italic><sup>test</sup></italic> to evaluate the model on unseen classes/families.</p>
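        <p>To make the episodic setting concrete, the following minimal sketch draws a single <italic>N</italic>-way <italic>K</italic>-shot task from a dictionary that maps family names to samples. The function name <italic>sample_episode</italic>, the number of query samples per family (<italic>n_query</italic>), and the toy data are assumptions introduced for illustration and are not taken from our implementation.</p>
        <preformat>
```python
import random


def sample_episode(data_by_family, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot task as (support, query) lists of (sample, label)."""
    families = rng.sample(sorted(data_by_family), n_way)
    support, query = [], []
    for label, family in enumerate(families):
        # Support and query samples of a family never overlap.
        picked = rng.sample(data_by_family[family], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy split: 6 families with 8 samples each (hypothetical identifiers).
toy_split = {f"family{i}": [f"family{i}_sample{j}" for j in range(8)]
             for i in range(6)}
support, query = sample_episode(toy_split, n_way=5, k_shot=5, n_query=3,
                                rng=random.Random(0))
```
</preformat>
        <p>Because the support and query sets are drawn from the same sampled families, they share one label space, while families are never shared across the meta-training, meta-validation, and meta-test splits.</p>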
      </sec>
      <sec id="sec4-2">
        <title>4.2. Framework overview</title>
        <p>The schematic representation of our proposed framework is depicted in <xref ref-type="fig" rid="fig2">Figure 2</xref>. When an image-API pair (<italic>x<sub>I</sub></italic>, <italic>x<sub>A</sub></italic>) is derived from a malware file using the methodologies outlined in Section 3, it is processed by dual encoders for feature extraction. Specifically, we propose employing a visual encoder <italic>f<sub>I</sub></italic>(·) to extract features from the image <italic>x<sub>I</sub></italic>, and a textual encoder <italic>f<sub>A</sub></italic>(·) for the API sequence <italic>x<sub>A</sub></italic>. Subsequently, a multimodal feature fusion module is employed to integrate the feature information extracted from the grayscale image and API invocation sequence. For feature fusion, we segment the normalized features, apply a GNN to integrate them, and finally employ a prototypical network<sup>[<xref ref-type="bibr" rid="B8">8</xref>]</sup> for classification.</p>
        <p>
          <bold>Support and Query Set Processing:</bold> For each task, the support set samples <italic>S</italic> are encoded (<italic>E<sub>I</sub><sup>S</sup></italic>, <italic>E<sub>A</sub><sup>S</sup></italic>), normalized using support set statistics, segmented, and fused via the GNN to produce <italic>V<sup>S</sup></italic>, which are aggregated into prototypes <italic>p<sub>k</sub></italic>. The query set samples <italic>Q</italic> undergo the same encoding, normalization (using the same support set statistics), and fusion to produce <italic>V<sup>Q</sup></italic>, which is compared with the prototypes via Euclidean distance: <inline-formula><tex-math id="M1">$$ \hat{y} $$</tex-math></inline-formula> = argmin<italic><sub>k</sub> </italic>||<italic>V<sup>Q</sup></italic> - <italic>p<sub>k</sub></italic>||<sub>2</sub>.</p>
      </sec>
      <sec id="sec4-3">
        <title>4.3. Feature extractor</title>
        <p>To address the limited data in few-shot scenarios, we design a lightweight feature extractor for each of the two modalities to reduce the risk of overfitting.</p>
        <sec id="sec4-3-1">
          <title>4.3.1. Malware image encoder</title>
          <p>In Section 3.2, we discuss the methodology for converting malware into grayscale images. For the converted image <italic>x<sub>I</sub></italic>, a series of data augmentations - including resizing with a random crop and random rotation - is employed to enhance the diversity of the training images. For the embedding module, we choose the CNN4 backbone network<sup>[<xref ref-type="bibr" rid="B55">55</xref>]</sup>, which is commonly used in few-shot image classification. Our CNN4 implementation differs slightly from the conventional architecture: it uses 32 channels in the first convolutional layer instead of the standard 64. This adjustment accounts for the input being grayscale images, initially containing only a single channel rather than the three channels found in RGB images. Gradually increasing the number of channels enhances feature extraction while reducing computational demands. Additionally, a residual connection is introduced between the last two CNN layers, a modification that has been shown to stabilize training and expedite model convergence. In the final layer, an AdaptiveMaxPool with dimensions (2, 4) is applied - considering that grayscale images contain more semantic information horizontally than vertically - resulting in an output feature dimension of 512. The structure of the Image Encoder is illustrated in <xref ref-type="fig" rid="fig3">Figure 3A</xref>.</p>
          <fig id="fig3" position="float">
            <label>Figure 3</label>
            <caption>
              <p>Proposed feature extractors for both modalities. (A) CNN4 architecture for malware image encoding, featuring 32 → 64 → 128 channel progression with residual connections and adaptive pooling; (B) Simple Transformer architecture for API sequence encoding, with two hidden layers and two attention heads. CNN: Convolutional neural network; API: application programming interface.</p>
            </caption>
            <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="ir6013.fig.3.jpg" />
          </fig>
        </sec>
        <sec id="sec4-3-2">
          <title>4.3.2. Malware API Encoder</title>
          <p>In Section 3.3, we describe the preprocessing methods... and obtain embeddings of each malware sample’s API call sequence using word2vec. Following the lightweight design of the Image Encoder, we propose a streamlined Transformer architecture<sup>[<xref ref-type="bibr" rid="B49">49</xref>]</sup> as the API encoder <italic>f<sub>A</sub></italic>(·), featuring only two hidden layers, each with two attention heads. Specifically, the embedding dimension for each token is set to 300, and the dimension of the feedforward network is also set to 300 to align with the output dimension of the Image Encoder. We use a dropout rate of 0.5 to mitigate overfitting. The architecture of the API Encoder is depicted in <xref ref-type="fig" rid="fig3">Figure 3B</xref>.</p>
        </sec>
      </sec>
      <sec id="sec4-4">
        <title>4.4. GNN-based feature fusion module</title>
        <p>The preceding section describes how features are extracted from the grayscale images and API call sequences of malware. This section presents the design of our lightweight GNN-based feature fusion module. First, we normalize the pooled features of each modality independently and then segment them into discrete chunks, each representing a node in a graph. We then establish edges between nodes and perform message passing across the graph to achieve feature fusion. Finally, we employ a prototypical network to calculate the distances between samples and prototypes in the fused feature space, thereby enabling classification. The pseudocode for training the proposed GNN-based feature fusion module over a single epoch is presented in <xref ref-type="fig" rid="alg1">Algorithm 1</xref>.</p>
       <fig id="alg1" position="float">
            <label>Algorithm 1</label>
            <caption />
            <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="ir6013.alg.1.jpg" />
          </fig>
        <p>
          <bold>Complexity Analysis:</bold> The computational complexity of <xref ref-type="fig" rid="alg1">Algorithm 1</xref> is dominated by the feature extraction and GNN fusion steps. For a batch with <italic>N</italic> support samples and <italic>M</italic> query samples, the image encoder <italic>f<sub>I</sub></italic> has complexity <italic>O</italic>((<italic>N</italic> + <italic>M</italic>)·<italic>C<sub>I</sub></italic>), where <italic>C<sub>I</sub></italic> is the CNN4 complexity (<italic>O</italic>(<italic>H</italic>·<italic>W</italic>·<italic>D</italic>)). The API encoder <italic>f<sub>A</sub></italic> has complexity <italic>O</italic>((<italic>N</italic> + <italic>M</italic>)·<italic>L</italic>·<italic>D</italic>·<italic>H</italic>), where <italic>L</italic> is the sequence length and <italic>H</italic> is the number of attention heads (not the <italic>H</italic> that appears in <italic>C<sub>I</sub></italic>). The GNN fusion module operates on 2<italic>N</italic> segments (where, in the fusion terms only, <italic>N</italic> denotes the number of segments per modality rather than the number of support samples) with complexity <italic>O</italic>(<italic>N</italic><sup>2</sup>·<italic>D</italic>) for attention computation and <italic>O</italic>(<italic>N</italic>·<italic>D</italic><sup>2</sup>) for message passing. Overall, the per-epoch complexity is <italic>O</italic>((<italic>N</italic> + <italic>M</italic>)·(<italic>C<sub>I</sub></italic> + <italic>L</italic>·<italic>D</italic>·<italic>H</italic>) + <italic>N</italic><sup>2</sup>·<italic>D</italic> + <italic>N</italic>·<italic>D</italic><sup>2</sup>), where <italic>N</italic> in the last two terms is the segment count; the cost scales linearly with batch size and quadratically with the number of segments. In practice, with <italic>N</italic> = 4 segments and <italic>D</italic> = 512, the fusion overhead is negligible compared to feature extraction.</p>
        <sec id="sec4-4-1">
          <title>4.4.1. Normalization</title>
          <p>After extracting features from images and API sequences, both modalities are expected to contribute significantly to the final few-shot classification performance. However, as shown in <xref ref-type="fig" rid="fig4">Figure 4</xref>, the numerical ranges of features extracted by the Image Encoder and API Encoder differ significantly before fusion. These disparities would disproportionately bias the fusion process toward the modality with larger numerical values, even though the magnitude does not reflect the quality of the features, which could harm classification performance. To mitigate this issue and map the features of both modalities to a uniform scale, we standardize the features of both modalities to a mean of 0 and a standard deviation of 1, thereby aligning them with the standard normal distribution.</p>
          <fig id="fig4" position="float">
            <label>Figure 4</label>
            <caption>
              <p>Comparison of value distribution before and after normalization of Image features and API sequence features. The histograms are computed from 1,000 randomly sampled feature vectors from the VirusShare-M validation set, showing the distribution shift after applying the proposed modality-specific normalization. API: Application programming interface.</p>
            </caption>
            <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="ir6013.fig.4.jpg" />
          </fig>
          <p>Because the support set samples are visible to the model in every few-shot task, we use the mean and standard deviation of the support set’s features as prior information and apply them to the query set as well. This approach ensures that there is no information leakage between samples in the query set. Specifically, for the image features of the support set <italic>E<sub>I</sub><sup>S</sup></italic>, the mean and standard deviation are calculated according to Equations (2) and (3):</p>
          <p><disp-formula> <label>(2)</label> <tex-math id="E1"> $$  \mu_s = \frac{1}{n}\sum_{i=1}^n E_{I,i}^S $$ </tex-math></disp-formula></p>
		  <p><disp-formula> <label>(3)</label> <tex-math id="E1"> $$  \sigma_s = \sqrt{\frac{1}{n}\sum_{i=1}^n (E_{I,i}^S - \mu_s)^2} $$ </tex-math></disp-formula></p>
          <p>Here, <italic>μ<sub>s</sub></italic> and <italic>σ<sub>s</sub></italic> denote the mean and standard deviation of the sample features within the support set <italic>S</italic>, <italic>n</italic> is the number of samples in <italic>S</italic>, and <italic>E<sub>I,i</sub><sup>S</sup></italic> is the feature of the <italic>i</italic>-th sample within <italic>S</italic>. These statistics are then used to normalize the features of both the support set and query set samples:</p>
          <p><disp-formula> <label>(4)</label> <tex-math id="E1"> $$  E_I^S = (E_I^S - \mu_s) / \sigma_s $$ </tex-math></disp-formula></p>
		  <p><disp-formula> <label>(5)</label> <tex-math id="E1"> $$  E_I^Q = (E_I^Q - \mu_s) / \sigma_s $$ </tex-math></disp-formula></p>
          <p>As depicted in <xref ref-type="fig" rid="fig4">Figure 4</xref>, normalization of the two modalities’ features results in a closer numerical range alignment, while preserving the relative magnitude of the original features.</p>
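          <p>A minimal sketch of Equations (2)-(5) in plain Python is given below for illustration. It assumes that the statistics are computed per feature dimension and adds a small constant <italic>eps</italic> to the standard deviation to avoid division by zero; both choices, as well as the function names, are assumptions rather than details of our implementation.</p>
          <preformat>
```python
import math


def support_stats(support, eps=1e-8):
    """Per-dimension mean and standard deviation of the support features."""
    n, dim = len(support), len(support[0])
    mu = [sum(row[d] for row in support) / n for d in range(dim)]
    sigma = [math.sqrt(sum((row[d] - mu[d]) ** 2 for row in support) / n) + eps
             for d in range(dim)]
    return mu, sigma


def normalize(features, mu, sigma):
    """Apply (x - mu_s) / sigma_s to every feature vector."""
    return [[(x - m) / s for x, m, s in zip(row, mu, sigma)] for row in features]


# Toy features: two support samples and one query sample, two dimensions each.
support = [[1.0, 10.0], [3.0, 30.0]]
query = [[2.0, 40.0]]
mu, sigma = support_stats(support)
support_norm = normalize(support, mu, sigma)
query_norm = normalize(query, mu, sigma)  # query reuses the support statistics
```
</preformat>
          <p>The query features never contribute to <italic>μ<sub>s</sub></italic> or <italic>σ<sub>s</sub></italic>, so normalizing one query sample cannot reveal anything about another.</p>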
        </sec>
        <sec id="sec4-4-2">
          <title>4.4.2. Feature fusion GNN</title>
          <p>We first describe how to construct the graph <italic>G</italic> on the basis of features from two modalities. Initially, we segment both the image embedding <italic>I</italic> and the API embedding <italic>A</italic>, both elements of ℝ<sup>1×</sup><italic><sup>D</sup></italic>, into multiple parts. Each segment corresponds to a node within the graph, resulting in the node set <italic>V</italic> = [<italic>I</italic><sub>1</sub>, <italic>I</italic><sub>2</sub>, …, <italic>I<sub>N</sub></italic>, <italic>A</italic><sub>1</sub>, <italic>A</italic><sub>2</sub>, …, <italic>A<sub>N</sub></italic>]. Each node <italic>V<sub>i</sub></italic> embodies the feature information of a specific segment and is an element of ℝ<sup>1×</sup><italic><sup>K</sup></italic>, where <italic>K</italic> denotes the length of each segment and is calculated as <italic>K</italic> = <italic>D</italic>//<italic>N</italic>. To facilitate the fusion of information across modalities, we introduce edges exclusively between segments of different modalities, thereby constructing a bipartite graph. Notably, segments within the same modality do not connect via edges. The graphical representation of this construction is depicted in <xref ref-type="fig" rid="fig2">Figure 2</xref>.</p>
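          <p>The graph construction can be illustrated with the following short sketch, in which the helper names <italic>segment</italic> and <italic>bipartite_mask</italic> are hypothetical. Nodes 0 to <italic>N</italic> - 1 are taken to be image segments and nodes <italic>N</italic> to 2<italic>N</italic> - 1 API segments; when <italic>D</italic> is not divisible by <italic>N</italic>, this sketch simply drops the trailing dimensions, which is an assumption.</p>
          <preformat>
```python
def segment(vec, n_seg):
    """Split a D-dimensional feature into n_seg chunks of length K = D // n_seg."""
    k = len(vec) // n_seg
    return [vec[i * k:(i + 1) * k] for i in range(n_seg)]


def bipartite_mask(n_seg):
    """mask[i][j] is True only when nodes i and j come from different modalities."""
    size = 2 * n_seg
    return [[(i >= n_seg) != (j >= n_seg) for j in range(size)]
            for i in range(size)]


image_feature = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # toy I, D = 8
api_feature = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8]     # toy A, D = 8
nodes = segment(image_feature, 4) + segment(api_feature, 4)  # V, 2N nodes
mask = bipartite_mask(4)
```
</preformat>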
          <p>We then apply a graph message passing operation to the node set <italic>V</italic>. This yields the transformed feature set <italic>V</italic>′, which contains the updated node features after message passing.</p>
          <p><disp-formula> <label>(6)</label> <tex-math id="E1"> $$  V' = \sigma(\mathcal{A}VW_{in})W_{out} + V $$ </tex-math></disp-formula></p>
          <p>where <italic>W<sub>in</sub></italic> and <italic>W<sub>out</sub></italic> are the weights of fully-connected layers, <italic>σ</italic> is the activation function, <inline-formula><tex-math id="M1">$$ \mathcal{A} $$</tex-math></inline-formula> is the normalized adjacency matrix that is obtained by:</p>
          <p><disp-formula> <label>(7)</label> <tex-math id="E1"> $$  \mathcal{A} = \mathrm{Softmax}(\mathrm{LeakyReLU}(A)) $$ </tex-math></disp-formula></p>
          <p>where the adjacency matrix <italic>A</italic> represents the connections of nodes within a graph. Next, we introduce how to generate <italic>A</italic>.</p>
          <p>We define <italic>e<sub>ij</sub></italic> as a measure of the relationship between <italic>V<sub>i</sub></italic> and <italic>V<sub>j</sub></italic>. The underlying premise is that the closer or more relevant the nodes are to each other, the larger the value of <italic>e<sub>ij</sub></italic> should be. The formula for this relationship measure can be represented as follows:</p>
          <p><disp-formula> <label>(8)</label> <tex-math id="E1"> $$  e_{ij} = f_\theta(V_i, V_j) $$ </tex-math></disp-formula></p>
          <p>Here, <italic>f<sub>θ</sub></italic> is a learnable function that measures the relationship between two nodes, where <italic>θ</italic> denotes its learnable parameters.</p>
          <p>The proximity between feature segments from two distinct modalities suggests similarity in their semantic content. For instance, an API call sequence from a malware sample corresponds to a binary code segment, which in turn corresponds to a patch in the converted grayscale image. We posit that if the model can identify that the features of the API call sequence and the patch in the grayscale image are a match, then it is possible to amalgamate the information from both modalities, thereby obtaining a deeper representation of features.</p>
          <p>To learn such semantic representations while ensuring that the relationship is symmetric, i.e., <italic>e<sub>ij</sub></italic> = <italic>e<sub>ji</sub></italic>, we apply a multilayer perceptron (MLP) to the absolute difference between the two node vectors:</p>
          <p><disp-formula> <label>(9)</label> <tex-math id="E1"> $$ f_\theta(x, y) = \mathrm{MLP}_\theta(\mathrm{abs}(x-y)) $$ </tex-math></disp-formula></p>
          <p>where MLP represents a multilayer perceptron architecture, adopting a structure similar to<sup>[<xref ref-type="bibr" rid="B7">7</xref>]</sup>. Here, “abs” denotes the absolute value function.</p>
          <p>Once the pairwise relationship measures <italic>e<sub>ij</sub></italic> between segments have been computed, we assemble them into the adjacency matrix <italic>A</italic>:</p>
          <p><disp-formula> <label>(10)</label> <tex-math id="E1"> $$  A=\{e_{ij}|1\leq i,j\leq 2N\} $$ </tex-math></disp-formula></p>
          <p>Following the construction of the adjacency matrix, message passing operations as described in Equation (6) facilitate the infusion of API features into image feature segments and vice versa. Subsequently, each segment is concatenated, resulting in a fused multimodal feature set <italic>V</italic>′ = <italic>concat</italic>[<italic>V</italic><sub>1</sub>′, ...] ∈ ℝ<sup>1×2</sup><italic><sup>D</sup></italic>, which incorporates the combined features of both modalities.</p>
          <p>This design is inspired by the observation that malware behaviors manifest across both static code structures and dynamic API invocations. By segmenting features and constructing a bipartite graph between modality segments, our model can capture cross-modal correlations that are missed by global feature fusion approaches. For instance, a specific API call pattern related to registry manipulation may align with image regions corresponding to code sections that perform similar operations. The message passing mechanism allows these complementary signals to reinforce each other, leading to more discriminative features for few-shot classification.</p>
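          <p>For illustration, Equations (6)-(10) can be sketched in plain Python as follows. This is a simplified stand-in rather than our implementation: the MLP of Equation (9) is replaced by a single linear layer with weights <italic>theta</italic>, the activation <italic>σ</italic> in Equation (6) is assumed to be ReLU, and the softmax of Equation (7) is assumed to be taken row-wise over cross-modal neighbours only, which is one way to realize the bipartite structure. All function and parameter names are hypothetical.</p>
          <preformat>
```python
import math


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def matvec(vec, mat):
    """Row vector times matrix, with the matrix given as a list of rows."""
    return [sum(v * row[c] for v, row in zip(vec, mat))
            for c in range(len(mat[0]))]


def edge_score(x, y, theta):
    """Equations (8)-(9) with a one-layer linear stand-in for the MLP."""
    return sum(t * abs(a - b) for t, a, b in zip(theta, x, y))


def fuse(nodes, n_seg, theta, w_in, w_out):
    """One round of message passing, V' = sigma(A V W_in) W_out + V."""
    size, k = len(nodes), len(nodes[0])
    fused = []
    for i in range(size):
        # Bipartite graph: node i only attends to the other modality.
        nbrs = [j for j in range(size) if (i >= n_seg) != (j >= n_seg)]
        scores = [leaky_relu(edge_score(nodes[i], nodes[j], theta)) for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        attn = [e / sum(exps) for e in exps]      # row i of the normalized A
        agg = [sum(a * nodes[j][d] for a, j in zip(attn, nbrs)) for d in range(k)]
        hidden = [max(0.0, h) for h in matvec(agg, w_in)]   # ReLU (assumed)
        out = matvec(hidden, w_out)
        fused.append([o + v for o, v in zip(out, nodes[i])])  # residual term + V
    return [x for node in fused for x in node]    # concatenate all 2N segments


# Toy example: N = 2 segments per modality, K = 2 dimensions per segment.
nodes = [[1.0, 0.0], [0.0, 1.0],     # image segments I_1, I_2
         [1.0, 0.0], [0.0, 1.0]]     # API segments A_1, A_2
identity = [[1.0, 0.0], [0.0, 1.0]]
theta = [-1.0, -1.0]                 # closer segments receive larger scores
fused = fuse(nodes, 2, theta, identity, identity)
```
</preformat>
          <p>Using the absolute difference makes the edge score symmetric by construction, and the residual term preserves each segment’s original information when the learned update is small.</p>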
        </sec>
      </sec>
      <sec id="sec4-5">
        <title>4.5. Prototypical network</title>
        <p>We then use the fused multimodal features for few-shot malware classification, adopting the standard prototypical network<sup>[<xref ref-type="bibr" rid="B8">8</xref>]</sup> as the classifier. As shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>, after obtaining the fused feature <italic>V</italic>′, we generate the prototype for each class as:</p>
        <p><disp-formula> <label>(11)</label> <tex-math id="E1"> $$  p_k=\frac{1}{|S_k|}\sum_{V^S_i\in S_k}V^S_i $$ </tex-math></disp-formula></p>
        <p>Here, <italic>S<sub>k</sub></italic> denotes the set of samples belonging to class <italic>k</italic>, and <italic>V<sub>i</sub><sup>S</sup></italic> represents the features of the <italic>i</italic>-th sample within the support set.</p>
        <p>Subsequently, we compute the probability corresponding to each class for a given query vector <italic>V<sub>i</sub><sup>Q</sup></italic> as follows:</p>
        <p><disp-formula> <label>(12)</label> <tex-math id="E1"> $$  \mathrm{prob}(y=k|V_i^Q)=\frac{\exp(-d(V_i^Q,p_k))}{\sum_{k^{'}}\exp(-d(V_i^Q,p_{k^{'}}))} $$ </tex-math></disp-formula></p>
        <p>where <italic>d</italic> is the distance function, for which we have selected the Euclidean distance. The class receiving the highest probability will be designated as the final malware classification category.</p>
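        <p>Equations (11) and (12) can be sketched as follows, assuming that the fused features are given as plain lists of floats; the function names and toy values are hypothetical. Subtracting the smallest distance before exponentiation is a numerical-stability device that leaves the probabilities unchanged.</p>
        <preformat>
```python
import math


def prototypes(support, labels):
    """Equation (11): mean fused feature of the support samples of each class."""
    sums, counts = {}, {}
    for vec, y in zip(support, labels):
        acc = sums.setdefault(y, [0.0] * len(vec))
        for d, x in enumerate(vec):
            acc[d] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def class_probs(query, protos):
    """Equation (12): softmax over negative Euclidean distances to prototypes."""
    dists = {y: math.dist(query, p) for y, p in protos.items()}
    low = min(dists.values())
    exps = {y: math.exp(-(d - low)) for y, d in dists.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


support = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
labels = [0, 0, 1, 1]
protos = prototypes(support, labels)
probs = class_probs([1.0, 1.0], protos)
predicted = max(probs, key=probs.get)
```
</preformat>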
        <p>The training model’s loss function is composed of three parts: besides the cross-entropy loss of the predicted probabilities in the prototypical network using multimodal features <italic>L<sub>M</sub></italic>, it also includes the losses <italic>L<sub>I</sub></italic> and <italic>L<sub>A</sub></italic> for the predicted probabilities in the prototypical network using unimodal features, respectively (as shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>). The composite loss function is expressed as:</p>
        <p><disp-formula> <label>(13)</label> <tex-math id="E1"> $$  L = \lambda_0 L_M + \lambda_1 L_I + \lambda_2 L_A $$ </tex-math></disp-formula></p>
        <p>The calculations for <italic>L<sub>M</sub></italic>, <italic>L<sub>I</sub></italic> and <italic>L<sub>A</sub></italic> are analogous; here, <italic>L<sub>M</sub></italic> is provided as an example:</p>
        <p><disp-formula> <label>(14)</label> <tex-math id="E1"> $$  L_M = -\sum_{k=1}^Cy_k\log(\mathrm{prob}(y=k|V^Q)) $$ </tex-math></disp-formula></p>
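        <p>As a worked illustration of Equations (13) and (14), the sketch below computes the composite loss for a single query sample. Because <italic>y<sub>k</sub></italic> is one-hot, only the log-probability of the true class contributes to each term. The weights passed as <italic>lambdas</italic> and the probability values are placeholders chosen for the example, not the settings used in our experiments.</p>
        <preformat>
```python
import math


def cross_entropy(probs, true_class):
    """Equation (14) for a one-hot label: minus the log-probability of the true class."""
    return -math.log(probs[true_class])


def composite_loss(p_multi, p_image, p_api, true_class, lambdas=(1.0, 1.0, 1.0)):
    """Equation (13): weighted sum of the multimodal and two unimodal losses."""
    lam0, lam1, lam2 = lambdas
    return (lam0 * cross_entropy(p_multi, true_class)
            + lam1 * cross_entropy(p_image, true_class)
            + lam2 * cross_entropy(p_api, true_class))


# Placeholder class probabilities from the three prototypical-network heads.
p_multi = {0: 0.8, 1: 0.2}
p_image = {0: 0.6, 1: 0.4}
p_api = {0: 0.5, 1: 0.5}
loss = composite_loss(p_multi, p_image, p_api, true_class=0)
```
</preformat>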
      </sec>
    </sec>
    <sec id="sec5">
      <title>5. EXPERIMENTS</title>
      <p>We mainly evaluate our method on two datasets: VirusShare-M and LargePE-M. A detailed description of the datasets is given in Section 3.</p>
      <sec id="sec5-1">
        <title>5.1. Baseline models</title>
        <p>We compare our method against the baselines listed below, which include unimodal methods that use only API sequences or only images, as well as simple multimodal feature fusion methods.</p>
        <p>• <bold>ProtoNet</bold><sup>[<xref ref-type="bibr" rid="B8">8</xref>]</sup> is a classic few-shot learning method that learns a prototype representation for each class. It is applicable to both the API and image modalities.</p>
        <p>• <bold>HybridAttentionNet</bold><sup>[<xref ref-type="bibr" rid="B56">56</xref>]</sup> uses instance-level and feature attention modules to better weight query-related samples and feature dimensions. It is applicable to both modalities.</p>
        <p>• <bold>API Frequency Histogram</bold><sup>[<xref ref-type="bibr" rid="B57">57</xref>]</sup> counts API invocation frequencies as feature vectors for malware samples.</p>
        <p>• <bold>Pixel k-NN</bold> classifies malware based on image pixels using the k-nearest neighbors algorithm.</p>
        <p>• <bold>Early fusion</bold> concatenates features from different modalities before feeding them into the prototypical network.</p>
        <p>• <bold>Late fusion</bold> averages the final logits from all modalities.</p>
        <p>• <bold>Butterfly Vision Transformer (BViT)</bold><sup>[<xref ref-type="bibr" rid="B58">58</xref>]</sup><bold>:</bold> A ViT architecture for visualization-based malware classification that captures both local and global spatial representations via butterfly construction enabling parallel patch processing. We evaluate BViT/B16 (174.2 M params) and BViT/L16 (425.6 M params).</p>
        <p>• <bold>DREAM</bold><sup>[<xref ref-type="bibr" rid="B38">38</xref>]</sup><bold>:</bold> A concept drift detection system that embeds malware behavioral concepts in a contrastive autoencoder latent space, enabling model-sensitive drift detection without training data access during testing. We adapted DREAM to work with our grayscale images.</p>
        <p>• <bold>MalFSCIL</bold><sup>[<xref ref-type="bibr" rid="B39">39</xref>]</sup><bold>:</bold> A few-shot class-incremental learning framework using VAE for feature enhancement and graph attention networks for dynamic prototype-based boundary delineation to address catastrophic forgetting. It converts binaries to grayscale images and encodes them with a ResNet18 backbone.</p>
      </sec>
      <sec id="sec5-2">
        <title>5.2. Experimental setup</title>
        <p>Following common practice in few-shot experiments, we set N ∈ {5, 10} and K ∈ {5, 10} (notation from Section 4.1), resulting in four combinations for the VirusShare-M dataset, and N ∈ {5, 10} with K = 5 for LargePE-M because of its limited number of samples. We use the Adam optimizer with a learning rate of 5<italic>e</italic><sup>-5</sup> for the multimodal and API-sequence-only models and 1<italic>e</italic><sup>-4</sup> for the image-only models; the weight decay is set to 1<italic>e</italic><sup>-4</sup>. All networks are trained for 5 × 10<sup>4</sup> epochs, and the checkpoint with the lowest loss on the meta-validation set is saved for testing. All methods are implemented in PyTorch<sup>[<xref ref-type="bibr" rid="B59">59</xref>]</sup>, and all experiments are run on a machine with an Intel(R) Core(TM) i9-10900X CPU and four NVIDIA GeForce RTX 2080Ti GPUs.</p>
      </sec>
      <sec id="sec5-3">
        <title>5.3. Experimental results</title>
        <p>We present the main results of our method and baselines in <xref ref-type="table" rid="t3">Table 3</xref>. We train the models with three different random seeds (0, 1, and 2) and report the average as the overall performance.</p>
        <table-wrap id="t3">
          <label>Table 3</label>
          <caption>
            <p>The accuracy and 95% confidence interval of our proposed method and baselines on the two datasets</p>
          </caption>
          <table frame="hsides" rules="groups">
            <thead>
              <tr>
                <td colspan="2">
                  <bold>Dataset</bold>
                </td>
                <td colspan="4">
                  <bold>VirusShare-M</bold>
                </td>
                <td colspan="2">
                  <bold>LargePE-M</bold>
                </td>
              </tr>
              <tr>
                <td rowspan="2">
                  <bold>Model</bold>
                </td>
                <td rowspan="2">
                  <bold>Modality</bold>
                </td>
                <td colspan="2">
                  <bold>5-shot</bold>
                </td>
                <td colspan="2">
                  <bold>10-shot</bold>
                </td>
                <td colspan="2">
                  <bold>5-shot</bold>
                </td>
              </tr>
              <tr>
                <td style="border-bottom:1;">
                  <bold>5-way</bold>
                </td>
                <td style="border-bottom:1;">
                  <bold>10-way</bold>
                </td>
                <td style="border-bottom:1;">
                  <bold>5-way</bold>
                </td>
                <td style="border-bottom:1;">
                  <bold>10-way</bold>
                </td>
                <td style="border-bottom:1;">
                  <bold>5-way</bold>
                </td>
                <td style="border-bottom:1;">
                  <bold>10-way</bold>
                </td>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Pixel k-NN</td>
                <td>I</td>
                <td>72.15 ± 1.12</td>
                <td>63.30 ± 1.22</td>
                <td>74.31 ± 1.35</td>
                <td>66.73 ± 0.97</td>
                <td>72.94 ± 1.06</td>
                <td>66.37 ± 1.21</td>
              </tr>
              <tr>
                <td>HybridAttentionNet</td>
                <td>I</td>
                <td>81.11 ± 0.25</td>
                <td>74.11 ± 0.25</td>
                <td>82.57 ± 0.31</td>
                <td>74.51 ± 0.29</td>
                <td>83.28 ± 0.24</td>
                <td>74.66 ± 0.32</td>
              </tr>
              <tr>
                <td>ProtoNet</td>
                <td>I</td>
                <td>83.27 ± 0.15</td>
                <td>75.74 ± 0.18</td>
                <td>84.16 ± 0.19</td>
                <td>79.80 ± 0.16</td>
                <td>84.56 ± 0.14</td>
                <td>75.87 ± 0.17</td>
              </tr>
              <tr>
                <td>BViT/B16</td>
                <td>I</td>
                <td>88.47 ± 0.12</td>
                <td>81.23 ± 0.15</td>
                <td>89.31 ± 0.11</td>
                <td>84.56 ± 0.13</td>
                <td>88.94 ± 0.10</td>
                <td>80.43 ± 0.14</td>
              </tr>
              <tr>
                <td>BViT/L16</td>
                <td>I</td>
                <td>89.82 ± 0.10</td>
                <td>82.67 ± 0.13</td>
                <td>90.45 ± 0.09</td>
                <td>85.89 ± 0.11</td>
                <td>90.17 ± 0.09</td>
                <td>81.76 ± 0.12</td>
              </tr>
              <tr>
                <td>DREAM</td>
                <td>I</td>
                <td>90.15 ± 0.09</td>
                <td>83.42 ± 0.12</td>
                <td>91.23 ± 0.08</td>
                <td>86.71 ± 0.10</td>
                <td>91.08 ± 0.08</td>
                <td>82.94 ± 0.11</td>
              </tr>
              <tr>
                <td>MalFSCIL</td>
                <td>I</td>
                <td>90.58 ± 0.08</td>
                <td>84.27 ± 0.11</td>
                <td>92.65 ± 0.07</td>
                <td>88.15 ± 0.09</td>
                <td>91.87 ± 0.07</td>
                <td>84.56 ± 0.10</td>
              </tr>
              <tr>
                <td>API Frequency Hist.</td>
                <td>A</td>
                <td>85.22 ± 0.33</td>
                <td>81.40 ± 0.21</td>
                <td>87.39 ± 0.27</td>
                <td>85.17 ± 0.24</td>
                <td>86.21 ± 0.25</td>
                <td>78.53 ± 0.29</td>
              </tr>
              <tr>
                <td>HybridAttentionNet</td>
                <td>A</td>
                <td>84.14 ± 0.22</td>
                <td>80.11 ± 0.37</td>
                <td>87.77 ± 0.25</td>
                <td>83.96 ± 0.27</td>
                <td>85.78 ± 0.23</td>
                <td>77.98 ± 0.31</td>
              </tr>
              <tr>
                <td>ProtoNet</td>
                <td>A</td>
                <td>85.32 ± 0.14</td>
                <td>79.70 ± 0.16</td>
                <td>86.60 ± 0.13</td>
                <td>82.44 ± 0.15</td>
                <td>86.12 ± 0.13</td>
                <td>77.11 ± 0.16</td>
              </tr>
              <tr>
                <td>Early fusion</td>
                <td>A + I</td>
                <td>91.47 ± 0.10</td>
                <td>86.70 ± 0.14</td>
                <td>93.47 ± 0.09</td>
                <td>88.25 ± 0.12</td>
                <td>94.02 ± 0.07</td>
                <td>91.76 ± 0.12</td>
              </tr>
              <tr>
                <td>Late fusion</td>
                <td>A + I</td>
                <td>92.01 ± 0.10</td>
                <td>84.22 ± 0.14</td>
                <td>94.18 ± 0.08</td>
                <td>87.78 ± 0.13</td>
                <td>94.46 ± 0.06</td>
                <td>92.28 ± 0.19</td>
              </tr>
              <tr>
                <td>BViT/B16 + API (Late)</td>
                <td>A + I</td>
                <td>93.84 ± 0.09</td>
                <td>88.91 ± 0.12</td>
                <td>94.73 ± 0.08</td>
                <td>89.62 ± 0.10</td>
                <td>95.21 ± 0.07</td>
                <td>92.85 ± 0.09</td>
              </tr>
              <tr>
                <td>BViT/L16 + API (Late)</td>
                <td>A + I</td>
                <td>94.68 ± 0.08</td>
                <td>89.85 ± 0.11</td>
                <td>95.41 ± 0.07</td>
                <td>90.77 ± 0.09</td>
                <td>95.98 ± 0.06</td>
                <td>93.42 ± 0.08</td>
              </tr>
              <tr>
                <td>DREAM + API (Late)</td>
                <td>A + I</td>
                <td>94.92 ± 0.08</td>
                <td>90.23 ± 0.10</td>
                <td>95.87 ± 0.06</td>
                <td>91.34 ± 0.08</td>
                <td>96.24 ± 0.05</td>
                <td>93.71 ± 0.07</td>
              </tr>
              <tr>
                <td>MalFSCIL + API (Late)</td>
                <td>A + I</td>
                <td>95.23 ± 0.07</td>
                <td>91.05 ± 0.09</td>
                <td>96.42 ± 0.05</td>
                <td>92.18 ± 0.07</td>
                <td>96.87 ± 0.04</td>
                <td>94.25 ± 0.06</td>
              </tr>
              <tr>
                <td>GNN fusion (Ours)</td>
                <td>A + I</td>
                <td>95.73 ± 0.07</td>
                <td>92.04 ± 0.11</td>
                <td>97.31 ± 0.05</td>
                <td>94.89 ± 0.09</td>
                <td>96.59 ± 0.06</td>
                <td>93.73 ± 0.09</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn>
              <p>I and A stand for image and API respectively. API: Application programming interface; GNN: graph neural network.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
        <p>We find that a simple few-shot learning framework such as the prototypical network (ProtoNet) performs consistently across modalities, which supports its use as the backbone.</p>
        <p>For unimodal models built on the same backbone, API sequences provide more effective features than malware grayscale images. For example, ProtoNet applied to images achieves only 83.27% accuracy in the 5-way 5-shot setting on the VirusShare-M dataset, while the same model applied to API sequences reaches 85.32%. Recently proposed vision transformer-based methods, such as BViT/B16 and BViT/L16, substantially improve image-based performance, achieving 88.47% and 89.82%, respectively, on the same task. This suggests that transformer architectures are better at capturing spatial representations of malware images.</p>
        <p>More recent advances in malware concept drift detection and few-shot class-incremental learning have demonstrated even stronger performance. DREAM<sup>[<xref ref-type="bibr" rid="B38">38</xref>]</sup>, a concept-based drift detection system for Android malware, achieves 90.15% accuracy when adapted to our image-based few-shot setting. Its model-sensitive concept learning approach enables more effective detection of new malware families. MalFSCIL<sup>[<xref ref-type="bibr" rid="B39">39</xref>]</sup>, a few-shot class-incremental learning framework specifically designed for malware detection, further improves image-based performance to 90.58% on VirusShare-M 5-way 5-shot. This method’s decoupled training strategy and VAE-based feature enhancement effectively mitigate catastrophic forgetting, making it particularly suitable for incremental learning scenarios.</p>
        <p>Our evaluation is conducted under a class-disjoint meta-learning protocol, in which the training, validation, and test sets contain disjoint malware families. This setup simulates the scenario where new families emerge and the model must generalize to them from limited support samples. However, it does not fully capture real-world zero-day scenarios, which may involve significant distribution shifts, concept drift, or entirely novel attack behaviors that differ substantially from the training data. Therefore, while our results demonstrate strong performance in few-shot settings, further evaluation under more challenging conditions, such as open-set recognition, domain adaptation, or temporal drift, would be valuable future work.</p>
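        <p>The class-disjoint protocol described above can be illustrated with a minimal sketch. This is not the code used in the experiments: the helper names <monospace>split_families</monospace> and <monospace>sample_episode</monospace>, the split sizes, the number of query samples, and the toy family names are all assumptions chosen for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def split_families(family_names, n_train, n_val, seed=0):
    """Partition family names into disjoint train/val/test pools (hypothetical helper)."""
    rng = random.Random(seed)
    names = sorted(family_names)
    rng.shuffle(names)
    return names[:n_train], names[n_train:n_train + n_val], names[n_train + n_val:]


def sample_episode(samples_by_family, families, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode: K support and n_query query samples per family."""
    chosen = rng.sample(families, n_way)
    support, query = [], []
    for label, family in enumerate(chosen):
        picked = rng.sample(samples_by_family[family], k_shot + n_query)
        support += [(sample, label) for sample in picked[:k_shot]]
        query += [(sample, label) for sample in picked[k_shot:]]
    return support, query


# Toy data: 25 made-up families with 30 sample identifiers each.
samples_by_family = {
    f"family_{i}": [f"family_{i}/sample_{j}" for j in range(30)] for i in range(25)
}
train_f, val_f, test_f = split_families(samples_by_family, n_train=15, n_val=5)

# A 5-way 5-shot test episode only ever sees families unseen during training.
support, query = sample_episode(
    samples_by_family, test_f, n_way=5, k_shot=5, n_query=10, rng=random.Random(1)
)
```
]]></preformat>
        <p>Because episodes at test time are drawn only from <monospace>test_f</monospace>, every class in a test episode is new to the model, which is the property the protocol relies on.</p>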
        <p>For multimodal baselines, we observe that even the simplest methods of multimodal integration, namely early fusion via concatenation and late fusion at the decision-making stage, outperform the unimodal ProtoNet models in every setting; on VirusShare-M 5-way 5-shot, for example, early fusion reaches 91.47% against 85.32% for API-only ProtoNet. This observation supports the effectiveness of leveraging multimodal information for few-shot malware classification tasks. When BViT/L16 is combined with API features via late fusion, accuracy improves further to 94.68%, showing that even strong unimodal encoders benefit from multimodal integration. DREAM and MalFSCIL, when combined with API features via late fusion, achieve 94.92% and 95.23% accuracy, respectively, approaching the performance of our proposed GNN fusion method.</p>
        <p>Our proposed GNN fusion module achieves the highest accuracy in all four VirusShare-M settings. It reaches 95.73% on VirusShare-M 5-way 5-shot, outperforming the BViT-based late fusion baseline (94.68%) and the more recent DREAM (94.92%) and MalFSCIL (95.23%) fusion variants. On LargePE-M, it outperforms all baselines except MalFSCIL + API (Late), which is slightly higher (96.87% vs. 96.59% for 5-way and 94.25% vs. 93.73% for 10-way). These results suggest that, while advanced single-modality methods like DREAM and MalFSCIL continue to push the boundaries of image-based malware classification, the GNN fusion module’s ability to model fine-grained cross-modal correlations at the segment level provides advantages over simple late fusion in most settings, even when the latter uses state-of-the-art image encoders. Notably, MalFSCIL’s VAE-based feature enhancement and DREAM’s concept reliability detection offer complementary strengths that could potentially be integrated with our fusion approach in future work. The following section on ablation studies and analysis discusses how the GNN fusion module operates and examines its computing cost.</p>
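        <p>To make the two baseline strategies concrete, the sketch below contrasts early and late fusion on top of a prototype-based classifier. It is an illustration under stated assumptions rather than the evaluated implementation: embeddings are plain Python lists, class probabilities come from a softmax over negative squared Euclidean distances to class means, and late fusion is assumed to average the per-modality probabilities (the exact decision rule is not specified in this section). All function names and toy values are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def mean_vec(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def proto_probs(support, labels, query, n_classes):
    """Class probabilities from distances to class prototypes (support means)."""
    protos = [
        mean_vec([s for s, y in zip(support, labels) if y == c])
        for c in range(n_classes)
    ]
    return softmax([-sq_dist(query, p) for p in protos])


def early_fusion_probs(img_sup, api_sup, labels, img_q, api_q, n_classes):
    """Early fusion: concatenate modality embeddings, then classify once."""
    fused_sup = [i + a for i, a in zip(img_sup, api_sup)]  # list concatenation
    return proto_probs(fused_sup, labels, img_q + api_q, n_classes)


def late_fusion_probs(img_sup, api_sup, labels, img_q, api_q, n_classes):
    """Late fusion (assumed rule): average the per-modality class probabilities."""
    p_img = proto_probs(img_sup, labels, img_q, n_classes)
    p_api = proto_probs(api_sup, labels, api_q, n_classes)
    return [(a + b) / 2 for a, b in zip(p_img, p_api)]


# Toy 2-way 2-shot episode with made-up 2-dimensional embeddings.
labels = [0, 0, 1, 1]
img_sup = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
api_sup = [[0.0, 0.0], [0.2, 0.2], [2.0, 2.0], [1.8, 2.2]]
img_q, api_q = [0.8, 0.9], [1.9, 2.0]

early = early_fusion_probs(img_sup, api_sup, labels, img_q, api_q, n_classes=2)
late = late_fusion_probs(img_sup, api_sup, labels, img_q, api_q, n_classes=2)
```
]]></preformat>
        <p>Neither strategy models interactions between individual image and API segments; both treat each modality as a single vector, which is the limitation that segment-level fusion is meant to address.</p>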
      </sec>
      <sec id="sec5-4">
        <title>5.4. Analysis and ablation study</title>
        <sec id="sec5-4-1">
          <title>5.4.1. Effect of normalization</title>
          <p>As indicated in <xref ref-type="table" rid="t4">Table 4</xref>, the normalization module proposed in Section 4.4.1 improves the classification accuracy of multimodal models: early fusion rises from 91.47% to 94.94%, and GNN fusion from 93.57% to 95.73%. This improvement aligns with the original intent behind introducing the module, which is to standardize the numerical ranges of features across multiple modalities to a common scale, as discussed in Section 4.4.1.</p>
          <table-wrap id="t4">
            <label>Table 4</label>
            <caption>
              <p>Ablation experiments on effect of normalization</p>
            </caption>
            <table frame="hsides" rules="groups">
              <thead>
                <tr>
                  <td style="border-bottom:1;">
                    <bold>Model</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>Image</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>API</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>Norm.</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>Accuracy</bold>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>ProtoNet</td>
                  <td>√</td>
                  <td />
                  <td />
                  <td>83.27 ± 0.15</td>
                </tr>
                <tr>
                  <td>ProtoNet</td>
                  <td>√</td>
                  <td />
                  <td>√</td>
                  <td>84.64 ± 0.14</td>
                </tr>
                <tr>
                  <td>ProtoNet</td>
                  <td />
                  <td>√</td>
                  <td />
                  <td>85.32 ± 0.14</td>
                </tr>
                <tr>
                  <td>ProtoNet</td>
                  <td />
                  <td>√</td>
                  <td>√</td>
                  <td>87.78 ± 0.12</td>
                </tr>
                <tr>
                  <td>Early fusion</td>
                  <td>√</td>
                  <td>√</td>
                  <td />
                  <td>91.47 ± 0.10</td>
                </tr>
                <tr>
                  <td>Early fusion</td>
                  <td>√</td>
                  <td>√</td>
                  <td>√</td>
                  <td>94.94 ± 0.08</td>
                </tr>
                <tr>
                  <td>GNN fusion</td>
                  <td>√</td>
                  <td>√</td>
                  <td />
                  <td>93.57 ± 0.08</td>
                </tr>
                <tr>
                  <td>GNN fusion</td>
                  <td>√</td>
                  <td>√</td>
                  <td>√</td>
                  <td>95.73 ± 0.07</td>
                </tr>
              </tbody>
            </table>
            <table-wrap-foot>
              <fn>
                <p>API: Application programming interface; GNN: graph neural network.</p>
              </fn>
            </table-wrap-foot>
          </table-wrap>
          <p>Unexpectedly, the normalization module also improves the performance of unimodal models, and the benefit is more pronounced for API sequence information (85.32% to 87.78%) than for images (83.27% to 84.64%). We hypothesize that normalization promotes a more uniform distribution across each dimension of the features, as illustrated in <xref ref-type="fig" rid="fig4">Figure 4</xref>. Even within a single modality, feature values differ in scale from sample to sample. The role of normalization in this context is analogous to that of layer normalization in transformers: it diminishes the magnitude disparities across samples while preserving the relative differences among features. This is particularly plausible given that API sequence features of malicious software represent temporal variations of N-gram items, where the relational dynamics within the features are inherently tight. We further conjecture that the normalization module may generalize to other small-sample tasks as a general technique for improving classification outcomes.</p>
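          <p>The layer-normalization analogy can be illustrated with a small sketch. The exact definition of the module is given in Section 4.4.1 and is not reproduced here; the code below assumes a per-sample standardization without learned parameters, and the function name <monospace>normalize_features</monospace> and the toy feature values are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def normalize_features(x, eps=1e-5):
    """Standardize one feature vector to zero mean and (approximately) unit variance.

    Assumed form, analogous to layer normalization without learned scale/shift.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Made-up feature vectors on very different numeric scales.
image_feat = [0.2, 0.4, 0.1, 0.3]
api_feat = [120.0, 340.0, 15.0, 230.0]

image_norm = normalize_features(image_feat)
api_norm = normalize_features(api_feat)


def ranking(values):
    """Indices of the features ordered from smallest to largest value."""
    return sorted(range(len(values)), key=lambda i: values[i])
```
]]></preformat>
          <p>After normalization both vectors occupy the same numeric range, while <monospace>ranking</monospace> returns the same order before and after, so the relative differences among features within a sample are preserved.</p>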
        </sec>
        <sec id="sec5-4-2">
          <title>5.4.2. Effect of loss function</title>
          <p>We examine the contributions of the loss function described in Section 4.5. As shown in <xref ref-type="table" rid="t5">Table 5</xref>, incorporating the individual predictions of each modality into the loss function improves model performance. For our proposed GNN fusion approach, accuracy rises from 95.26% to 95.73%; for the simple concatenation (early fusion) model, it rises by approximately 1.3 percentage points, from 91.47% to 92.78%, in the 5-way 5-shot task on the VirusShare-M dataset.</p>
          <table-wrap id="t5">
            <label>Table 5</label>
            <caption>
              <p>Ablation experiments on effect of loss function</p>
            </caption>
            <table frame="hsides" rules="groups">
              <thead>
                <tr>
                  <td style="border-bottom:1;">
                    <bold>Model</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>Image</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>API</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>MM</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>Accuracy</bold>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>Early fusion</td>
                  <td />
                  <td />
                  <td>√</td>
                  <td>91.47 ± 0.10</td>
                </tr>
                <tr>
                  <td>Early fusion</td>
                  <td>√</td>
                  <td>√</td>
                  <td>√</td>
                  <td>92.78 ± 0.10</td>
                </tr>
                <tr>
                  <td>GNN fusion</td>
                  <td />
                  <td />
                  <td>√</td>
                  <td>95.26 ± 0.08</td>
                </tr>
                <tr>
                  <td>GNN fusion</td>
                  <td>√</td>
                  <td>√</td>
                  <td>√</td>
                  <td>95.73 ± 0.07</td>
                </tr>
              </tbody>
            </table>
            <table-wrap-foot>
              <fn>
                <p>API: Application programming interface; MM: multimodal; GNN: graph neural network.</p>
              </fn>
            </table-wrap-foot>
          </table-wrap>
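          <p>The idea of adding per-modality terms to the training objective can be sketched as follows. The exact formulation is given in Section 4.5 and is not reproduced here; this sketch assumes a sum of cross-entropy terms over the multimodal (MM), image, and API predictions, and the weights <monospace>w_img</monospace> and <monospace>w_api</monospace>, the function names, and the toy logits are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: log-sum-exp of the logits minus the target logit."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target]


def combined_loss(mm_logits, img_logits, api_logits, target, w_img=1.0, w_api=1.0):
    """Multimodal loss plus weighted unimodal auxiliary losses (assumed form)."""
    return (
        cross_entropy(mm_logits, target)
        + w_img * cross_entropy(img_logits, target)
        + w_api * cross_entropy(api_logits, target)
    )


# Toy 5-way logits for a query whose true class index is 2.
mm_logits = [0.1, 0.3, 2.5, 0.2, 0.0]
img_logits = [0.4, 0.9, 1.1, 0.3, 0.2]
api_logits = [0.0, 0.2, 1.8, 0.5, 0.1]

mm_only = combined_loss(mm_logits, img_logits, api_logits, 2, w_img=0.0, w_api=0.0)
full = combined_loss(mm_logits, img_logits, api_logits, 2)
```
]]></preformat>
          <p>Setting both weights to zero corresponds to the rows of Table 5 in which only the MM column is checked; with non-zero weights, each unimodal branch also receives a direct training signal.</p>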
        </sec>
        <sec id="sec5-4-3">
          <title>5.4.3. Computing cost</title>
          <p>In previous sections, we mentioned that our GNN-based feature fusion module is lightweight. Here we quantify its computing cost. For a direct comparison, we calculated the computational cost (in multiply-accumulate operations) and the number of parameters of the GNN-based feature fusion module and compared them with those of the feature extraction modules described in <xref ref-type="fig" rid="fig3">Figure 3</xref>. Additionally, to evaluate suitability for real-world deployment scenarios, we measured inference latency and memory usage on a CPU-based system (Intel Core i7-10750H, 16 GB RAM) without GPU acceleration. All measurements are averaged over 1,000 runs. The hyperparameter settings for the GNN fusion module are consistent with those discussed in Section 5.2.</p>
          <p>We quantified the computational metrics using the ptflops library (<uri xlink:href="https://pypi.org/project/ptflops/">https://pypi.org/project/ptflops/</uri>); the results are reported in <xref ref-type="table" rid="t6">Table 6</xref>. The GNN fusion module requires only 189.86 MMac of multiply-accumulate operations (MACs) and has a low parameter count of 79.11 k. Relative to the two feature extraction modules combined (53.50 GMac and about 1.49 M parameters), the GNN fusion module increases the model’s MACs by just 0.35% and its parameter count by about 5%, yet achieves effective feature fusion. In terms of deployment efficiency, the GNN fusion module adds only 3.2 ms of latency and 55.8 MB of memory overhead in our CPU-only measurements, which should be acceptable for many real-time applications. These attributes make the module suitable for scenarios where computational resources are limited and model simplicity must be maintained without sacrificing performance.</p>
          <table-wrap id="t6">
            <label>Table 6</label>
            <caption>
              <p>Computing cost and deployment efficiency of each module in our framework</p>
            </caption>
            <table frame="hsides" rules="groups">
              <thead>
                <tr>
                  <td style="border-bottom:1;">
                    <bold>Module</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>MACs</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>Params</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>Inference time (ms)</bold>
                  </td>
                  <td style="border-bottom:1;">
                    <bold>Memory (MB)</bold>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>CNN4</td>
                  <td>10.09 GMac</td>
                  <td>95.49 k</td>
                  <td>12.3</td>
                  <td>89.4</td>
                </tr>
                <tr>
                  <td>Simple transformer</td>
                  <td>43.41 GMac</td>
                  <td>1.39 M</td>
                  <td>8.7</td>
                  <td>112.6</td>
                </tr>
                <tr>
                  <td>GNN fusion</td>
                  <td>189.86 MMac</td>
                  <td>79.11 k</td>
                  <td>3.2</td>
                  <td>55.8</td>
                </tr>
              </tbody>
            </table>
            <table-wrap-foot>
              <fn>
                <p>MACs: Multiply-accumulate operations; CNN: convolutional neural network; GNN: graph neural network.</p>
              </fn>
            </table-wrap-foot>
          </table-wrap>
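          <p>The relative overhead figures can be rechecked directly from the values in Table 6. The short calculation below assumes that the percentages are taken relative to the sum of the two feature extraction modules (CNN4 and the simple transformer); the helper name <monospace>relative_overhead</monospace> is hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
# Values copied from Table 6 (MACs in Mac, parameters in absolute counts).
macs = {"CNN4": 10.09e9, "Simple transformer": 43.41e9, "GNN fusion": 189.86e6}
params = {"CNN4": 95.49e3, "Simple transformer": 1.39e6, "GNN fusion": 79.11e3}


def relative_overhead(costs, module="GNN fusion"):
    """Cost of `module` as a percentage of the summed cost of all other modules."""
    baseline = sum(value for name, value in costs.items() if name != module)
    return 100.0 * costs[module] / baseline


mac_overhead = relative_overhead(macs)
param_overhead = relative_overhead(params)
print(f"MAC overhead: {mac_overhead:.2f}%, parameter overhead: {param_overhead:.2f}%")
```
]]></preformat>
          <p>Under this assumption, the MAC overhead rounds to 0.35% and the parameter overhead is slightly above 5%, in line with the figures quoted above.</p>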
        </sec>
      </sec>
      <sec id="sec5-5">
        <title>5.5. Interpretability analysis</title>
        <p>To understand how our GNN fusion module leverages cross-modal correlations, we analyze the attention weights learned during message passing for representative malware samples.</p>
        <p>
          <bold>Case Study: Zbot Family.</bold> For a Zbot (Zeus) malware sample, the API sequence contains N-gram segments related to registry operations (e.g., “RegOpenKeyEx”-“RegSetValueEx”) and file writes (e.g., “CreateFile”-“WriteFile”). The GNN assigns high attention weights between these API segments and specific image patches from the binary’s code and data sections, where corresponding registry key strings and configuration data are stored. This alignment indicates that the GNN learns to associate behavioral patterns (API calls) with structural evidence (binary content) serving the same malicious purpose.</p>
        <p>
          <bold>Quantitative Analysis.</bold> Across all correctly classified samples in the VirusShare-M test set, registry-related API segments show an average attention weight of 0.342 to code section image patches, compared to 0.156 to other sections (119% higher). For correctly classified samples of the visually similar Swizzor.gen!E and Swizzor.gen!I families, the top-3 attention weights account for 67.3% of total attention, whereas for misclassified samples they account for only 41.8%. Additionally, attention patterns remain stable across different few-shot tasks, with a standard deviation of only 0.042, indicating transferable cross-modal relationships rather than sample-specific overfitting.</p>
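        <p>The two statistics used above, the relative increase between mean attention weights and the share of attention held by the top-3 edges, can be computed as in the following sketch. The function names and the toy attention vectors are hypothetical; only the 0.342 and 0.156 averages are taken from the analysis above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def relative_increase(a, b):
    """Percentage by which a exceeds b."""
    return 100.0 * (a - b) / b


def topk_share(weights, k=3):
    """Fraction of the total attention mass carried by the k largest weights."""
    return sum(sorted(weights, reverse=True)[:k]) / sum(weights)


# Reported mean attention: registry API segments to code-section patches vs. other sections.
increase = relative_increase(0.342, 0.156)

# Made-up attention vectors over six cross-modal edges.
focused = [0.40, 0.20, 0.10, 0.10, 0.10, 0.10]   # concentrated on a few segment pairs
diffuse = [0.18, 0.17, 0.17, 0.16, 0.16, 0.16]   # spread across many segment pairs

focused_share = topk_share(focused)
diffuse_share = topk_share(diffuse)
```
]]></preformat>
        <p>A higher top-3 share indicates attention concentrated on a few discriminative segment pairs, which is the pattern reported for correctly classified samples.</p>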
        <p>
          <bold>Contrast with Unimodal Models.</bold> An API-only model detects registry manipulation but lacks structural context; an image-only model sees suspicious strings but cannot verify their runtime usage. The GNN fusion bridges this gap by explicitly modeling cross-modal relationships, enabling more robust classification.</p>
        <p>
          <bold>Differentiating Similar Families.</bold> For visually similar families like Swizzor.gen!E and Swizzor.gen!I, correctly classified samples show GNN attention focused on discriminative segment pairs (e.g., unique network API patterns aligned with corresponding API hashes in the binary), while misclassified samples exhibit diffuse attention across many segments.</p>
        <p>
          <bold>Implications for Few-Shot Learning.</bold> The GNN’s ability to capture consistent cross-modal correlations provides additional supervisory signal beyond limited labeled examples, improving generalization in few-shot scenarios.</p>
        <p>In summary, the GNN fusion module enhances interpretability by revealing how behavioral and structural evidence complement each other, supporting both model understanding and analyst trust.</p>
      </sec>
    </sec>
    <sec id="sec6">
      <title>6. CONCLUSIONS</title>
      <p>In this paper, we proposed leveraging multimodal information from malware, specifically API invocation sequences and converted grayscale images, for few-shot malware classification. Additionally, we contributed two datasets to facilitate this research. Furthermore, we introduced a lightweight, GNN-based feature fusion module that constructs a graph network among features from multiple modalities, achieving feature integration through message passing. We conducted evaluations on the two datasets we constructed, and the experimental results demonstrate that utilizing multimodal features significantly enhances the accuracy of few-shot malware classification. The results also indicate that our proposed GNN fusion module effectively integrates features. Additional ablation studies discuss the impact of each component of our module and also show that our feature fusion module achieves effective integration with minimal computational overhead.</p>
    </sec>
  </body>
  <back>
    <sec>
      <title>DECLARATIONS</title>
      <sec>
        <title>Authors’ contributions</title>
        <p>Conceptualization: Ren, Y.; Wang, J.</p>
        <p>Data curation: Wang, J.; Liu, Z.</p>
        <p>Formal analysis: Ren, Y.; Liu, Z.</p>
        <p>Funding acquisition: Wang, P.</p>
        <p>Investigation: Ren, Y.; Wang, P.</p>
        <p>Methodology: Ren, Y.</p>
        <p>Project administration: Wang, P.</p>
        <p>Resources: Wang, P.</p>
        <p>Software: Ren, Y.; Wang, J.</p>
        <p>Supervision: Wang, P.; Liu, Z.</p>
        <p>Validation: Ren, Y.</p>
        <p>Visualization: Ren, Y.; Liu, Z.</p>
        <p>Writing - original draft: Ren, Y.; Liu, Z.; Wang, J.</p>
        <p>Writing - review and editing: Wang, J.; Wang, P.</p>
      </sec>
      <sec>
        <title>Availability of data and materials</title>
        <p>The original contributions presented in the study are included in the article/<inline-supplementary-material content-type="local-data" mimetype="application/pdf" xlink:href="ir6013-SupplementaryMaterials.pdf">Supplementary Materials</inline-supplementary-material>; further inquiries can be directed to the corresponding author(s).</p>
      </sec>
      <sec>
        <title>AI and AI-assisted tools statement</title>
        <p>Not applicable.</p>
      </sec>
      <sec>
        <title>Financial support and sponsorship</title>
        <p>None.</p>
      </sec>
      <sec>
        <title>Conflicts of interest</title>
        <p>All authors declared that there are no conflicts of interest.</p>
      </sec>
      <sec>
        <title>Ethical approval and consent to participate</title>
        <p>Not applicable.</p>
      </sec>
      <sec>
        <title>Consent for publication</title>
        <p>Not applicable.</p>
      </sec>
      <sec>
        <title>Copyright</title>
        <p>© The Author(s) 2026.</p>
      </sec>
      <sec sec-type="supplementary-material">
        <title>Supplementary Materials</title>
        <supplementary-material content-type="local-data">
          <media xlink:href="ir6013-SupplementaryMaterials.pdf" mimetype="application/pdf">
            <caption>
              <p>Supplementary Materials</p>
            </caption>
          </media>
        </supplementary-material>
      </sec>
    </sec>
    <ref-list>
      <ref id="B1">
        <label>1</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Stuttard</surname>
              <given-names>D</given-names>
            </name>
            <name>
              <surname>Pinto</surname>
              <given-names>M</given-names>
            </name>
          </person-group>
          <comment>The web application hacker’s handbook: finding and exploiting security flaws. John Wiley &amp; Sons; 2011. <uri xlink:href="https://books.google.com/books?id=NSBHAAAAQBAJ">https://books.google.com/books?id=NSBHAAAAQBAJ</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B2">
        <label>2</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Bhodia</surname>
              <given-names>N</given-names>
            </name>
            <name>
              <surname>Prajapati</surname>
              <given-names>P</given-names>
            </name>
            <name>
              <surname>Di Troia</surname>
              <given-names>F</given-names>
            </name>
            <name>
              <surname>Stamp</surname>
              <given-names>M</given-names>
            </name>
          </person-group>
          <comment>Transfer learning for image-based malware classification. <italic>arXiv</italic> <bold>2019</bold>, arXiv:1903.11551. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.1903.11551">https://doi.org/10.48550/arXiv.1903.11551</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B3">
        <label>3</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Vu</surname>
              <given-names>DL</given-names>
            </name>
            <name>
              <surname>Nguyen</surname>
              <given-names>TK</given-names>
            </name>
            <name>
              <surname>Nguyen</surname>
              <given-names>TV</given-names>
            </name>
            <name>
              <surname>Nguyen</surname>
              <given-names>TN</given-names>
            </name>
            <name>
              <surname>Massacci</surname>
              <given-names>F</given-names>
            </name>
            <name>
              <surname>Phung</surname>
              <given-names>PH</given-names>
            </name>
          </person-group>
          <comment>A convolutional transformation network for malware classification. In <italic>2019 6th NAFOSTED conference on information and computer science (NICS)</italic>. Hanoi, Vietnam, Dec 12-13, 2019. IEEE; 2019. pp. 234-39.</comment>
          <pub-id pub-id-type="doi">10.1109/NICS48868.2019.9023876</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B4">
        <label>4</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Vasan</surname>
              <given-names>D</given-names>
            </name>
            <name>
              <surname>Alazab</surname>
              <given-names>M</given-names>
            </name>
            <name>
              <surname>Wassan</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Safaei</surname>
              <given-names>B</given-names>
            </name>
            <name>
              <surname>Zheng</surname>
              <given-names>Q</given-names>
            </name>
          </person-group>
          <article-title>Image-based malware classification using ensemble of CNN architectures (IMCEC)</article-title>
          <source>Comput Secur</source>
          <year>2020</year>
          <volume>92</volume>
          <fpage>101748</fpage>
          <pub-id pub-id-type="doi">10.1016/j.cose.2020.101748</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B5">
        <label>5</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Kolosnjaji</surname>
              <given-names>B</given-names>
            </name>
            <name>
              <surname>Zarras</surname>
              <given-names>A</given-names>
            </name>
            <name>
              <surname>Webster</surname>
              <given-names>G</given-names>
            </name>
            <name>
              <surname>Eckert</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <comment>Deep learning for classification of malware system call sequences. In <italic>AI 2016: Advances in Artificial Intelligence: 29th Australasian Joint Conference</italic>. Hobart, Australia, December 5-8, 2016. Springer; 2016. pp. 137-49.</comment>
          <pub-id pub-id-type="doi">10.1007/978-3-319-50127-7_11</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B6">
        <label>6</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Wang</surname>
              <given-names>P</given-names>
            </name>
            <name>
              <surname>Tang</surname>
              <given-names>Z</given-names>
            </name>
            <name>
              <surname>Wang</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>A novel few-shot malware classification approach for unknown family recognition with multi-prototype modeling</article-title>
          <source>Comput Secur</source>
          <year>2021</year>
          <volume>106</volume>
          <fpage>102273</fpage>
          <pub-id pub-id-type="doi">10.1016/j.cose.2021.102273</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B7">
        <label>7</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Wang</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Lin</surname>
              <given-names>T</given-names>
            </name>
            <name>
              <surname>Wu</surname>
              <given-names>H</given-names>
            </name>
            <name>
              <surname>Wang</surname>
              <given-names>P</given-names>
            </name>
          </person-group>
          <article-title>AGProto: adaptive graph ProtoNet towards sample adaption for few-shot malware classification</article-title>
          <source>Electronics</source>
          <year>2024</year>
          <volume>13</volume>
          <fpage>935</fpage>
          <pub-id pub-id-type="doi">10.3390/electronics13050935</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B8">
        <label>8</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Snell</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Swersky</surname>
              <given-names>K</given-names>
            </name>
            <name>
              <surname>Zemel</surname>
              <given-names>R</given-names>
            </name>
          </person-group>
          <comment>Prototypical networks for few-shot learning. <italic>arXiv</italic> <bold>2017</bold>, arXiv:1703.05175. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.1703.05175">https://doi.org/10.48550/arXiv.1703.05175</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B9">
        <label>9</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Vinayakumar</surname>
              <given-names>R</given-names>
            </name>
            <name>
              <surname>Alazab</surname>
              <given-names>M</given-names>
            </name>
            <name>
              <surname>Soman</surname>
              <given-names>K</given-names>
            </name>
            <name>
              <surname>Poornachandran</surname>
              <given-names>P</given-names>
            </name>
            <name>
              <surname>Venkatraman</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>Robust intelligent malware detection using deep learning</article-title>
          <source>IEEE Access</source>
          <year>2019</year>
          <volume>7</volume>
          <fpage>46717</fpage>
          <lpage>38</lpage>
          <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2906934</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B10">
        <label>10</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Raff</surname>
              <given-names>E</given-names>
            </name>
            <name>
              <surname>Barker</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Sylvester</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Brandon</surname>
              <given-names>R</given-names>
            </name>
            <name>
              <surname>Catanzaro</surname>
              <given-names>B</given-names>
            </name>
            <name>
              <surname>Nicholas</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <comment>Malware detection by eating a whole EXE. <italic>arXiv</italic> <bold>2017</bold>, arXiv:1710.09435. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.1710.09435">https://doi.org/10.48550/arXiv.1710.09435</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B11">
        <label>11</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Gibert</surname>
              <given-names>D</given-names>
            </name>
            <name>
              <surname>Mateu</surname>
              <given-names>C</given-names>
            </name>
            <name>
              <surname>Planes</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <comment>A hierarchical convolutional neural network for malware classification. In <italic>2019 International Joint Conference on Neural Networks (IJCNN)</italic>. Budapest, Hungary, Jul 14-19, 2019. IEEE; 2019. pp. 1-8.</comment>
          <pub-id pub-id-type="doi">10.1109/IJCNN.2019.8852469</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B12">
        <label>12</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Raff</surname>
              <given-names>E</given-names>
            </name>
            <name>
              <surname>Zak</surname>
              <given-names>R</given-names>
            </name>
            <name>
              <surname>Cox</surname>
              <given-names>R</given-names>
            </name>
            <etal />
          </person-group>
          <article-title>An investigation of byte n-gram features for malware classification</article-title>
          <source>J Comput Virol Hack Tech</source>
          <year>2018</year>
          <volume>14</volume>
          <fpage>1</fpage>
          <lpage>20</lpage>
          <pub-id pub-id-type="doi">10.1007/s11416-016-0283-1</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B13">
        <label>13</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Cui</surname>
              <given-names>Z</given-names>
            </name>
            <name>
              <surname>Xue</surname>
              <given-names>F</given-names>
            </name>
            <name>
              <surname>Cai</surname>
              <given-names>X</given-names>
            </name>
            <name>
              <surname>Cao</surname>
              <given-names>Y</given-names>
            </name>
            <name>
              <surname>Wang</surname>
              <given-names>GG</given-names>
            </name>
            <etal />
          </person-group>
          <article-title>Detection of malicious code variants based on deep learning</article-title>
          <source>IEEE Trans Ind Inform</source>
          <year>2018</year>
          <volume>14</volume>
          <fpage>3187</fpage>
          <lpage>96</lpage>
          <pub-id pub-id-type="doi">10.1109/TII.2018.2822680</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B14">
        <label>14</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Jiang</surname>
              <given-names>Y</given-names>
            </name>
            <name>
              <surname>Li</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Wu</surname>
              <given-names>Y</given-names>
            </name>
            <name>
              <surname>Zou</surname>
              <given-names>F</given-names>
            </name>
          </person-group>
          <comment>A novel image-based malware classification model using deep learning. In <italic>International Conference on Neural Information Processing</italic>. Springer; 2019. pp. 150-61.</comment>
          <pub-id pub-id-type="doi">10.1007/978-3-030-36711-4_14</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B15">
        <label>15</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Yuan</surname>
              <given-names>B</given-names>
            </name>
            <name>
              <surname>Wang</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Liu</surname>
              <given-names>D</given-names>
            </name>
            <name>
              <surname>Guo</surname>
              <given-names>W</given-names>
            </name>
            <name>
              <surname>Wu</surname>
              <given-names>P</given-names>
            </name>
            <name>
              <surname>Bao</surname>
              <given-names>X</given-names>
            </name>
          </person-group>
          <article-title>Byte-level malware classification based on Markov images and deep learning</article-title>
          <source>Comput Secur</source>
          <year>2020</year>
          <volume>92</volume>
          <fpage>101740</fpage>
          <pub-id pub-id-type="doi">10.1016/j.cose.2020.101740</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B16">
        <label>16</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Sharma</surname>
              <given-names>O</given-names>
            </name>
            <name>
              <surname>Sharma</surname>
              <given-names>A</given-names>
            </name>
            <name>
              <surname>Kalia</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>Windows and IoT malware visualization and classification with deep CNN and Xception CNN using Markov images</article-title>
          <source>J Intell Inf Syst</source>
          <year>2023</year>
          <volume>60</volume>
          <fpage>349</fpage>
          <lpage>75</lpage>
          <pub-id pub-id-type="doi">10.1007/s10844-022-00734-4</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B17">
        <label>17</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Azmoodeh</surname>
              <given-names>A</given-names>
            </name>
            <name>
              <surname>Dehghantanha</surname>
              <given-names>A</given-names>
            </name>
            <name>
              <surname>Choo</surname>
              <given-names>KKR</given-names>
            </name>
          </person-group>
          <article-title>Robust malware detection for internet of (battlefield) things devices using deep eigenspace learning</article-title>
          <source>IEEE Trans Sustain Comput</source>
          <year>2019</year>
          <volume>4</volume>
          <fpage>88</fpage>
          <lpage>95</lpage>
          <pub-id pub-id-type="doi">10.1109/TSUSC.2018.2809665</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B18">
        <label>18</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Zhu</surname>
              <given-names>H</given-names>
            </name>
            <name>
              <surname>Gu</surname>
              <given-names>W</given-names>
            </name>
            <name>
              <surname>Wang</surname>
              <given-names>L</given-names>
            </name>
            <name>
              <surname>Xu</surname>
              <given-names>Z</given-names>
            </name>
            <name>
              <surname>Sheng</surname>
              <given-names>VS</given-names>
            </name>
          </person-group>
          <article-title>Android malware detection based on multi-head squeeze-and-excitation residual network</article-title>
          <source>Expert Syst Appl</source>
          <year>2023</year>
          <volume>212</volume>
          <fpage>118705</fpage>
          <pub-id pub-id-type="doi">10.1016/j.eswa.2022.118705</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B19">
        <label>19</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Sasidharan</surname>
              <given-names>SK</given-names>
            </name>
            <name>
              <surname>Thomas</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <article-title>ProDroid - an Android malware detection framework based on profile hidden Markov model</article-title>
          <source>Pervasive Mob Comput</source>
          <year>2021</year>
          <volume>72</volume>
          <fpage>101336</fpage>
          <pub-id pub-id-type="doi">10.1016/j.pmcj.2021.101336</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B20">
        <label>20</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Lee</surname>
              <given-names>YT</given-names>
            </name>
            <name>
              <surname>Ban</surname>
              <given-names>T</given-names>
            </name>
            <name>
              <surname>Wan</surname>
              <given-names>TL</given-names>
            </name>
            <etal />
          </person-group>
          <comment>Cross platform IoT-malware family classification based on printable strings. In <italic>2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)</italic>. Guangzhou, China, Dec 29 2020 - Jan 01 2021. IEEE; 2020. pp. 775-84.</comment>
          <pub-id pub-id-type="doi">10.1109/TrustCom50675.2020.00106</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B21">
        <label>21</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Han</surname>
              <given-names>W</given-names>
            </name>
            <name>
              <surname>Xue</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Wang</surname>
              <given-names>Y</given-names>
            </name>
            <name>
              <surname>Huang</surname>
              <given-names>L</given-names>
            </name>
            <name>
              <surname>Kong</surname>
              <given-names>Z</given-names>
            </name>
            <name>
              <surname>Mao</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>MalDAE: detecting and explaining malware based on correlation and fusion of static and dynamic characteristics</article-title>
          <source>Comput Secur</source>
          <year>2019</year>
          <volume>83</volume>
          <fpage>208</fpage>
          <lpage>33</lpage>
          <pub-id pub-id-type="doi">10.1016/j.cose.2019.02.007</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B22">
        <label>22</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Xu</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Li</surname>
              <given-names>Y</given-names>
            </name>
            <name>
              <surname>Deng</surname>
              <given-names>RH</given-names>
            </name>
            <name>
              <surname>Xu</surname>
              <given-names>K</given-names>
            </name>
          </person-group>
          <article-title>SDAC: a slow-aging solution for Android malware detection using semantic distance based API clustering</article-title>
          <source>IEEE Trans Dependable Secure Comput</source>
          <year>2022</year>
          <volume>19</volume>
          <fpage>1149</fpage>
          <lpage>63</lpage>
          <pub-id pub-id-type="doi">10.1109/TDSC.2020.3005088</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B23">
        <label>23</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Yan</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Ren</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Wang</surname>
              <given-names>W</given-names>
            </name>
            <name>
              <surname>Sun</surname>
              <given-names>L</given-names>
            </name>
            <name>
              <surname>Zhang</surname>
              <given-names>W</given-names>
            </name>
            <name>
              <surname>Yu</surname>
              <given-names>Q</given-names>
            </name>
          </person-group>
          <article-title>A survey of adversarial attack and defense methods for malware classification in cyber security</article-title>
          <source>IEEE Commun Surv Tutor</source>
          <year>2023</year>
          <volume>25</volume>
          <fpage>467</fpage>
          <lpage>96</lpage>
          <pub-id pub-id-type="doi">10.1109/COMST.2022.3225137</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B24">
        <label>24</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Han</surname>
              <given-names>X</given-names>
            </name>
            <name>
              <surname>Yu</surname>
              <given-names>X</given-names>
            </name>
            <name>
              <surname>Pasquier</surname>
              <given-names>T</given-names>
            </name>
            <etal />
          </person-group>
          <comment>SIGL: Securing software installations through deep graph learning. In <italic>30th USENIX Security Symposium (USENIX Security 21)</italic>. USENIX Association; 2021. pp. 2345-62. <uri xlink:href="https://www.usenix.org/conference/usenixsecurity21/presentation/han-xueyuan">https://www.usenix.org/conference/usenixsecurity21/presentation/han-xueyuan</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B25">
        <label>25</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Amer</surname>
              <given-names>E</given-names>
            </name>
            <name>
              <surname>Zelinka</surname>
              <given-names>I</given-names>
            </name>
            <name>
              <surname>El-Sappagh</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>A multi-perspective malware detection approach through behavioral fusion of API call sequence</article-title>
          <source>Comput Secur</source>
          <year>2021</year>
          <volume>110</volume>
          <fpage>102449</fpage>
          <pub-id pub-id-type="doi">10.1016/j.cose.2021.102449</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B26">
        <label>26</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Lin</surname>
              <given-names>K</given-names>
            </name>
            <name>
              <surname>Xu</surname>
              <given-names>X</given-names>
            </name>
            <name>
              <surname>Xiao</surname>
              <given-names>F</given-names>
            </name>
          </person-group>
          <article-title>MFFusion: a multi-level features fusion model for malicious traffic detection based on deep learning</article-title>
          <source>Comput Netw</source>
          <year>2022</year>
          <volume>202</volume>
          <fpage>108658</fpage>
          <pub-id pub-id-type="doi">10.1016/j.comnet.2021.108658</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B27">
        <label>27</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Almashhadani</surname>
              <given-names>AO</given-names>
            </name>
            <name>
              <surname>Carlin</surname>
              <given-names>D</given-names>
            </name>
            <name>
              <surname>Kaiiali</surname>
              <given-names>M</given-names>
            </name>
            <name>
              <surname>Sezer</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>MFMCNS: a multi-feature and multi-classifier network-based system for ransomworm detection</article-title>
          <source>Comput Secur</source>
          <year>2022</year>
          <volume>121</volume>
          <fpage>102860</fpage>
          <pub-id pub-id-type="doi">10.1016/j.cose.2022.102860</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B28">
        <label>28</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Wang</surname>
              <given-names>Q</given-names>
            </name>
            <name>
              <surname>Hassan</surname>
              <given-names>WU</given-names>
            </name>
            <name>
              <surname>Li</surname>
              <given-names>D</given-names>
            </name>
            <etal />
          </person-group>
          <comment>You are what you do: hunting stealthy malware via data provenance analysis. In <italic>Network and Distributed Systems Security (NDSS) Symposium 2020</italic>. San Diego, USA, Feb 23-26, 2020. <uri xlink:href="https://www.ndss-symposium.org/wp-content/uploads/2020/02/24167.pdf">https://www.ndss-symposium.org/wp-content/uploads/2020/02/24167.pdf</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B29">
        <label>29</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Damodaran</surname>
              <given-names>A</given-names>
            </name>
            <name>
              <surname>Di Troia</surname>
              <given-names>F</given-names>
            </name>
            <name>
              <surname>Visaggio</surname>
              <given-names>CA</given-names>
            </name>
            <name>
              <surname>Austin</surname>
              <given-names>TH</given-names>
            </name>
            <name>
              <surname>Stamp</surname>
              <given-names>M</given-names>
            </name>
          </person-group>
          <article-title>A comparison of static, dynamic, and hybrid analysis for malware detection</article-title>
          <source>J Comput Virol Hack Tech</source>
          <year>2017</year>
          <volume>13</volume>
          <fpage>1</fpage>
          <lpage>12</lpage>
          <pub-id pub-id-type="doi">10.1007/s11416-015-0261-z</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B30">
        <label>30</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Gopinath</surname>
              <given-names>M</given-names>
            </name>
            <name>
              <surname>Sethuraman</surname>
              <given-names>SC</given-names>
            </name>
          </person-group>
          <article-title>A comprehensive survey on deep learning based malware detection techniques</article-title>
          <source>Comput Sci Rev</source>
          <year>2023</year>
          <volume>47</volume>
          <fpage>100529</fpage>
          <pub-id pub-id-type="doi">10.1016/j.cosrev.2022.100529</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B31">
        <label>31</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Huang</surname>
              <given-names>X</given-names>
            </name>
            <name>
              <surname>Ma</surname>
              <given-names>L</given-names>
            </name>
            <name>
              <surname>Yang</surname>
              <given-names>W</given-names>
            </name>
            <name>
              <surname>Zhong</surname>
              <given-names>Y</given-names>
            </name>
          </person-group>
          <article-title>A method for Windows malware detection based on deep learning</article-title>
          <source>J Sign Process Syst</source>
          <year>2021</year>
          <volume>93</volume>
          <fpage>265</fpage>
          <lpage>73</lpage>
          <pub-id pub-id-type="doi">10.1007/s11265-020-01588-1</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B32">
        <label>32</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Yoo</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Kim</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Kim</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Kang</surname>
              <given-names>BB</given-names>
            </name>
          </person-group>
          <article-title>AI-HydRa: advanced hybrid approach using random forest and deep learning for malware classification</article-title>
          <source>Inform Sci</source>
          <year>2021</year>
          <volume>546</volume>
          <fpage>420</fpage>
          <lpage>35</lpage>
          <pub-id pub-id-type="doi">10.1016/j.ins.2020.08.082</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B33">
        <label>33</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Nguyen</surname>
              <given-names>TN</given-names>
            </name>
            <name>
              <surname>Ngo</surname>
              <given-names>QD</given-names>
            </name>
            <name>
              <surname>Nguyen</surname>
              <given-names>HT</given-names>
            </name>
            <name>
              <surname>Nguyen</surname>
              <given-names>GL</given-names>
            </name>
          </person-group>
          <article-title>An advanced computing approach for IoT-botnet detection in industrial Internet of Things</article-title>
          <source>IEEE Trans Ind Inform</source>
          <year>2022</year>
          <volume>18</volume>
          <fpage>8298</fpage>
          <lpage>306</lpage>
          <pub-id pub-id-type="doi">10.1109/TII.2022.3152814</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B34">
        <label>34</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>O’Shaughnessy</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Sheridan</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>Image-based malware classification hybrid framework based on space-filling curves</article-title>
          <source>Comput Secur</source>
          <year>2022</year>
          <volume>116</volume>
          <fpage>102660</fpage>
          <pub-id pub-id-type="doi">10.1016/j.cose.2022.102660</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B35">
        <label>35</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Kim</surname>
              <given-names>T</given-names>
            </name>
            <name>
              <surname>Kang</surname>
              <given-names>B</given-names>
            </name>
            <name>
              <surname>Rho</surname>
              <given-names>M</given-names>
            </name>
            <name>
              <surname>Sezer</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Im</surname>
              <given-names>EG</given-names>
            </name>
          </person-group>
          <article-title>A multimodal deep learning method for Android malware detection using various features</article-title>
          <source>IEEE Trans Inf Forensics Secur</source>
          <year>2019</year>
          <volume>14</volume>
          <fpage>773</fpage>
          <lpage>88</lpage>
          <pub-id pub-id-type="doi">10.1109/TIFS.2018.2866319</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B36">
        <label>36</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Gibert</surname>
              <given-names>D</given-names>
            </name>
            <name>
              <surname>Mateu</surname>
              <given-names>C</given-names>
            </name>
            <name>
              <surname>Planes</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>HYDRA: a multimodal deep learning framework for malware classification</article-title>
          <source>Comput Secur</source>
          <year>2020</year>
          <volume>95</volume>
          <fpage>101873</fpage>
          <pub-id pub-id-type="doi">10.1016/j.cose.2020.101873</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B37">
        <label>37</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Dib</surname>
              <given-names>M</given-names>
            </name>
            <name>
              <surname>Torabi</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Bou-Harb</surname>
              <given-names>E</given-names>
            </name>
            <name>
              <surname>Assi</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <article-title>A multi-dimensional deep learning framework for IoT malware classification and family attribution</article-title>
          <source>IEEE Trans Netw Serv Manag</source>
          <year>2021</year>
          <volume>18</volume>
          <fpage>1165</fpage>
          <lpage>77</lpage>
          <pub-id pub-id-type="doi">10.1109/TNSM.2021.3075315</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B38">
        <label>38</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>He</surname>
              <given-names>Y</given-names>
            </name>
            <name>
              <surname>Lei</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Qin</surname>
              <given-names>Z</given-names>
            </name>
            <name>
              <surname>Ren</surname>
              <given-names>K</given-names>
            </name>
            <name>
              <surname>Chen</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <comment>Combating concept drift with explanatory detection and adaptation for Android malware classification. In <italic>Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security</italic>. New York, USA. Association for Computing Machinery; 2025. pp. 978-92.</comment>
          <pub-id pub-id-type="doi">10.1145/3719027.3744792</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B39">
        <label>39</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Chai</surname>
              <given-names>Y</given-names>
            </name>
            <name>
              <surname>Chen</surname>
              <given-names>X</given-names>
            </name>
            <name>
              <surname>Qiu</surname>
              <given-names>J</given-names>
            </name>
            <etal />
          </person-group>
          <article-title>MalFSCIL: a few-shot class-incremental learning approach for malware detection</article-title>
          <source>IEEE Trans Inf Forensics Secur</source>
          <year>2025</year>
          <volume>20</volume>
          <fpage>2999</fpage>
          <lpage>3014</lpage>
          <pub-id pub-id-type="doi">10.1109/TIFS.2024.3516565</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B40">
        <label>40</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Miller</surname>
              <given-names>SJ</given-names>
            </name>
            <name>
              <surname>Howard</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Adams</surname>
              <given-names>P</given-names>
            </name>
            <name>
              <surname>Schwan</surname>
              <given-names>M</given-names>
            </name>
            <name>
              <surname>Slater</surname>
              <given-names>R</given-names>
            </name>
          </person-group>
          <comment>Multi-modal classification using images and text. <italic>SMU Data Sci. Rev.</italic> <bold>2020</bold>, <italic>3</italic>, 6. <uri xlink:href="https://scholar.smu.edu/cgi/viewcontent.cgi?article=1165&amp;context=datasciencereview">https://scholar.smu.edu/cgi/viewcontent.cgi?article=1165&amp;context=datasciencereview</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B41">
        <label>41</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Audebert</surname>
              <given-names>N</given-names>
            </name>
            <name>
              <surname>Herold</surname>
              <given-names>C</given-names>
            </name>
            <name>
              <surname>Slimani</surname>
              <given-names>K</given-names>
            </name>
            <name>
              <surname>Vidal</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <comment>Multimodal deep networks for text and image-based document classification. In <italic>Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019</italic>. Würzburg, Germany, Sep 16-20, 2019. Springer; 2020. pp. 427-43.</comment>
          <pub-id pub-id-type="doi">10.1007/978-3-030-43823-4_35</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B42">
        <label>42</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Antol</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Agrawal</surname>
              <given-names>A</given-names>
            </name>
            <name>
              <surname>Lu</surname>
              <given-names>J</given-names>
            </name>
            <etal />
          </person-group>
          <comment>VQA: visual question answering. 2015. <uri xlink:href="https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf">https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B43">
        <label>43</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Xu</surname>
              <given-names>K</given-names>
            </name>
            <name>
              <surname>Ba</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Kiros</surname>
              <given-names>R</given-names>
            </name>
            <etal />
          </person-group>
          <comment>Show, attend and tell: neural image caption generation with visual attention. In <italic>Proceedings of the 32nd International Conference on Machine Learning, PMLR</italic>. 2015. pp. 2048-57. <uri xlink:href="https://proceedings.mlr.press/v37/xuc15.html">https://proceedings.mlr.press/v37/xuc15.html</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B44">
        <label>44</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Anderson</surname>
              <given-names>P</given-names>
            </name>
            <name>
              <surname>He</surname>
              <given-names>X</given-names>
            </name>
            <name>
              <surname>Buehler</surname>
              <given-names>C</given-names>
            </name>
            <etal />
          </person-group>
          <comment>Bottom-up and top-down attention for image captioning and visual question answering. In <italic>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic>. 2018. pp. 6077-86.</comment>
          <pub-id pub-id-type="doi">10.1109/CVPR.2018.00636</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B45">
        <label>45</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Simonyan</surname>
              <given-names>K</given-names>
            </name>
            <name>
              <surname>Zisserman</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <comment>Very deep convolutional networks for large-scale image recognition. <italic>arXiv</italic> <bold>2014</bold>, arXiv:1409.1556. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.1409.1556">https://doi.org/10.48550/arXiv.1409.1556</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B46">
        <label>46</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Hochreiter</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Schmidhuber</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Long short-term memory</article-title>
          <source>Neural Comput</source>
          <year>1997</year>
          <volume>9</volume>
          <fpage>1735</fpage>
          <lpage>80</lpage>
          <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id>
          <pub-id pub-id-type="pmid">9377276</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B47">
        <label>47</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Yang</surname>
              <given-names>Z</given-names>
            </name>
            <name>
              <surname>He</surname>
              <given-names>X</given-names>
            </name>
            <name>
              <surname>Gao</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Deng</surname>
              <given-names>L</given-names>
            </name>
            <name>
              <surname>Smola</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <comment>Stacked attention networks for image question answering. <italic>arXiv</italic> <bold>2015</bold>, arXiv:1511.02274. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.1511.02274">https://doi.org/10.48550/arXiv.1511.02274</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B48">
        <label>48</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Kim</surname>
              <given-names>JH</given-names>
            </name>
            <name>
              <surname>Jun</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Zhang</surname>
              <given-names>BT</given-names>
            </name>
          </person-group>
          <comment>Bilinear attention networks. <italic>arXiv</italic> <bold>2018</bold>, arXiv:1805.07932. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.1805.07932">https://doi.org/10.48550/arXiv.1805.07932</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B49">
        <label>49</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Vaswani</surname>
              <given-names>A</given-names>
            </name>
            <name>
              <surname>Shazeer</surname>
              <given-names>N</given-names>
            </name>
            <name>
              <surname>Parmar</surname>
              <given-names>N</given-names>
            </name>
            <etal />
          </person-group>
          <comment>Attention is all you need. <italic>arXiv</italic> <bold>2017</bold>, arXiv:1706.03762. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.1706.03762">https://doi.org/10.48550/arXiv.1706.03762</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B50">
        <label>50</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Devlin</surname>
              <given-names>J</given-names>
            </name>
            <name>
              <surname>Chang</surname>
              <given-names>MW</given-names>
            </name>
            <name>
              <surname>Lee</surname>
              <given-names>K</given-names>
            </name>
            <name>
              <surname>Toutanova</surname>
              <given-names>K</given-names>
            </name>
          </person-group>
          <comment>BERT: pre-training of deep bidirectional transformers for language understanding. <italic>arXiv</italic> <bold>2018</bold>, arXiv:1810.04805. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.1810.04805">https://doi.org/10.48550/arXiv.1810.04805</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B51">
        <label>51</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Nataraj</surname>
              <given-names>L</given-names>
            </name>
            <name>
              <surname>Karthikeyan</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Jacob</surname>
              <given-names>G</given-names>
            </name>
            <name>
              <surname>Manjunath</surname>
              <given-names>BS</given-names>
            </name>
          </person-group>
          <comment>Malware images: visualization and automatic classification. In <italic>Proceedings of the 8th International Symposium on Visualization for Cyber Security</italic>. Association for Computing Machinery; 2011. pp. 1-7.</comment>
          <pub-id pub-id-type="doi">10.1145/2016904.2016908</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B52">
        <label>52</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Seok</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Kim</surname>
              <given-names>H</given-names>
            </name>
          </person-group>
          <article-title>Visualized malware classification based-on convolutional neural network</article-title>
          <source>J Korea Inst Inf Secur Cryptol</source>
          <year>2016</year>
          <volume>26</volume>
          <fpage>197</fpage>
          <lpage>208</lpage>
          <comment><uri xlink:href="https://www.researchgate.net/publication/301236691_Visualized_Malware_Classification_Based-on_Convolutional_Neural_Network">https://www.researchgate.net/publication/301236691_Visualized_Malware_Classification_Based-on_Convolutional_Neural_Network</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B53">
        <label>53</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Hospedales</surname>
              <given-names>T</given-names>
            </name>
            <name>
              <surname>Antoniou</surname>
              <given-names>A</given-names>
            </name>
            <name>
              <surname>Micaelli</surname>
              <given-names>P</given-names>
            </name>
            <name>
              <surname>Storkey</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <comment>Meta-learning in neural networks: a survey. <italic>IEEE Trans. Pattern Anal. Mach. Intell.</italic> <bold>2021</bold>, <italic>44</italic>, 5149-69. <uri xlink:href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9428530">https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9428530</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B54">
        <label>54</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Song</surname>
              <given-names>Y</given-names>
            </name>
            <name>
              <surname>Wang</surname>
              <given-names>T</given-names>
            </name>
            <name>
              <surname>Mondal</surname>
              <given-names>SK</given-names>
            </name>
            <name>
              <surname>Sahoo</surname>
              <given-names>JP</given-names>
            </name>
          </person-group>
          <comment>A comprehensive survey of few-shot learning: evolution, applications, challenges, and opportunities. <italic>arXiv</italic> <bold>2022</bold>, arXiv:2205.06743. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.2205.06743">https://doi.org/10.48550/arXiv.2205.06743</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B55">
        <label>55</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Vinyals</surname>
              <given-names>O</given-names>
            </name>
            <name>
              <surname>Blundell</surname>
              <given-names>C</given-names>
            </name>
            <name>
              <surname>Lillicrap</surname>
              <given-names>T</given-names>
            </name>
            <name>
              <surname>Kavukcuoglu</surname>
              <given-names>K</given-names>
            </name>
            <name>
              <surname>Wierstra</surname>
              <given-names>D</given-names>
            </name>
          </person-group>
          <comment>Matching networks for one shot learning. <italic>arXiv</italic> <bold>2016</bold>, arXiv:1606.04080. Available online: <uri xlink:href="https://doi.org/10.48550/arXiv.1606.04080">https://doi.org/10.48550/arXiv.1606.04080</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
      <ref id="B56">
        <label>56</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Gao</surname>
              <given-names>T</given-names>
            </name>
            <name>
              <surname>Han</surname>
              <given-names>X</given-names>
            </name>
            <name>
              <surname>Liu</surname>
              <given-names>Z</given-names>
            </name>
            <name>
              <surname>Sun</surname>
              <given-names>M</given-names>
            </name>
          </person-group>
          <article-title>Hybrid attention-based prototypical networks for noisy few-shot relation classification</article-title>
          <source>Proc AAAI Conf Artif Intell</source>
          <year>2019</year>
          <volume>33</volume>
          <fpage>6407</fpage>
          <lpage>14</lpage>
          <pub-id pub-id-type="doi">10.1609/aaai.v33i01.33016407</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B57">
        <label>57</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Ndibanje</surname>
              <given-names>B</given-names>
            </name>
            <name>
              <surname>Kim</surname>
              <given-names>KH</given-names>
            </name>
            <name>
              <surname>Kang</surname>
              <given-names>YJ</given-names>
            </name>
            <name>
              <surname>Kim</surname>
              <given-names>HH</given-names>
            </name>
            <name>
              <surname>Kim</surname>
              <given-names>TY</given-names>
            </name>
            <name>
              <surname>Lee</surname>
              <given-names>HJ</given-names>
            </name>
          </person-group>
          <article-title>Cross-method-based analysis and classification of malicious behavior by API calls extraction</article-title>
          <source>Appl Sci</source>
          <year>2019</year>
          <volume>9</volume>
          <fpage>239</fpage>
          <pub-id pub-id-type="doi">10.3390/app9020239</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B58">
        <label>58</label>
        <nlm-citation publication-type="journal">
          <person-group person-group-type="author">
            <name>
              <surname>Belal</surname>
              <given-names>MM</given-names>
            </name>
            <name>
              <surname>Sundaram</surname>
              <given-names>DM</given-names>
            </name>
          </person-group>
          <article-title>Global-local attention-based butterfly vision transformer for visualization-based malware classification</article-title>
          <source>IEEE Access</source>
          <year>2023</year>
          <volume>11</volume>
          <fpage>69337</fpage>
          <lpage>55</lpage>
          <pub-id pub-id-type="doi">10.1109/ACCESS.2023.3293530</pub-id>
        </nlm-citation>
      </ref>
      <ref id="B59">
        <label>59</label>
        <nlm-citation publication-type="book">
          <person-group person-group-type="author">
            <name>
              <surname>Paszke</surname>
              <given-names>A</given-names>
            </name>
            <name>
              <surname>Gross</surname>
              <given-names>S</given-names>
            </name>
            <name>
              <surname>Chintala</surname>
              <given-names>S</given-names>
            </name>
            <etal />
          </person-group>
          <comment>Automatic differentiation in PyTorch. In <italic>31st Conference on Neural Information Processing Systems (NIPS 2017)</italic>. Long Beach, USA. 2017. <uri xlink:href="https://openreview.net/pdf?id=BJJsrmfCZ">https://openreview.net/pdf?id=BJJsrmfCZ</uri>. (accessed 2026-06-08)</comment>
        </nlm-citation>
      </ref>
    </ref-list>
  </back>
</article>