
Page 860                                                             Fan et al. Intell. Robot. 2025, 5, 859-63  https://dx.doi.org/10.20517/ir.2025.44

Taking autonomous driving as an example, embodied artificial intelligence enables vehicles not only to perceive their surroundings through advanced sensors such as LiDAR, radar, and cameras, but also to dynamically adapt to evolving traffic conditions, interact safely with pedestrians, and respond effectively to unforeseen events[4]. These intelligent agents leverage contextual understanding to make real-time decisions that prioritize both safety and operational efficiency[5]. Similarly, service robots equipped with embodied intelligence can skillfully navigate domestic and public environments, comprehend natural language instructions, and engage in complex interactions with both objects and humans[6]. Such advancements mark a significant step toward the meaningful integration of artificial intelligence into real-world environments.

               This Special Issue aims to demonstrate recent advances in the rapidly evolving field of embodied artificial
               intelligence. We received eight submissions from researchers across the globe, reflecting the growing
               interest and momentum in this area. Following a rigorous peer-review process and valuable feedback from
               expert reviewers, four articles were selected for publication. These contributions primarily address key
challenges in perception and localization, presenting novel insights and methodologies that advance the
               capabilities of intelligent embodied systems.

               2. CONTRIBUTED ARTICLES
Among the four accepted articles, three deep learning-based studies focus on leveraging single-modal neural networks to perform salient object detection[7], facial expression recognition[8], and infrared object detection[9], respectively. Although targeting different application domains, these studies share a unified
               vision: enabling machines to perceive, interpret, and respond to complex visual environments in real time.
               Each work introduces specialized network architectures designed to enhance perception accuracy while
               maintaining computational efficiency, reflecting a collective push toward deployable artificial intelligence in
               resource-constrained or dynamic conditions. A notable commonality among these studies lies in their
               emphasis on multi-scale feature learning. Whether through attention mechanisms, convolution-transformer
               fusion, or scale-adaptive modules, all three approaches integrate hierarchical visual cues to capture both
               global semantic context and fine-grained structural detail. This multi-scale strategy enables a crucial balance
               between semantic comprehension and spatial precision, an essential attribute for embodied artificial
               intelligence systems operating in unstructured or variable environments. Furthermore, each method
               demonstrates strong empirical performance on public benchmark datasets, underscoring the robustness and
               generalizability of the proposed designs.
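To make the shared multi-scale strategy concrete, the sketch below illustrates the general idea of aggregating hierarchical visual cues. It is a hypothetical NumPy toy example, not code from any of the contributed articles: a feature map is average-pooled at several scales (coarser scales capture global context, finer scales preserve structural detail), and the upsampled results are fused into a single map.

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool a square feature map x (H x W) with kernel and stride k."""
    h, w = x.shape
    return x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def upsample(x, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def multiscale_fuse(feat, scales=(1, 2, 4)):
    """Fuse the input feature map with coarser pooled versions of itself.

    Scale 1 keeps fine-grained detail; larger scales inject increasingly
    global context. The fused map has the same spatial size as the input.
    """
    h, w = feat.shape
    fused = np.zeros_like(feat, dtype=float)
    for k in scales:
        pooled = avg_pool(feat, k) if k > 1 else feat
        fused += upsample(pooled, k)[:h, :w]
    return fused / len(scales)

feat = np.arange(16, dtype=float).reshape(4, 4)
out = multiscale_fuse(feat)
print(out.shape)  # (4, 4): same resolution as the input, enriched with context
```

Real architectures replace the fixed average pooling with learned attention or convolution-transformer branches, but the balance the editorial highlights, between global semantic context and fine spatial precision, comes from exactly this kind of cross-scale aggregation.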


Despite these shared foundations, each study makes distinct and complementary contributions. The study[7] focuses on lightweight salient object detection, introducing scale-adaptive feature extraction and multi-scale feature aggregation modules to achieve an optimal trade-off between efficiency and accuracy. The study[8] addresses facial expression recognition toward embodied artificial intelligence, proposing a multi-scale attention and convolution-transformer fusion network to enhance emotion-aware human-robot interaction[10]. Finally, the study[9] targets infrared object detection under adverse weather conditions, combining the MobileNetV3-YOLOv4 architecture with an image-enhancement generative adversarial network to ensure high-precision detection on low-power edge devices.

               Taken together, these studies collectively advance the frontier of efficient and adaptive visual perception. By
               addressing complementary aspects of perception, ranging from the semantic understanding of human
               emotions to the structural saliency of objects and the multi-modal robustness required under low-visibility
               conditions, they demonstrate a coherent progression toward unified, context-aware perception systems.
               Such efforts not only contribute to academic exploration but also carry significant practical implications for
               real-world applications, from intelligent vehicles and service robots to next-generation embodied agents
               capable of understanding and interacting with their surroundings in a human-like manner.