
               In addition, the remaining study[11] explores the parallel implementation of a real-time visual simultaneous
               localization and mapping system through heterogeneous parallel computing. Although it does not rely on
               deep learning, the study shares several conceptual commonalities with the aforementioned works. First, it
               emphasizes computational efficiency and real-time performance, an objective aligned with the other
               contributions that enhance learning and perception efficiency through architectural optimization or multi-
               modal fusion. Second, similar to the deep learning-based studies, it contributes to the advancement of
               embodied intelligence and autonomous systems, where robust perception and low-latency computation are
               critical for deployment in complex real-world environments. Third, all four studies integrate vision with
               high-performance computing strategies to manage large-scale visual data: the deep learning approaches do
               so through large vision models and multi-modal learning, while this work adopts hardware-level
               parallelization. Overall, this study complements the learning-based approaches by addressing computational
               bottlenecks from a systems and hardware perspective, underscoring the importance of real-time, scalable
               visual processing for embodied artificial intelligence.
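To make the pipeline-level parallelism discussed above concrete, here is a minimal Python sketch of the decomposition commonly used in real-time visual simultaneous localization and mapping: a frame-rate tracking stage and a slower keyframe-rate mapping stage run concurrently and communicate through a queue. All names (tracking_loop, mapping_loop, the one-in-five keyframe rule) are illustrative assumptions; the cited study[11] targets heterogeneous hardware acceleration, which this thread-based toy does not reproduce.

```python
import queue
import threading

# Keyframes flow from the tracking stage to the mapping stage through a
# bounded queue; a None sentinel signals shutdown.
keyframe_queue = queue.Queue(maxsize=8)

def tracking_loop(frames):
    """Per-frame pose tracking; must keep up with the camera's frame rate."""
    for frame_id in frames:
        # ... estimate the camera pose for this frame (placeholder) ...
        if frame_id % 5 == 0:             # promote every 5th frame to a keyframe
            keyframe_queue.put(frame_id)  # hand off to the mapping stage
    keyframe_queue.put(None)              # sentinel: no more frames

def mapping_loop():
    """Keyframe-rate map refinement; tolerates higher latency than tracking."""
    while True:
        frame_id = keyframe_queue.get()
        if frame_id is None:
            break
        # ... triangulate points / run local optimization (placeholder) ...

mapper = threading.Thread(target=mapping_loop)
mapper.start()
tracking_loop(range(100))  # stand-in for a live camera stream
mapper.join()
```

The key design point is that the latency-critical stage never waits on the expensive stage, which is the same motivation that drives the hardware-level parallelization adopted in the cited work.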


               3. EXISTING CHALLENGES AND FUTURE RESEARCH TRENDS
               Despite the remarkable progress in embodied artificial intelligence, several key challenges remain
               unresolved. First, most existing frameworks exhibit limited generalization and adaptability when deployed
               in diverse real-world environments. While deep learning-based methods have demonstrated impressive
               capabilities, they often depend heavily on large-scale annotated datasets and struggle with continual
               learning or domain transfer. This reliance frequently leads to catastrophic forgetting during long-term
               autonomous operation. Second, current multi-modal perception systems lack efficient mechanisms for
               cross-modal understanding and fusion, especially in dynamic scenarios where heterogeneous sensor data
               present misaligned spatio-temporal characteristics. Third, intelligent systems face significant constraints
               related to computational and energy efficiency. The widespread dependence on large-scale vision-language
               models and complex neural architectures hampers their real-time applicability on resource-constrained
               platforms, such as mobile robots, aerial drones, and intelligent vehicles. Finally, there remains a gap between
               algorithmic innovation and hardware-level optimization. Most research emphasizes model design while
               neglecting the computational parallelization, scheduling, and optimization strategies required for efficient
               deployment on heterogeneous devices.
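To illustrate the misaligned spatio-temporal characteristics mentioned above, the sketch below pairs measurements from two sensors that sample at different rates, a step any cross-modal fusion pipeline must perform before features can be combined. The sensor rates, the 5 ms tolerance, and the function name pair_nearest are hypothetical example values, not taken from the studies under discussion.

```python
import bisect

def pair_nearest(cam_ts, imu_ts, tol=0.005):
    """Pair each camera stamp with the closest IMU stamp within tol seconds."""
    pairs = []
    for t in cam_ts:
        i = bisect.bisect_left(imu_ts, t)                 # imu_ts must be sorted
        candidates = [imu_ts[j] for j in (i - 1, i) if 0 <= j < len(imu_ts)]
        best = min(candidates, key=lambda s: abs(s - t))  # nearest neighbour
        if abs(best - t) <= tol:
            pairs.append((t, best))
    return pairs

cam = [0.000, 0.033, 0.066]           # ~30 Hz camera timestamps (seconds)
imu = [i * 0.005 for i in range(20)]  # 200 Hz IMU timestamps (seconds)
print(pair_nearest(cam, imu))         # pairs each frame with its nearest IMU sample
```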


               Future research in embodied artificial intelligence is expected to evolve along several promising directions.
               First, lifelong and continuous learning will become a central objective, enabling embodied agents to
               incrementally acquire new skills and adapt to changing environments without retraining from scratch[12].
               Second, multi-modal and cross-domain integration will deepen, aiming to build unified representations that
               effectively combine vision, language, and sensory cues for robust reasoning and decision-making[13-15]. Third,
               the development of large-scale foundation models specifically tailored for embodied perception, particularly
               those inspired by the human dual-stream visual processing hypothesis, will catalyze progress toward
               generalizable understanding and interaction across diverse tasks and contexts[16]. Fourth, the design of
               lightweight and energy-efficient models will be essential for deploying complex visual-language
               architectures on embedded platforms, necessitating advances in compression, pruning, and knowledge
               distillation techniques[17]. Fifth, hardware-software co-optimization is expected to receive growing attention,
               with the integration of parallel computing strategies, such as Compute Unified Device Architecture
               (CUDA)-based heterogeneous acceleration and graphics processing unit (GPU)/tensor processing unit
               (TPU) adaptation, playing a vital role in addressing the computational bottlenecks that currently constrain
               real-time embodied intelligence[11]. Finally, future embodied artificial intelligence systems must prioritize
               interpretability and safety to ensure transparent decision-making and foster trustworthy interactions
               between intelligent agents and humans in complex, unstructured environments.
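As a concrete instance of the compression toolbox mentioned above, the following sketch computes the classic temperature-scaled knowledge distillation loss, in which a compact student model is trained to match a large teacher model's softened output distribution. The logit values and the temperature T = 4.0 are arbitrary illustrative choices, not parameters from the cited work.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    p = softmax(teacher_logits / T)     # softened teacher targets
    q = softmax(student_logits / T)     # softened student predictions
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

teacher = np.array([4.0, 1.0, -2.0])   # arbitrary example logits
student = np.array([2.5, 1.5, -1.0])
print(distillation_loss(teacher, student))  # small positive value; 0 iff q == p
```

Raising the temperature exposes the teacher's relative confidence across non-target classes, which is the extra signal that lets a small student approach the teacher's accuracy at a fraction of the inference cost.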