In addition, the remaining study explores the parallel implementation of a real-time visual simultaneous
localization and mapping (SLAM) system through heterogeneous parallel computing[11]. Although it does not rely on
deep learning, the study shares several conceptual commonalities with the aforementioned works. First, it
emphasizes computational efficiency and real-time performance, an objective aligned with the other
contributions that enhance learning and perception efficiency through architectural optimization or multi-
modal fusion. Second, similar to the deep learning-based studies, it contributes to the advancement of
embodied intelligence and autonomous systems, where robust perception and low-latency computation are
critical for deployment in complex real-world environments. Third, all four studies integrate vision with
high-performance computing strategies to manage large-scale visual data: the deep learning approaches do
so through large vision models and multi-modal learning, while this work adopts hardware-level
parallelization. Overall, this study complements the learning-based approaches by addressing computational
bottlenecks from a systems and hardware perspective, underscoring the importance of real-time, scalable
visual processing for embodied artificial intelligence.
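To make the notion of hardware-level parallelization concrete, the sketch below overlaps the data-parallel front-end of a visual SLAM pipeline (per-frame feature extraction) with a sequential tracking stage. This is a minimal Python illustration under assumed placeholder functions (`extract_features`, `track_pose`), not the CUDA implementation of the cited study[11].

```python
# Illustrative sketch only: the cited work [11] uses CUDA-based heterogeneous
# acceleration; here a CPU thread pool stands in for the parallel front-end.
# `extract_features` and `track_pose` are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def extract_features(frame: np.ndarray) -> np.ndarray:
    """Placeholder front-end: stand-in keypoint detection (data-parallel step)."""
    # A real system would run a detector such as ORB/FAST here, typically on the GPU.
    return np.argwhere(frame > frame.mean())[:500]

def track_pose(features: np.ndarray) -> np.ndarray:
    """Placeholder back-end consumer: sequential tracking/optimization step."""
    return features.mean(axis=0) if len(features) else np.zeros(2)

frames = [np.random.rand(480, 640) for _ in range(8)]  # stand-in camera stream

with ThreadPoolExecutor(max_workers=4) as pool:
    # Feature extraction for several frames proceeds in parallel ...
    futures = [pool.submit(extract_features, f) for f in frames]
    # ... while tracking consumes results in frame order, preserving causality.
    poses = [track_pose(fut.result()) for fut in futures]

print(len(poses), "poses estimated")
```

In the actual heterogeneous setting, the thread pool would be replaced by GPU kernels or CUDA streams, with the host CPU retained for the inherently sequential optimization stage.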
3. EXISTING CHALLENGES AND FUTURE RESEARCH TRENDS
Despite the remarkable progress in embodied artificial intelligence, several key challenges remain
unresolved. First, most existing frameworks exhibit limited generalization and adaptability when deployed
in diverse real-world environments. While deep learning-based methods have demonstrated impressive
capabilities, they often depend heavily on large-scale annotated datasets and struggle with continual
learning or domain transfer. This reliance frequently leads to catastrophic forgetting during long-term
autonomous operation. Second, current multi-modal perception systems lack efficient mechanisms for
cross-modal understanding and fusion, especially in dynamic scenarios where heterogeneous sensor data
exhibit misaligned spatio-temporal characteristics (a minimal alignment sketch follows this paragraph). Third, intelligent systems face significant constraints
related to computational and energy efficiency. The widespread dependence on large-scale vision-language
models and complex neural architectures hampers their real-time applicability on resource-constrained
platforms, such as mobile robots, aerial drones, and intelligent vehicles. Finally, there remains a gap between
algorithmic innovation and hardware-level optimization. Most research emphasizes model design while
neglecting the computational parallelization, scheduling, and optimization strategies required for efficient
deployment on heterogeneous devices.
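As a concrete instance of the fusion challenge noted above, the following sketch aligns a fast inertial stream to a slower camera timeline by interpolation. The sensor rates and signals are assumed for illustration and are not taken from the reviewed works.

```python
# Minimal sketch of one fusion sub-problem: heterogeneous sensors (e.g., a
# 200 Hz IMU and a 30 Hz camera) produce misaligned timestamps. Resampling
# the faster stream onto the slower one's timeline is a simple, assumed
# remedy, not a method from the reviewed papers.
import numpy as np

imu_t = np.arange(0.0, 1.0, 1 / 200)      # 200 Hz IMU timestamps (s)
imu_gyro_z = np.sin(2 * np.pi * imu_t)    # stand-in yaw-rate measurements
cam_t = np.arange(0.0, 1.0, 1 / 30)       # 30 Hz camera timestamps (s)

# Linearly interpolate IMU samples at each camera timestamp so both
# modalities share a common time base before any learned fusion step.
gyro_at_cam = np.interp(cam_t, imu_t, imu_gyro_z)

print(gyro_at_cam.shape)  # (30,) -- one aligned IMU value per camera frame
```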
Future research in embodied artificial intelligence is expected to evolve along several promising directions.
First, lifelong and continuous learning will become a central objective, enabling embodied agents to
incrementally acquire new skills and adapt to changing environments without retraining from scratch[12].
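One representative technique toward this goal (assumed here for illustration; reference [12] may differ) is elastic weight consolidation, which discourages changes to parameters that were important for earlier tasks:

```python
# Minimal PyTorch sketch of elastic weight consolidation (EWC), one common
# approach to the catastrophic-forgetting problem; illustrative only, not
# the specific method of reference [12].
import torch
import torch.nn.functional as F

def fisher_diagonal(model, loader, device="cpu"):
    """Estimate the diagonal Fisher information on the old task's data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in loader:
        model.zero_grad()
        loss = F.cross_entropy(model(x.to(device)), y.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, old_params, fisher, lam=100.0):
    """Quadratic penalty keeping parameters close to their old-task values."""
    loss = torch.tensor(0.0)
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return lam * loss
```

During new-task training, `ewc_penalty` is added to the ordinary task loss, so the agent adapts without overwriting previously consolidated knowledge.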
Second, multi-modal and cross-domain integration will deepen, aiming to build unified representations that
effectively combine vision, language, and sensory cues for robust reasoning and decision-making[13-15]. Third,
the development of large-scale foundation models specifically tailored for embodied perception, particularly
those inspired by the human dual-stream visual processing hypothesis, will catalyze progress toward
generalizable understanding and interaction across diverse tasks and contexts[16]. Fourth, the design of
lightweight and energy-efficient models will be essential for deploying complex vision-language
architectures on embedded platforms, necessitating advances in compression, pruning, and knowledge
distillation techniques[17].
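As a minimal, generic example of the distillation technique just mentioned (the loss form, temperature, and weighting below are standard textbook choices, not details of reference [17]), a compact student network can be trained against a larger teacher's softened outputs:

```python
# Generic sketch of a knowledge-distillation loss (temperature-scaled soft
# targets). The weight `alpha` and temperature `T` are illustrative choices.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard T^2 rescaling keeps gradient magnitudes comparable
    # Hard-target term: ordinary supervised loss on ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example: a batch of 8 samples over 10 classes.
s = torch.randn(8, 10)             # student logits
t = torch.randn(8, 10)             # frozen teacher logits
y = torch.randint(0, 10, (8,))     # ground-truth class indices
print(distillation_loss(s, t, y).item())
```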
Fifth, hardware-software co-optimization is expected to receive growing attention,
with the integration of parallel computing strategies, such as Compute Unified Device Architecture
(CUDA)-based heterogeneous acceleration and graphics processing unit (GPU)/tensor processing unit
(TPU) adaptation, playing a vital role in addressing the computational bottlenecks that currently constrain
real-time embodied intelligence[11] (a minimal device-offloading sketch follows this paragraph). Finally, future embodied artificial intelligence systems must prioritize
interpretability and safety to ensure transparent decision-making and foster trustworthy interactions
between intelligent agents and humans in complex, unstructured environments.
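As a hedged illustration of the device offloading underlying such co-optimization (a generic PyTorch sketch, not the CUDA system of reference [11]), a compute-bound stage can be placed on an accelerator when one is available:

```python
# Minimal illustration of heterogeneous placement: run a compute-heavy stage
# on an accelerator when available and fall back to the CPU otherwise.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(2048, 2048)
w = torch.randn(2048, 2048)

# Offload the large matrix product (the bottleneck stage) to the device;
# lightweight pre/post-processing can remain on the host CPU.
y = (x.to(device) @ w.to(device)).relu()
print(y.device, y.shape)
```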

