He et al. Intell. Robot. 2025, 5(2), 313-32 Intelligence & Robotics
DOI: 10.20517/ir.2025.16
Research Article Open Access
MSAFNet: a novel approach to facial expression recognition in embodied AI systems
Huifang He¹, Runbin Liao², Yating Li³
1 School of Information Engineering, Guangdong Engineering Polytechnic, Guangzhou 510520, Guangdong, China.
2 School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou 510006, Guangdong, China.
3 School of Data Science and Engineering, and Xingzhi College, South China Normal University, Shanwei 516600, Guangdong, China.
Correspondence to: Prof. Huifang He, School of Information Engineering, Guangdong Engineering Polytechnic, No. 123 Keji Road,
Tianhe District, Guangzhou 510520, Guangdong, China. E-mail: hehuifang@gdep.edu.cn; ORCID: 0000-0002-7204-0736
How to cite this article: He, H.; Liao, R.; Li, Y. MSAFNet: a novel approach to facial expression recognition in embodied AI systems.
Intell. Robot. 2025, 5(2), 313-32. http://dx.doi.org/10.20517/ir.2025.16
Received: 24 Oct 2024 First Decision: 13 Feb 2025 Revised: 11 Mar 2025 Accepted: 13 Mar 2025 Published: 11 Apr 2025
Academic Editor: Simon Yang Copy Editor: Pei-Yun Wang Production Editor: Pei-Yun Wang
Abstract
In embodied artificial intelligence (EAI), accurately recognizing human facial expressions is crucial for intuitive and effective human-robot interactions. We introduce the multi-scale attention and convolution-Transformer fusion network (MSAFNet), a deep learning framework tailored for EAI that dynamically detects and processes facial expressions, enabling adaptive interactions based on the user's emotional state. The proposed network comprises three components: a local feature extraction module that uses attention mechanisms to focus on key facial regions, a global feature extraction module that employs Transformer-based architectures to capture comprehensive global information, and a global-local feature fusion module that integrates these complementary features to improve recognition accuracy. Experimental results on the widely used FER2013 and RAF-DB datasets show that the proposed data-driven approach consistently outperforms existing state-of-the-art methods.
Keywords: Facial expression recognition, multi-scale attention, feature fusion, data-driven
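To make the three-module design described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of a convolution-plus-attention local branch, a Transformer-based global branch, and a global-local fusion classifier. It is not the authors' MSAFNet implementation; all layer choices, dimensions, and the seven-class output are illustrative assumptions.

# Minimal PyTorch sketch of a three-module global-local fusion design.
# NOT the authors' released code; names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """Convolutional features with a simple channel-attention gate (assumed)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.att = nn.Sequential(  # squeeze-and-excitation style gate
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid()
        )

    def forward(self, x):
        f = self.conv(x)
        return f * self.att(f)  # emphasize salient facial regions

class GlobalBranch(nn.Module):
    """Transformer encoder over patch tokens to capture global context."""
    def __init__(self, dim=64, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # patchify
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens).mean(dim=1)            # pooled global feature

class FusionFER(nn.Module):
    """Fuse local and global features, then classify the expression."""
    def __init__(self, dim=64, num_classes=7):
        super().__init__()
        self.local_branch = LocalBranch(dim)
        self.global_branch = GlobalBranch(dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, x):
        local = self.local_branch(x).mean(dim=(2, 3))  # pool local feature map
        fused = torch.cat([local, self.global_branch(x)], dim=1)
        return self.head(fused)

# Example: forward pass on a dummy batch of 64x64 RGB face crops,
# with seven basic expression classes assumed.
logits = FusionFER()(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 7])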
1. INTRODUCTION
Facial expression, serving as one of the most direct and natural social signals in human communication [1], holds a critical role in interpersonal interactions and is a vital conduit for emotional exchange [2]. The ability to interpret these expressions accurately is fundamental to the paradigm of embodied artificial intelligence (EAI),
© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

