and explores. Agents are presumed to be honest but curious, i.e., they honestly follow the learning mechanism
but are curious about private information held by other agents. For this reason, the data used for training is
stored only with its owner and is not transferred to the coordinator. HFRL provides an implementation method for
sharing experiences under the constraints of privacy protection. Additionally, various reasons limit the agent’s
ability to explore the environment in a balanced manner. Participating agents may include heterogeneous
devices. The amount of data collected by each agent is unbalanced due to mobility, observation, energy and
other factors. However, all participants have sufficient computing, storage, and communication capabilities.
These capabilities assist the agent in completing model training, merging, and other basic processes. Finally,
the environment observed by an agent may change dynamically, causing differences in data distribution. The
participating agents need to update the model in time to quickly adapt to environmental changes and construct
a personalized local model.
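These characteristics can be made concrete with a short sketch of the coordinator-side aggregation step. The snippet below is an illustrative, FedAvg-style example only (the function and variable names are hypothetical and not taken from any cited system): each agent reports its locally trained policy parameters together with its sample count, so raw experience never leaves the data owner and unbalanced data collection is weighted accordingly.

import numpy as np

def aggregate_local_models(local_params, sample_counts):
    # FedAvg-style aggregation for HFRL (illustrative sketch).
    # local_params  : one dict per agent, mapping layer name -> np.ndarray
    # sample_counts : number of locally collected transitions per agent,
    #                 used to weight agents with unbalanced data
    # Only parameters are exchanged; raw trajectories stay with each agent.
    total = float(sum(sample_counts))
    weights = [n / total for n in sample_counts]
    global_params = {}
    for name in local_params[0]:
        global_params[name] = sum(
            w * params[name] for w, params in zip(weights, local_params)
        )
    return global_params

# Example: three agents with unbalanced amounts of local experience.
agents = [{"w": np.random.randn(4, 2)} for _ in range(3)]
shared = aggregate_local_models(agents, sample_counts=[100, 1000, 250])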
In existing RL studies, some applications that meet the above characteristics can be classified as HFRL. Nadiger
et al. [56] present a typical HFRL architecture, which includes the grouping policy, the learning policy, and the
federation policy. In this work, RL is used to show the applicability of granular personalization and FL is used
to reduce training time. To demonstrate the effectiveness of the proposed architecture, a non-player character
in the Atari game Pong is implemented and evaluated. In the study from Liu et al. [57] , the authors propose
lifelong federated reinforcement learning (LFRL) for navigation in cloud robotic systems. It enables the
robot to learn efficiently in a new environment and use prior knowledge to quickly adapt to the changes in
the environment. Each robot trains a local model according to its own navigation task, and the centralized
cloud server implements a knowledge fusion algorithm for upgrading a shared model. Considering that
the local model and the shared model might have different network structures, the paper proposes applying
transfer learning to improve the performance and efficiency of the shared model. Further, researchers also
focus on HFRL-based applications in the IoT due to the high demand for privacy protection. Ren et al. [58]
suggest deploying the FL architecture between edge nodes and IoT devices for computation offloading tasks.
IoT devices can download the RL model from edge nodes and train the local model using their own data, such as
the remaining energy resources and the workload of the device. The edge node aggregates the updated private
model into the shared model. Although this method considers privacy protection issues, it requires further
evaluation of the communication cost incurred by the model exchange. In addition, the work [59]
proposes a federated deep-reinforcement-learning-based framework (FADE) for edge caching. Edge devices,
including base stations (BSs), cooperatively learn a predictive model: each device uses the parameters from the
first round of global training for local learning and then uploads its locally tuned parameters to the next round
of global training. By keeping training on local devices, FADE enables fast training and decouples the learning
process between the cloud and the data owners in a distributed-centralized manner. More HFRL-based applications will
be classified and summarized in the next section.
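The device-side pattern shared by the applications above, i.e., download the current shared model, train on private local data, and upload only the tuned parameters, can be sketched as follows. The device and coordinator interfaces used here are hypothetical placeholders rather than APIs from the cited works.

def run_federated_round(device, coordinator, local_steps=500):
    # One HFRL round from a participating device's point of view (sketch).
    # `device` and `coordinator` are hypothetical interfaces standing in for
    # whatever RL learner and aggregation server a given system uses.

    # 1. Download the current shared model from the coordinator / edge node.
    params = coordinator.get_global_params()
    device.learner.load_params(params)

    # 2. Train locally on private data (energy state, workload, cache statistics);
    #    raw experience never leaves the device.
    for _ in range(local_steps):
        transition = device.collect_transition()
        device.learner.update(transition)

    # 3. Upload only the locally tuned parameters for the next global round.
    coordinator.submit_update(device.learner.get_params(), num_samples=local_steps)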
Prior to HFRL, a variety of distributed RL algorithms have been extensively investigated, which are closely
related to HFRL. In general, distributed RL algorithms can be divided into two types: synchronized and
asynchronous. In synchronous RL algorithms, such as synchronous stochastic optimization (Sync-Opt) [60] and
parallel advantage actor-critic (PAAC) [3], the agents explore their own environments separately,
and after a number of samples are collected, the global parameters are updated synchronously. On the contrary,
the coordinator will update the global model immediately after receiving the gradient from an arbitrary agent
in asynchronous RL algorithms, rather than waiting for other agents. Several asynchronous RL algorithms
are presented, including A3C [61], Impala [62], Ape-X [63], and the general reinforcement learning architecture
(Gorila) [1].
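The difference between the two schemes can be illustrated with a minimal sketch: a synchronous step waits for gradients from every agent before applying one averaged update, whereas an asynchronous step applies each agent's gradient as soon as it arrives. The compute_gradient interface and the plain gradient-descent update below are illustrative assumptions, not the cited algorithms.

def synchronous_update(global_params, agents, lr=0.01):
    # Synchronous scheme: wait until every agent has returned a gradient,
    # then apply a single averaged update to the global parameters.
    grads = [agent.compute_gradient(global_params) for agent in agents]
    avg = {k: sum(g[k] for g in grads) / len(grads) for k in global_params}
    return {k: global_params[k] - lr * avg[k] for k in global_params}

def asynchronous_update(global_params, agent_gradient, lr=0.01):
    # Asynchronous scheme: apply one agent's gradient as soon as it arrives,
    # without waiting for the remaining agents.
    return {k: global_params[k] - lr * agent_gradient[k] for k in global_params}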
From the perspective of technology development, HFRL can also be considered a security-enhanced
parallel RL. In parallel RL, multiple agents interact with a stochastic environment to seek the optimal policy
for the same task [1,2] . By building a closed loop of data and knowledge in parallel systems, parallel RL helps
determine the next course of action for each agent. The state and action representations are fed into a de-
signed neural network to approximate the action value function [64] . However, parallel RL typically transfers