

               and explores. Agents are presumed to be honest but curious, i.e., they honestly follow the learning mechanism
               but are curious about private information held by other agents. Due to this, the data used for training is only
               stored at the owner and is not transferred to the coordinator. HFRL provides an implementation method for
               sharing experiences under the constraints of privacy protection. Additionally, various reasons limit the agent’s
               ability to explore the environment in a balanced manner. Participating agents may include heterogeneous
               devices. The amount of data collected by each agent is unbalanced due to mobility, observation, energy and
               other factors. However, all participants have sufficient computing, storage, and communication capabilities.
               These capabilities assist the agent in completing model training, merging, and other basic processes. Finally,
the environment observed by an agent may change dynamically, causing differences in data distribution. The participating agents need to update the model in time to adapt quickly to environmental changes and construct a personalized local model.
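As an illustration of this setting, the following minimal sketch shows one HFRL training round in which each agent trains on its private data and uploads only model parameters, which the coordinator fuses by a FedAvg-style weighted average. The local RL update is replaced by a synthetic placeholder gradient, and all names, weights, and data sizes are illustrative assumptions rather than a specific published algorithm.

import numpy as np

rng = np.random.default_rng(42)

def local_rl_update(global_params, private_data, lr=0.1):
    # One round of local training on an agent's private experience.
    # The RL update itself is abstracted away: a synthetic gradient is
    # derived from the private data purely for illustration. Raw
    # experience never leaves the agent; only parameters are returned.
    synthetic_grad = private_data.mean(axis=0) - global_params
    return global_params + lr * synthetic_grad

def federated_average(local_params, weights):
    # Coordinator-side aggregation: a weighted average of local models.
    weights = np.asarray(weights, dtype=float)
    return np.average(local_params, axis=0, weights=weights / weights.sum())

# Heterogeneous agents holding unbalanced amounts of locally stored data.
n_agents, param_dim = 4, 8
data_sizes = [30, 120, 55, 10]
private_datasets = [rng.normal(size=(n, param_dim)) for n in data_sizes]

global_params = np.zeros(param_dim)
for round_idx in range(5):
    # Each agent downloads the shared model, trains locally, and uploads
    # only its parameters (never its data) to the coordinator.
    local_models = [local_rl_update(global_params, d) for d in private_datasets]
    global_params = federated_average(local_models, data_sizes)

print("shared model after 5 rounds:", np.round(global_params, 3))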


               In existing RL studies, some applications that meet the above characteristics can be classified as HFRL. Nadiger
et al. [56] present a typical HFRL architecture, which includes the grouping policy, the learning policy, and the
               federation policy. In this work, RL is used to show the applicability of granular personalization and FL is used
               to reduce training time. To demonstrate the effectiveness of the proposed architecture, a non-player character
in the Atari game Pong is implemented and evaluated. In the study by Liu et al. [57], the authors propose lifelong federated reinforcement learning (LFRL) for navigation in cloud robotic systems. It enables the
               robot to learn efficiently in a new environment and use prior knowledge to quickly adapt to the changes in
               the environment. Each robot trains a local model according to its own navigation task, and the centralized
cloud server implements a knowledge fusion algorithm for upgrading the shared model. Considering that
               the local model and the shared model might have different network structures, this paper proposes to apply
               transfer learning to improve the performance and efficiency of the shared model. Further, researchers also
               focus on HFRL-based applications in the IoT due to the high demand for privacy protection. Ren et al. [58]
               suggest deploying the FL architecture between edge nodes and IoT devices for computation offloading tasks.
IoT devices can download the RL model from edge nodes and train the local model using their own data, such as the remaining energy resources and the workload of the IoT device. The edge node then aggregates the updated private models into the shared model. Although this method considers privacy protection, the communication cost incurred by the model exchange requires further evaluation. In addition, the work [59]
               proposes a federated deep-reinforcement-learning-based framework (FADE) for edge caching. Edge devices,
including base stations (BSs), can cooperatively learn a predictive model: each device initializes its local learning with the parameters from the current round of global training and then uploads its locally tuned parameters for the next round of global aggregation. By keeping the training on local devices, FADE can enable fast training and decouple the learning process
               between the cloud and data owner in a distributed-centralized manner. More HFRL-based applications will
               be classified and summarized in the next section.
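A concrete question raised by the LFRL work above is how a shared cloud model whose network structure differs from the local models can absorb their knowledge. The sketch below illustrates one plausible reading of such a knowledge-fusion step as policy distillation on a batch of probe states; the toy architectures, the averaging of teacher policies, and all helper names are assumptions for illustration and are not the actual algorithm of Liu et al. [57].

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 8, 4

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_local_policy():
    # Hypothetical local policy: a small two-layer network. Each robot may
    # use a different architecture; all that matters is that it can be
    # queried for action probabilities on a batch of probe states.
    W1 = rng.normal(size=(STATE_DIM, 16))
    W2 = rng.normal(size=(16, N_ACTIONS))
    return lambda s: softmax(np.tanh(s @ W1) @ W2)

local_policies = [make_local_policy() for _ in range(3)]

# Shared (cloud) model with a different, simpler structure: linear softmax.
W_shared = np.zeros((STATE_DIM, N_ACTIONS))

# Probe states on which knowledge is fused (e.g., replayed observations).
probe_states = rng.normal(size=(256, STATE_DIM))

# Teacher targets: the averaged action distributions of the local policies.
teacher = np.mean([p(probe_states) for p in local_policies], axis=0)

# Distill the fused knowledge into the shared model by minimizing the
# cross-entropy between the teacher and student distributions.
lr = 0.5
for _ in range(200):
    student = softmax(probe_states @ W_shared)
    grad = probe_states.T @ (student - teacher) / len(probe_states)
    W_shared -= lr * grad

student = softmax(probe_states @ W_shared)
print("final cross-entropy:", float(-(teacher * np.log(student + 1e-12)).sum(axis=1).mean()))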


               Prior to HFRL, a variety of distributed RL algorithms have been extensively investigated, which are closely
related to HFRL. In general, distributed RL algorithms can be divided into two types: synchronous and asynchronous. In synchronous RL algorithms, such as synchronous stochastic optimization (Sync-Opt) [60] and parallel advantage actor-critic (PAAC) [3], the agents explore their own environments separately, and after a number of samples are collected, the global parameters are updated synchronously. By contrast,
               the coordinator will update the global model immediately after receiving the gradient from an arbitrary agent
               in asynchronous RL algorithms, rather than waiting for other agents. Several asynchronous RL algorithms
have been proposed, including A3C [61], Impala [62], Ape-X [63], and the general reinforcement learning architecture (Gorila) [1].
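As a minimal, single-process sketch of the difference between the two schemes, the toy example below contrasts a synchronous update (wait for all agents, average their gradients, then apply one step) with an asynchronous update (apply each gradient as soon as any agent reports it). The quadratic objective and the placeholder gradient function are assumptions for illustration only and do not reproduce the cited algorithms.

import numpy as np

rng = np.random.default_rng(7)
N_AGENTS, DIM, LR = 4, 6, 0.1
target = rng.normal(size=DIM)  # stands in for the unknown optimal parameters

def noisy_gradient(params):
    # Placeholder for the gradient an agent would compute from its own rollouts.
    return (params - target) + 0.05 * rng.normal(size=DIM)

# Synchronous scheme (Sync-Opt / PAAC style): wait for every agent's gradient,
# aggregate them, and apply a single update to the global parameters.
params_sync = np.zeros(DIM)
for step in range(100):
    grads = [noisy_gradient(params_sync) for _ in range(N_AGENTS)]
    params_sync -= LR * np.mean(grads, axis=0)

# Asynchronous scheme (A3C / Gorila style): the coordinator applies whichever
# agent's gradient arrives first, without waiting for the others; in a real
# deployment that gradient may have been computed from a stale parameter copy.
params_async = np.zeros(DIM)
for step in range(100):
    params_async -= LR * noisy_gradient(params_async)

print("sync  error:", float(np.linalg.norm(params_sync - target)))
print("async error:", float(np.linalg.norm(params_async - target)))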
From the perspective of technology development, HFRL can also be considered security-enhanced parallel RL. In parallel RL, multiple agents interact with a stochastic environment to seek the optimal policy
               for the same task [1,2] . By building a closed loop of data and knowledge in parallel systems, parallel RL helps
determine the next course of action for each agent. The state and action representations are fed into a designed neural network to approximate the action value function [64]. However, parallel RL typically transfers