               1.1.1. Federated reinforcement learning
               There are two main areas of research in FRL currently: horizontal federated reinforcement learning (HFRL),
               and vertical federated reinforcement learning (VFRL). HFRL has been selected as the algorithm of choice
               for the purposes of this study. HFRL and VFRL differ with respect to the structure of their environments and
aggregation methods. All agents in an HFRL architecture use isolated environments. It follows that each agent's
               action in an HFRL system has no effect on the other agents in the system. An HFRL architecture proposes
the following training cycle for each agent: first, a training step is performed locally; second, environment-specific parameters are uploaded to the aggregation server; and lastly, parameters are aggregated according to the aggregation method and returned to each agent in the system for another local training step. HFRL bears similarities to "Parallel RL", a long-studied area of RL in which agents exchange gradients with one another [5,10,11].
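To make this cycle concrete, the sketch below shows one HFRL round in Python. It is a minimal illustration only, assuming hypothetical agent methods set_params, local_train and get_params and a simple element-wise mean as the aggregation method; it is not the exact procedure of any of the works cited here.

```python
import numpy as np

def hfrl_round(agents, global_params):
    """One HFRL round: local training, parameter upload, aggregation, broadcast."""
    uploads = []
    for agent in agents:
        # Each agent trains in its own isolated environment, so its actions
        # have no effect on the other agents in the system.
        agent.set_params(global_params)      # start from the latest global parameters
        agent.local_train()                  # 1. local training step
        uploads.append(agent.get_params())   # 2. upload environment-specific parameters

    # 3. aggregate (here a simple element-wise mean per layer) and
    #    return the result to every agent for the next local training step
    new_params = [np.mean(np.stack(layer), axis=0) for layer in zip(*uploads)]
    for agent in agents:
        agent.set_params(new_params)
    return new_params
```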


Reinforcement learning is often a sequential learning process, and as such its data are often non-IID with a small sample space [12]. HFRL allows experience to be aggregated across agents while increasing sample efficiency, thus providing more accurate and stable learning [13]. Some of the current works applying HFRL to a variety
               of applications are summarized below.

               A study by Lim et al. aims to increase the performance of RL methods applied to multi-IoT device systems.
RL models trained on single devices are often unable to control devices in a similar albeit slightly different environment [5]. Currently, multiple devices need to be trained separately using separate RL agents [5]. The methods proposed by Lim et al. sped up the learning process by 1.5 times for a two-agent system. In a study by Nadiger et al., the challenges in the personalization of dialogue managers, smart assistants and more are explored. RL has proven to be successful in practice for personalized experiences; however, long learning times and no sharing of data limit the ability of RL to be applied at scale. Applying HFRL to Atari non-playable characters in Pong showed a median improvement of 17% in personalization time [10]. Lastly, Liu et al.
               discuss RL as a promising algorithm for smart navigation systems, with the following challenges: long training
               times, poor generalization across environments, and storing data over long periods of time [14] . In order to
               address these problems, Liu et al. proposed the architecture ‘Lifelong FRL’, which can be categorized as an
HFRL problem. Liu et al. found that Lifelong FRL increased the learning rate for smart navigation systems when tested on robots in a cloud robotic system [14].


               The successes of the FedAvg algorithm as a means to improve performance and training times for systems
have inspired further research into how aggregation methods should be applied. The design of the aggregation method is crucial in providing performance benefits relative to the base case where FRL is not applied. The
               FedAvg [3]  algorithm proposed the averaging of gradients in the aggregation method. In contrast, Liang et al.
               proposed using model weights in the aggregation method for AV steering control [15] . Thus, FRL applications
               can differ based upon the selection of which parameter to use in the aggregation method. A study by Zhang
               et al. explores applying FRL to a decentralized DRL system optimizing cellular vehicle-to-everything commu-
nication [16]. Zhang et al. utilize model weights in the aggregation method, and describe a weighting factor obtained by dividing the total batch size across all agents by the training batch size of a specific agent [16]. In addition, the
               works of Lim et al. explore how FRL using gradient aggregation can improve convergence speed and perfor-
mance on the OpenAI Gym environments CartPole-v0, MountainCarContinuous-v0, Pendulum-v0 and Acrobot-v1 [17]. Lim et al. determined that aggregating gradients using FRL creates high-performing agents for each of the OpenAI Gym environments relative to models trained without FRL [17]. In addition, Wang et al.
               apply FRL to heterogeneous edge caching [18] . Wang et al. show the effectiveness of FRL using weight aggrega-
               tion to improve hit rate, reduce average delays in the network and offload traffic [18] . Lastly, Huang et al. apply
               FRL using model weight aggregation to Service Function Chains in network function virtualization enabled
               networks [19] . Huang et al. observe that FRL using model weight aggregation provides benefits to convergence
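As a rough illustration of how the choice of aggregated parameter and weighting factor changes the server-side update, the sketch below contrasts plain gradient averaging with model-weight aggregation scaled by a batch-size factor of the kind described by Zhang et al. The function names and the normalization of the factors are assumptions made for illustration, not the exact formulations used in the cited works.

```python
import numpy as np

def aggregate_gradients(agent_gradients):
    """FedAvg-style aggregation: element-wise average of the per-agent gradients."""
    return [np.mean(np.stack(layer), axis=0) for layer in zip(*agent_gradients)]

def aggregate_weights(agent_weights, batch_sizes):
    """Model-weight aggregation with a batch-size weighting factor.

    Each agent i receives a factor of (sum of all batch sizes) / (its own batch
    size), following the factor described for Zhang et al.; the factors are then
    normalized so the result is a convex combination (an assumption made here so
    the aggregated weights stay on the same scale as the inputs).
    """
    total = float(sum(batch_sizes))
    factors = np.array([total / b for b in batch_sizes])
    factors /= factors.sum()
    return [
        np.tensordot(factors, np.stack(layer), axes=1)  # weighted sum over agents
        for layer in zip(*agent_weights)
    ]
```

Whether gradients or model weights are passed to such an aggregation function is the principal design choice distinguishing the works surveyed above.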