               1.1.1. Federated reinforcement learning
               There are two main areas of research in FRL currently: horizontal federated reinforcement learning (HFRL),
               and vertical federated reinforcement learning (VFRL). HFRL has been selected as the algorithm of choice
               for the purposes of this study. HFRL and VFRL differ with respect to the structure of their environments and
aggregation methods. All agents in an HFRL architecture use isolated environments. It follows that each agent's
               action in an HFRL system has no effect on the other agents in the system. An HFRL architecture proposes
the following training cycle for each agent: first, a training step is performed locally; second, environment-specific parameters are uploaded to the aggregation server; and lastly, parameters are aggregated according to the aggregation method and returned to each agent in the system for another local training step. HFRL bears similarities to "Parallel RL", a long-studied area of RL in which agents exchange gradients with one another [5,10,11].
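To make this cycle concrete, the sketch below shows one HFRL round in Python. It is a minimal illustration only, assuming hypothetical agent methods set_params, local_train and get_params and a simple element-wise mean as the aggregation method; it is not the exact procedure of any of the works cited here.

```python
import numpy as np

def hfrl_round(agents, global_params):
    """One HFRL round: local training, parameter upload, aggregation, broadcast."""
    uploads = []
    for agent in agents:
        # Each agent trains in its own isolated environment, so its actions
        # have no effect on the other agents in the system.
        agent.set_params(global_params)      # start from the latest global parameters
        agent.local_train()                  # 1. local training step
        uploads.append(agent.get_params())   # 2. upload environment-specific parameters

    # 3. aggregate (here a simple element-wise mean per layer) and
    #    return the result to every agent for the next local training step
    new_params = [np.mean(np.stack(layer), axis=0) for layer in zip(*uploads)]
    for agent in agents:
        agent.set_params(new_params)
    return new_params
```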


Reinforcement learning is often a sequential learning process, and as such its data are often non-IID with a small sample space [12]. HFRL allows experience to be aggregated across agents while increasing sample efficiency, thus providing more accurate and stable learning [13]. Some of the current works applying HFRL to a variety
               of applications are summarized below.

               A study by Lim et al. aims to increase the performance of RL methods applied to multi-IoT device systems.
RL models trained on single devices are often unable to control devices in a similar albeit slightly different environment [5]. Currently, multiple devices need to be trained separately using separate RL agents [5]. The methods proposed by Lim et al. sped up the learning process by 1.5 times for a two-agent system. In a study by Nadiger et al., the challenges in the personalization of dialogue managers, smart assistants and more are explored. RL has proven to be successful in practice for personalized experiences; however, long learning times and no sharing of data limit the ability of RL to be applied at scale. Applying HFRL to Atari non-playable characters in Pong showed a median improvement of 17% in personalization time [10]. Lastly, Liu et al.
               discuss RL as a promising algorithm for smart navigation systems, with the following challenges: long training
               times, poor generalization across environments, and storing data over long periods of time [14] . In order to
               address these problems, Liu et al. proposed the architecture ‘Lifelong FRL’, which can be categorized as an
HFRL problem. Liu et al. found that Lifelong FRL increased the learning rate for smart navigation systems when tested on robots in a cloud robotic system [14].


               The successes of the FedAvg algorithm as a means to improve performance and training times for systems
have inspired further research into how aggregation methods should be applied. The design of the aggregation method is crucial in providing performance benefits relative to the base case where FRL is not applied. The
               FedAvg [3]  algorithm proposed the averaging of gradients in the aggregation method. In contrast, Liang et al.
               proposed using model weights in the aggregation method for AV steering control [15] . Thus, FRL applications
               can differ based upon the selection of which parameter to use in the aggregation method. A study by Zhang
               et al. explores applying FRL to a decentralized DRL system optimizing cellular vehicle-to-everything commu-
nication [16]. Zhang et al. utilize model weights in the aggregation method, and describe a weighting factor obtained by dividing the total batch size across all agents by the training batch size of a specific agent [16]. In addition, the
               works of Lim et al. explore how FRL using gradient aggregation can improve convergence speed and perfor-
mance on the OpenAI Gym environments CartPole-v0, MountainCarContinuous-v0, Pendulum-v0 and Acrobot-v1 [17]. Lim et al. determined that aggregating gradients using FRL creates high-performing agents for each of the OpenAI Gym environments relative to models trained without FRL [17]. In addition, Wang et al.
               apply FRL to heterogeneous edge caching [18] . Wang et al. show the effectiveness of FRL using weight aggrega-
               tion to improve hit rate, reduce average delays in the network and offload traffic [18] . Lastly, Huang et al. apply
               FRL using model weight aggregation to Service Function Chains in network function virtualization enabled
               networks [19] . Huang et al. observe that FRL using model weight aggregation provides benefits to convergence
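As a rough illustration of how the choice of aggregated parameter and weighting factor changes the server-side update, the sketch below contrasts plain gradient averaging with model-weight aggregation scaled by a batch-size factor of the kind described by Zhang et al. The function names and the normalization of the factors are assumptions made for illustration, not the exact formulations used in the cited works.

```python
import numpy as np

def aggregate_gradients(agent_gradients):
    """FedAvg-style aggregation: element-wise average of the per-agent gradients."""
    return [np.mean(np.stack(layer), axis=0) for layer in zip(*agent_gradients)]

def aggregate_weights(agent_weights, batch_sizes):
    """Model-weight aggregation with a batch-size weighting factor.

    Each agent i receives a factor of (sum of all batch sizes) / (its own batch
    size), following the factor described for Zhang et al.; the factors are then
    normalized so the result is a convex combination (an assumption made here so
    the aggregated weights stay on the same scale as the inputs).
    """
    total = float(sum(batch_sizes))
    factors = np.array([total / b for b in batch_sizes])
    factors /= factors.sum()
    return [
        np.tensordot(factors, np.stack(layer), axes=1)  # weighted sum over agents
        for layer in zip(*agent_weights)
    ]
```

Whether gradients or model weights are passed to such an aggregation function is the principal design choice distinguishing the works surveyed above.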