method to ensure that FL is not disrupted. The participating devices' attributes, including computing resources, trust values, etc., are used as part of the environment in RL. In the aggregation of the global model, devices with high reputation levels are given a greater chance of being selected, which reduces the influence of malicious devices mixed into FL.
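As a simple illustration, the Python sketch below performs reputation-weighted aggregation: devices below a trust threshold are excluded, and the rest contribute in proportion to their reputation. It is our own minimal example, not an algorithm from the cited works; the function name, the threshold, and the use of trust scores in [0, 1] are all assumptions made for illustration.

```python
import numpy as np

def reputation_weighted_aggregate(updates, reputations, threshold=0.5):
    """Reputation-weighted global aggregation (illustrative sketch).

    updates     -- list of local parameter vectors (same-shape np.ndarray)
    reputations -- per-device trust scores in [0, 1]
    threshold   -- devices below this score are excluded from the round
    """
    # Drop devices whose reputation is too low to be trusted this round.
    kept = [(u, r) for u, r in zip(updates, reputations) if r >= threshold]
    if not kept:
        raise ValueError("no device passed the reputation threshold")
    weights = np.array([r for _, r in kept], dtype=float)
    weights /= weights.sum()           # normalize trust scores into weights
    # Devices with higher reputation contribute more to the global model.
    return sum(w * u for (u, _), w in zip(kept, weights))
```

In a full system, the reputation scores themselves would be maintained by the RL agent based on the devices' observed behavior, as described above.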
5.6.3. Lessons learned from categories of FRL
As discussed above, FRL can be divided into two main categories, i.e., HFRL and VFRL. Currently, most of
the existing research is focused on HFRL, while little attention is devoted to VFRL. The reason for this is
that HFRL has obvious application scenarios, where multiple participants have similar decision-making tasks
with individual environments, such as caching allocation [59], offloading optimization [58], and attack monitoring [108]. The participants and the coordinator only need to train a similar model with the same state and action
spaces. Consequently, the algorithm design can be implemented and the training convergence can be verified relatively easily. On the other hand, even though VFRL has a higher degree of technical difficulty at the algorithm design level, it also has a wide range of possible applications. In a multi-agent scenario, for example, a single agent is limited by its ability to observe only part of the environment, whereas the transition of
the environment is determined by the behavior of all the agents. Zhuo et al. [65] assume that agents cannot share their partial observations of the environment and that some agents are unable to receive rewards; they propose a federated Q-network aggregation algorithm between two agents for VFRL. The work in [97] applies both HFRL and VFRL to radio access network slicing. For the same type of service, similar data samples are trained locally at the participating devices, and BSs perform horizontal aggregation to iteratively integrate a cooperative access model. A terminal device can also optimize its selection of base stations and network slices based on the global VFRL model, which aggregates the access features generated by different types of services at an encrypted third party. The method improves a device's ability to select appropriate access points when initiating different types of service requests under privacy-protection constraints. This feasible implementation of VFRL also provides guidance for future research; the HFRL aggregation step, in particular, can be sketched as shown below.
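To make the horizontal aggregation concrete, the following PyTorch sketch shows one communication round of FedAvg-style parameter averaging over local Q-networks, which is possible precisely because all participants share the same state and action spaces. This is our own minimal illustration; the names QNet and horizontal_aggregate are hypothetical, and the code is not taken from [97] or any other cited work.

```python
import copy
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Local Q-network; the same architecture on every participant."""
    def __init__(self, state_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, state):
        return self.net(state)

def horizontal_aggregate(local_models):
    """FedAvg-style aggregation: average same-shape parameters key by key."""
    global_state = copy.deepcopy(local_models[0].state_dict())
    for key in global_state:
        global_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in local_models]).mean(dim=0)
    return global_state

# One communication round: each agent trains locally on its own environment
# (local RL updates omitted here), then the coordinator averages the
# parameters and broadcasts the global model back to all agents.
agents = [QNet() for _ in range(5)]
global_params = horizontal_aggregate(agents)
for agent in agents:
    agent.load_state_dict(global_params)
```

In practice, each agent would perform many local RL updates between rounds, and the averaging could be weighted, e.g., by the amount of local experience.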
6. OPEN ISSUES AND FUTURE RESEARCH DIRECTIONS
As we presented in the previous section, FRL serves an increasingly important role as an enabler of various
applications. While the FRL-based approach possesses many advantages, there are a number of critical open
issues to consider for future implementation. Therefore, this section focuses on several key challenges, including those inherited from FL, such as security and communication issues, as well as those unique to FRL.
Research on tackling these issues offers interesting directions for the future.
6.1. Learning convergence in HFRL
In realistic HFRL scenarios, while the agents perform similar tasks, the inherent dynamics of the different environments in which the agents reside are usually not identically distributed. Slight differences in the stochastic properties of the transition models across agents can cause convergence issues during learning.
One possible method to address this problem is to adjust the frequency of global aggregation, i.e., after each global aggregation, a period of time is left for each agent to fine-tune its local parameters according to its own environment, as sketched below.
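The toy loop below illustrates this aggregate-then-fine-tune schedule in Python. It is a sketch under our own assumptions: the local RL update is stood in for by a noisy gradient step toward each environment's own optimum, with the env_bias vectors modeling the non-identical transition dynamics, and sync_period and fine_tune_steps are the knobs that trade global aggregation off against local adaptation.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(params, env_bias, lr=0.1):
    """Stand-in for one local RL update: a noisy step toward this
    environment's own optimum; env_bias models non-identical dynamics."""
    grad = (params - env_bias) + 0.01 * rng.standard_normal(params.shape)
    return params - lr * grad

def train(n_agents=4, dim=16, rounds=200, sync_period=20, fine_tune_steps=5):
    env_biases = [rng.standard_normal(dim) for _ in range(n_agents)]
    params = [np.zeros(dim) for _ in range(n_agents)]
    for t in range(rounds):
        params = [local_update(p, b) for p, b in zip(params, env_biases)]
        if (t + 1) % sync_period == 0:
            # Global aggregation: average all local parameter vectors.
            global_params = np.mean(params, axis=0)
            params = [global_params.copy() for _ in range(n_agents)]
            # Fine-tuning window: let each agent re-adapt the global model
            # to its own environment before the next aggregation.
            for _ in range(fine_tune_steps):
                params = [local_update(p, b)
                          for p, b in zip(params, env_biases)]
    return params
```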
Apart from the non-identical environment problem, another interesting and important problem is how to leverage FL to make RL algorithms converge better and faster. It is well known that DRL algorithms can be unstable and diverge, especially when off-policy training is combined with function approximation and bootstrapping. In FRL, the training curves of some agents may diverge while others converge, even when the agents are trained in exact replicas of the same environment. By leveraging FL, it is envisioned that we could expedite the training process as well as increase its stability. For example, we could selectively aggregate the parameters of a subset of agents with a larger potential for convergence, and later transfer the converged parameters to all the agents, as in the sketch below.
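A minimal sketch of such convergence-aware selective aggregation is given below. The convergence signal used here, namely low variance of recent episodic returns, is our own simplifying assumption; any divergence detector could be substituted.

```python
import numpy as np

def select_and_aggregate(param_list, return_histories, top_k=2, window=20):
    """Aggregate only the agents whose training looks most convergent.

    param_list       -- per-agent parameter vectors (same-shape np.ndarray)
    return_histories -- per-agent sequences of recent episodic returns
    top_k            -- number of agents included in the aggregate
    window           -- how many recent episodes to judge stability on
    """
    # Proxy for convergence potential: low variance of recent returns.
    scores = np.array([-np.var(h[-window:]) for h in return_histories])
    chosen = np.argsort(scores)[-top_k:]       # indices of stablest agents
    global_params = np.mean([param_list[i] for i in chosen], axis=0)
    # Transfer the aggregated parameters back to every agent.
    return [global_params.copy() for _ in param_list]
```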
To tackle the above problems, several possible solutions proposed for FL algorithms offer useful references. For example, server operators could account for the heterogeneity inherent