


Figure 10. An example of horizontal federated reinforcement learning architecture.


For a better understanding of HFRL, Figure 10 shows an example of an HFRL architecture using the server-client model. The coordinator is responsible for establishing encrypted communication with the agents and for aggregating the shared models. The multiple parallel agents may consist of heterogeneous equipment (e.g., IoT devices, smartphones, and computers) and may be geographically distributed. It is worth noting that there is no specific requirement on the number of agents, and agents are free to join or leave. The basic procedure for conducting HFRL can be summarized as follows; a code sketch of the full loop is given after the list.

• Step 1: The initialization/join process covers two cases: either the agent has no model locally, or it already has a local model. In the first case, the agent directly downloads the shared global model from the coordinator. In the second case, the agent needs to confirm the model type and parameters with the central coordinator.
• Step 2: Each agent independently observes the state of its environment and determines its private strategy based on the local model. The selected action is evaluated through the resulting next state and the received reward. All agents train their respective models in state-action-reward-state (SARS) cycles.
• Step 3: Local model parameters are encrypted and transmitted to the coordinator. Agents may submit their local models at any time, as long as the trigger conditions are met.
• Step 4: The coordinator runs the specified aggregation algorithm to evolve the global federated model. There is no need to wait for submissions from all agents; appropriate aggregation conditions can be formulated depending on the available communication resources.
• Step 5: The coordinator sends the aggregated model back to the agents.
• Step 6: The agents improve their respective models by fusing the federated model.
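
To make these steps concrete, the following sketch runs the full HFRL loop with tabular Q-learning agents. It is a minimal illustration under assumed settings rather than the paper's algorithm: the ChainEnv toy environment, the hyperparameters, and the plain elementwise averaging used for aggregation are all hypothetical, and the encryption of Step 3 is omitted.

    import numpy as np

    N_STATES, N_ACTIONS = 8, 2
    ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.2

    class ChainEnv:
        """Hypothetical toy environment standing in for one agent's private world."""
        def __init__(self, seed):
            self.rng = np.random.default_rng(seed)
            self.s = 0

        def reset(self):
            self.s = 0
            return self.s

        def step(self, a):
            # Action 1 moves right, action 0 moves left; reward 1 on reaching the end.
            self.s = min(max(self.s + (1 if a == 1 else -1), 0), N_STATES - 1)
            done = self.s == N_STATES - 1
            return self.s, float(done), done

    def local_training(q, env, episodes=5, max_steps=100):
        # Step 2: SARS cycles -- observe state, act, receive reward and next state.
        for _ in range(episodes):
            s = env.reset()
            for _ in range(max_steps):
                # Exploration/exploitation via epsilon-greedy action selection.
                if env.rng.random() < EPSILON:
                    a = int(env.rng.integers(N_ACTIONS))
                else:
                    a = int(np.argmax(q[s]))
                s_next, r, done = env.step(a)
                # Q-learning update from the (s, a, r, s') tuple.
                q[s, a] += ALPHA * (r + GAMMA * np.max(q[s_next]) - q[s, a])
                s = s_next
                if done:
                    break
        return q

    def aggregate(models):
        # Step 4: the coordinator averages whatever models were submitted.
        return np.mean(models, axis=0)

    # Step 1: initialization -- all agents start from a shared global model.
    global_q = np.zeros((N_STATES, N_ACTIONS))
    agents = [ChainEnv(seed) for seed in range(3)]

    for _ in range(20):  # federated rounds
        # Steps 2-3: each agent trains locally and submits (encryption omitted here).
        submitted = [local_training(global_q.copy(), env) for env in agents]
        # Steps 4-6: aggregate, send back, and fuse; here fusion is simply adopting
        # the federated model as the starting point of the next round.
        global_q = aggregate(submitted)

Averaging the Q-tables elementwise mirrors FedAvg-style aggregation; a practical deployment would weight each submission by the amount of local experience and exchange the parameters over the encrypted channel described in Step 3.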

Following the above architecture and process, applications suitable for HFRL should exhibit the following characteristics. First, the agents have similar tasks that require making decisions under dynamic environments. In contrast to the general FL setting, the goal of an HFRL-based application is to find the optimal strategy for maximizing future reward. To accomplish the task requirements, the optimal strategy directs the agents to perform certain actions, such as control, scheduling, navigation, etc. Second, the distributed agents maintain independent observations. Each agent can only observe the environment within its own field of view, which does not ensure that the collected data follow the same distribution. Third, it is important to protect the data that each agent collects