Figure 2. High level flow diagram of the DDPG model for a general vehicle i in a platoon.


The reward for vehicle i at time step k is given below, where a, b, and c are system hyperparameters.

r_{i,k} = -\left( \frac{a\,|e_{p_{i,k}}|}{\max(e_{p_{i,k}})} + \frac{b\,|e_{v_{i,k}}|}{\max(e_{v_{i,k}})} + \frac{c\,|u_{i,k}|}{\max(u_{i,k})} + \frac{|\dot{u}_{i,k}|}{2\max(\dot{u}_{i,k})} \right)    (10)
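To make the normalization in Eq. (10) concrete, the sketch below shows one way the reward could be computed per agent and per time step. It assumes e_p and e_v are the position and velocity errors, u is the control input and u_dot its rate of change, and the *_max arguments are their normalization bounds; the function name compute_reward and the argument names are illustrative and not taken from the authors' implementation.

def compute_reward(e_p, e_v, u, u_dot,
                   e_p_max, e_v_max, u_max, u_dot_max,
                   a, b, c):
    # Normalized platooning reward following the form of Eq. (10).
    # a, b, c are the system hyperparameters weighting each term.
    return -(a * abs(e_p) / e_p_max
             + b * abs(e_v) / e_v_max
             + c * abs(u) / u_max
             + abs(u_dot) / (2.0 * u_dot_max))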


               2.3. FRL DDPG algorithm
This section presents the design for implementing the FRL DDPG algorithm on the AV platooning problem.



               2.3.1. DDPG model description
The DDPG algorithm is composed of an actor, $\mu$, and a critic, $Q$. The actor produces actions $u_k \in U$ given some observation $x_k \in X$, and the critic makes judgements on those actions while training using the Bellman equation [12,24]. The actor is updated by the policy gradient [24]. The critic network uses its weights $\theta^Q$ to approximate the optimal action-value function $Q(x, u \mid \theta^Q)$ [24]. The actor network uses weights $\theta^\mu$ to represent the agent's current policy $\mu(x \mid \theta^\mu)$ for the action-value function [24]. The actor $\mu(x) : X \rightarrow U$ maps the observation to the action. Experience replay is used to mitigate the issue of training samples not being independent and identically distributed due to their generation from sequential explorations [24]. Two additional models, the target actor $\mu'$ and target critic $Q'$, are used in DDPG to stabilize the training of the actor and critic networks by updating their parameters slowly according to the target update coefficient $\tau$. A sufficient value of $\tau$ is chosen such that stable training of $\mu$ and $Q$ is observed. Figure 2 provides a high level, simplified overview of how the DDPG algorithm interacts with a single vehicle in a platoon.
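As an illustration of these components, a minimal actor-critic sketch with target networks and the soft update governed by $\tau$ is given below. It is written in PyTorch purely as an example; the layer sizes, the observation dimension (3, e.g. position error, velocity error, acceleration), and the value of tau are assumptions, not the authors' configuration.

import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    # mu(x | theta_mu): maps an observation to a bounded action.
    def __init__(self, obs_dim, act_dim, act_limit=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
        self.act_limit = act_limit

    def forward(self, x):
        return self.act_limit * self.net(x)

class Critic(nn.Module):
    # Q(x, u | theta_Q): scores an observation-action pair.
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, x, u):
        return self.net(torch.cat([x, u], dim=-1))

def soft_update(target, source, tau):
    # Slowly track the online network: theta' <- tau * theta + (1 - tau) * theta'.
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

actor, critic = Actor(obs_dim=3, act_dim=1), Critic(obs_dim=3, act_dim=1)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
# After each gradient step on the actor and critic (illustrative tau value):
# soft_update(target_actor, actor, tau=0.005)
# soft_update(target_critic, critic, tau=0.005)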



               2.3.2. Inter and intra FRL
Modifications to the base DDPG algorithm are needed in order to implement Inter-FRL and Intra-FRL. Implementing FedAvg requires the following (a minimal sketch of the server-side averaging follows the list):
1. An FRL server: responsible for averaging the system parameters for use in a global update
2. Model weight aggregation: storage of each model's weights for use in aggregation
3. Model gradient aggregation: storage of each model's gradients for use in aggregation
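The sketch below illustrates the averaging step the FRL server could perform over the agents' stored weights (or, equally, their stored gradients). The function name federated_average, the get_weights accessor, and the platoon list are assumed names used only for illustration.

import numpy as np

def federated_average(parameter_sets):
    # FedAvg-style aggregation: element-wise mean of each agent's parameters.
    # parameter_sets is a list over agents; each entry is a list of numpy
    # arrays, one per model layer. The same routine can be applied to
    # per-agent gradients for gradient aggregation.
    return [np.mean(layer_group, axis=0) for layer_group in zip(*parameter_sets)]

# Usage (hypothetical accessors): the server collects each vehicle's weights,
# averages them, and broadcasts the result back for the global update.
# global_weights = federated_average([agent.get_weights() for agent in platoon])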