
               defined by the cutoff ratio as seen in Table 3. Currently, Algorithm 1 is synchronous, and the FRL server is
               also synchronous.




               3. EXPERIMENTAL RESULTS
In this section, the experimental setup for applying both Inter and Intra-FRL to the AV platooning environment
is presented. The AV platooning environment and the Inter/Intra-FRL algorithms are implemented in Python 3.7
using TensorFlow 2.




               3.1. Experimental setup
The parameters specific to the AV platoon environment are summarized in Table 1. The time step interval is
0.1 s, and each training episode is composed of 600 time steps. Furthermore, the coefficients a, b, c and d
given in the reward function (10) define how much each component of (10) contributes to the calculation of
the reward. These coefficients may be tuned to strike a balance among the components, leading to better
optimization during training. The coefficients were tuned using a grid search strategy and are set to
a = 0.4, b = c = d = 0.2.

Each DDPG agent consists of a replay buffer and networks for the actor, target actor, critic and target critic.
The actor network contains four layers: an input layer for the state, two hidden layers with 256 and 128 nodes,
respectively, and an output layer. Both hidden layers use batch normalization and the ReLU activation function.
The output layer uses the tanh activation function and is scaled by the high bound for the control output, in
this case 2.5 m/s². The critic network is structured with two separate input layers for state and action.
These two layers are concatenated together and fed into a single hidden layer before the output layer. The layer
with the state input has 48 nodes, the ReLU activation function and batch normalization. The same is applied to
the action layer, but with 256 nodes instead. The post-concatenation layer uses 304 input nodes, followed by
a hidden layer with 128 nodes, again with ReLU activation and batch normalization applied. The output of the
critic uses a linear activation function. Ornstein-Uhlenbeck noise is applied to the model's predicted action.
The structure of the models is presented in Figures 5a and 5b. All layers of the actor and critic networks
except the final ones were initialized within the range [−1/√f, 1/√f], where f is the fan-in of the layer,
whereas the final layer is initialized using a random uniform distribution bounded by [−3 × 10⁻³, 3 × 10⁻³].
Table 2 presents the hyperparameters specific to the DDPG algorithm.
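As an illustration of the architecture described above, the sketch below builds the actor and critic in TensorFlow 2 / Keras. Layer sizes, activations, output scaling and initializers follow the text; the state and action dimensions and all identifier names are assumptions, not the authors' implementation.

```python
import math
import tensorflow as tf
from tensorflow.keras import layers

ACTION_BOUND = 2.5   # high bound of the control output, m/s^2

def fan_in_init(fan_in):
    """Uniform initializer in [-1/sqrt(fan_in), 1/sqrt(fan_in)]."""
    limit = 1.0 / math.sqrt(fan_in)
    return tf.keras.initializers.RandomUniform(-limit, limit)

# Final-layer initializer, bounded by [-3e-3, 3e-3] as stated in the text.
FINAL_INIT = tf.keras.initializers.RandomUniform(-3e-3, 3e-3)

def build_actor(state_dim=4, action_dim=1):
    """Actor: state -> 256 -> 128 -> tanh output scaled to the control bound.
    state_dim/action_dim are assumed defaults, not values from the paper."""
    state = layers.Input(shape=(state_dim,))
    x = layers.Dense(256, kernel_initializer=fan_in_init(state_dim))(state)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dense(128, kernel_initializer=fan_in_init(256))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    action = layers.Dense(action_dim, activation="tanh",
                          kernel_initializer=FINAL_INIT)(x)
    scaled = layers.Lambda(lambda a: a * ACTION_BOUND)(action)
    return tf.keras.Model(state, scaled)

def build_critic(state_dim=4, action_dim=1):
    """Critic: state (48) and action (256) branches, concatenated (304) -> 128 -> Q."""
    state = layers.Input(shape=(state_dim,))
    s = layers.Dense(48, kernel_initializer=fan_in_init(state_dim))(state)
    s = layers.BatchNormalization()(s)
    s = layers.Activation("relu")(s)

    action = layers.Input(shape=(action_dim,))
    a = layers.Dense(256, kernel_initializer=fan_in_init(action_dim))(action)
    a = layers.BatchNormalization()(a)
    a = layers.Activation("relu")(a)

    x = layers.Concatenate()([s, a])                 # 48 + 256 = 304 features
    x = layers.Dense(128, kernel_initializer=fan_in_init(304))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    q = layers.Dense(1, activation="linear", kernel_initializer=FINAL_INIT)(x)
    return tf.keras.Model([state, action], q)
```

The Ornstein-Uhlenbeck exploration noise mentioned above would be added to the actor's output at action-selection time during training; it is not part of the network itself.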
The hyperparameters specific to Inter and Intra-FRL are presented in Table 3. During a training session with
FRL, both local updates and FRL updates with aggregated parameters are applied to each DDPG agent in the
system. FRL updates occur at a given frequency known as the FRL update delay, and they may be terminated at a
specific training episode as defined by the FRL cutoff ratio. The FRL update delay is defined as the time in
seconds between FRL updates during a training episode. The FRL cutoff ratio is the ratio of the number of
episodes during which FRL updates are applied to the total number of episodes in a training session. Note that
the aggregation method denotes whether the model gradients or the model weights are averaged during FRL
training.
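The interaction of the FRL update delay and cutoff ratio can be summarized by the schematic loop below. The helper functions are empty placeholders, and the delay, cutoff ratio and episode count shown are illustrative values rather than those used in the experiments; only the 0.1 s time step and 600-step episodes come from the text.

```python
# Schematic FRL update schedule; helper bodies are placeholders.
DT = 0.1                   # environment time step, s
STEPS_PER_EPISODE = 600
TOTAL_EPISODES = 1000      # illustrative value
FRL_UPDATE_DELAY = 0.5     # illustrative: seconds between FRL updates
FRL_CUTOFF_RATIO = 0.8     # illustrative: fraction of episodes with FRL updates

def local_update(agents):
    """Placeholder for one local DDPG training step per agent."""

def frl_aggregate_and_update(agents):
    """Placeholder: server averages gradients or weights and pushes them back."""

agents = []                                          # the platoon's DDPG agents
steps_between_frl_updates = int(FRL_UPDATE_DELAY / DT)
frl_cutoff_episode = int(FRL_CUTOFF_RATIO * TOTAL_EPISODES)

for episode in range(TOTAL_EPISODES):
    for step in range(STEPS_PER_EPISODE):
        local_update(agents)                         # local updates every step
        if episode < frl_cutoff_episode and step % steps_between_frl_updates == 0:
            frl_aggregate_and_update(agents)         # FRL update at the delay interval
```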


For the purposes of this study, an experiment is defined as a training session for a specific configuration of
hyperparameters, using the algorithm defined in Algorithm 1. During each experiment's training session, model
parameters were trained through either the base DDPG algorithm or FRL, in accordance with Algorithm 1. Once
training has concluded, a simulation is performed using a custom-built evaluator API. The evaluator