
               defined by the cutoff ratio as seen in Table 3. Currently, Algorithm 1 is synchronous, and the FRL server is
               also synchronous.




               3. EXPERIMENTAL RESULTS
In this section, the experimental setup for applying both Inter and Intra-FRL to the AV platooning environment
is presented. The AV platooning environment and the Inter/Intra-FRL algorithms are implemented in Python 3.7
using TensorFlow 2.




               3.1. Experimental setup
The parameters specific to the AV platoon environment are summarized in Table 1. The time step interval is
0.1 s, and each training episode is composed of 600 time steps. Furthermore, the coefficients a, b, c and d
given in the reward function (10) define how much each component of (10) contributes to the calculation of
the reward. These coefficients may be tuned to strike a balance among the components, leading to better
optimization during training. The coefficients were tuned using a grid search strategy and are set to
a = 0.4, b = c = d = 0.2.

Each DDPG agent consists of a replay buffer and networks for the actor, target actor, critic and target critic.
The actor network contains four layers: an input layer for the state, two hidden layers with 256 and 128 nodes,
respectively, and an output layer. Both hidden layers use batch normalization and the ReLU activation function.
The output layer uses the tanh activation function and is scaled by the high bound for the control output, in
this case 2.5 m/s². The critic network is structured with two separate input layers for state and action.
These two layers are concatenated together and fed into a single hidden layer before the output layer. The layer
with the state input has 48 nodes, the ReLU activation function and batch normalization. The same is applied to
the action layer, but with 256 nodes instead. The post-concatenation layer uses 304 input nodes, followed by
a hidden layer with 128 nodes, again with ReLU activation and batch normalization applied. The output of the
critic uses a linear activation function. Ornstein-Uhlenbeck noise is applied to the model's predicted action.
The structure of the models is presented in Figures 5a and 5b. All layers of the actor and critic networks
except the final ones were initialized within the range [−1/√f, 1/√f], where f is the fan-in of the layer,
whereas the final layer is initialized using a random uniform distribution bounded by [−3 × 10⁻³, 3 × 10⁻³].
Table 2 presents the hyperparameters specific to the DDPG algorithm.
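As an illustration of the architecture described above, the sketch below builds the actor and critic in TensorFlow 2 / Keras. Layer sizes, activations, output scaling and initializers follow the text; the state and action dimensions and all identifier names are assumptions, not the authors' implementation.

```python
import math
import tensorflow as tf
from tensorflow.keras import layers

ACTION_BOUND = 2.5   # high bound of the control output, m/s^2

def fan_in_init(fan_in):
    """Uniform initializer in [-1/sqrt(fan_in), 1/sqrt(fan_in)]."""
    limit = 1.0 / math.sqrt(fan_in)
    return tf.keras.initializers.RandomUniform(-limit, limit)

# Final-layer initializer, bounded by [-3e-3, 3e-3] as stated in the text.
FINAL_INIT = tf.keras.initializers.RandomUniform(-3e-3, 3e-3)

def build_actor(state_dim=4, action_dim=1):
    """Actor: state -> 256 -> 128 -> tanh output scaled to the control bound.
    state_dim/action_dim are assumed defaults, not values from the paper."""
    state = layers.Input(shape=(state_dim,))
    x = layers.Dense(256, kernel_initializer=fan_in_init(state_dim))(state)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dense(128, kernel_initializer=fan_in_init(256))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    action = layers.Dense(action_dim, activation="tanh",
                          kernel_initializer=FINAL_INIT)(x)
    scaled = layers.Lambda(lambda a: a * ACTION_BOUND)(action)
    return tf.keras.Model(state, scaled)

def build_critic(state_dim=4, action_dim=1):
    """Critic: state (48) and action (256) branches, concatenated (304) -> 128 -> Q."""
    state = layers.Input(shape=(state_dim,))
    s = layers.Dense(48, kernel_initializer=fan_in_init(state_dim))(state)
    s = layers.BatchNormalization()(s)
    s = layers.Activation("relu")(s)

    action = layers.Input(shape=(action_dim,))
    a = layers.Dense(256, kernel_initializer=fan_in_init(action_dim))(action)
    a = layers.BatchNormalization()(a)
    a = layers.Activation("relu")(a)

    x = layers.Concatenate()([s, a])                 # 48 + 256 = 304 features
    x = layers.Dense(128, kernel_initializer=fan_in_init(304))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    q = layers.Dense(1, activation="linear", kernel_initializer=FINAL_INIT)(x)
    return tf.keras.Model([state, action], q)
```

The Ornstein-Uhlenbeck exploration noise mentioned above would be added to the actor's output at action-selection time during training; it is not part of the network itself.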
The hyperparameters specific to Inter and Intra-FRL are presented in Table 3. During a training session with
FRL, both local updates and FRL updates with aggregated parameters are applied to each DDPG agent in the
system. FRL updates occur at a given frequency known as the FRL update delay, and they may be terminated at a
specific training episode as defined by the FRL cutoff ratio. The FRL update delay is defined as the time in
seconds between FRL updates during a training episode. The FRL cutoff ratio is the ratio of the number of
episodes during which FRL updates are applied to the total number of episodes in a training session. Note that
the aggregation method denotes whether the model gradients or the model weights are averaged during FRL
training.
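The interaction of the FRL update delay and cutoff ratio can be summarized by the schematic loop below. The helper functions are empty placeholders, and the delay, cutoff ratio and episode count shown are illustrative values rather than those used in the experiments; only the 0.1 s time step and 600-step episodes come from the text.

```python
# Schematic FRL update schedule; helper bodies are placeholders.
DT = 0.1                   # environment time step, s
STEPS_PER_EPISODE = 600
TOTAL_EPISODES = 1000      # illustrative value
FRL_UPDATE_DELAY = 0.5     # illustrative: seconds between FRL updates
FRL_CUTOFF_RATIO = 0.8     # illustrative: fraction of episodes with FRL updates

def local_update(agents):
    """Placeholder for one local DDPG training step per agent."""

def frl_aggregate_and_update(agents):
    """Placeholder: server averages gradients or weights and pushes them back."""

agents = []                                          # the platoon's DDPG agents
steps_between_frl_updates = int(FRL_UPDATE_DELAY / DT)
frl_cutoff_episode = int(FRL_CUTOFF_RATIO * TOTAL_EPISODES)

for episode in range(TOTAL_EPISODES):
    for step in range(STEPS_PER_EPISODE):
        local_update(agents)                         # local updates every step
        if episode < frl_cutoff_episode and step % steps_between_frl_updates == 0:
            frl_aggregate_and_update(agents)         # FRL update at the delay interval
```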


For the purposes of this study, an experiment is defined as a training session for a specific configuration of
hyperparameters, using the algorithm defined in Algorithm 1. During each experiment's training session, model
parameters were trained through either the base DDPG algorithm or FRL, in accordance with Algorithm 1. Once
training has concluded, a simulation is performed using a custom-built evaluator API. The evaluator