

2.2.2. Q-learning
Q-learning [34] is a model-free RL algorithm that learns the Q-values, i.e., the expected reward for a state. Rather than attempting to model the environment, Q-learning aims to predict preferable actions to take in a specific state by extending the Bellman Equation as




$$Q^{\text{new}}(s, a) = Q(s, a) + \alpha \cdot \big( r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \big).$$

$Q^{\text{new}}(s, a)$ are the new Q-values of the state-action pair. These new values are the previous Q-values plus $\alpha$, a learning rate, multiplied by the temporal difference (TD). The TD is the current reward of the state-action pair, plus the discount factor $\gamma$ multiplied by the maximum reward that can be earned from the next state, minus the current value of the state-action pair. The Q-function aims to update the action that should be taken for a given state to maximize the cumulative reward.
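The update above can be written directly as a tabular algorithm. The following is a minimal sketch of it in Python; the state and action counts, hyperparameter values, and the epsilon-greedy helper are illustrative assumptions rather than values from the cited work.

```python
# Minimal sketch of the tabular Q-learning update described above.
import numpy as np

n_states, n_actions = 16, 4            # assumed sizes, e.g., a small Grid World
alpha, gamma, epsilon = 0.1, 0.99, 0.1 # assumed learning rate, discount, exploration rate

Q = np.zeros((n_states, n_actions))    # lookup table of Q-values

def update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # best value reachable from the next state
    td_error = td_target - Q[s, a]              # temporal difference (TD)
    Q[s, a] += alpha * td_error

def act(s):
    """Epsilon-greedy action selection over the current Q-table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)     # explore
    return int(np.argmax(Q[s]))                 # exploit
```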


               2.2.3. Deep Q-learning
Common techniques for Q-learning include creating a Deep Neural Network (NN) to predict the Q-function, followed by optimization via backpropagation. This model is known as a Deep Q-Learning Network [29] and is a particular case of Q-learning that relies on a Deep NN architecture. This allows the agent to learn continuous states, learn continuous values in discrete states, and generalize to states not yet seen. For example, a simple Q-learning algorithm could involve creating a lookup table where each element is a state of an environment such as Grid World [6]. However, such an approach does not generalize to more complex environments such as StarCraft II [35]; the practically countless number of possible states makes storing them in a table extremely inefficient, and Deep Q-learning handles these tasks elegantly.
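The sketch below illustrates the idea of replacing the lookup table with a neural network that maps a continuous state vector to one Q-value per discrete action. The layer sizes, input dimensions, and use of PyTorch are assumptions made for illustration, not the architecture of any particular cited work.

```python
# Minimal sketch of a Q-network: a fully connected NN approximating the Q-function.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection: pick the action with the largest predicted Q-value.
q_net = QNetwork(state_dim=8, n_actions=4)   # assumed dimensions
state = torch.randn(1, 8)                    # dummy continuous state
action = q_net(state).argmax(dim=1).item()
```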


2.2.4. Improvements to Q-learning and Deep Q-learning
Two very common problems with Q-learning are sample inefficiency and over-estimation of the Q-values. There are several methods to address these problems. Here, we discuss experience replay, double Q-learning, and actor-critic architectures.

Experience replay [36] allows Q-learning agents to be more sample efficient by storing transitions, or collections of states, actions, rewards, and next states. Instead of learning exclusively from the current state, agents sample prior experiences. This allows the agent to revisit states it has already visited and learn from them again, speeding up convergence of the agent's current policy. Popular and powerful extensions to experience replay for improving sample efficiency include prioritized experience replay (PER) [37] and hindsight experience replay (HER) [38].
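A uniform replay buffer can be sketched as below; the capacity and batch size are illustrative assumptions. PER would instead weight sampling by TD error, and HER would relabel stored transitions with alternative goals.

```python
# Minimal sketch of a uniform experience replay buffer storing
# (state, action, reward, next_state, done) transitions.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        # Uniform random sampling lets the agent revisit past experience
        # instead of learning only from the most recent step.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```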

Double Deep Q-Learning (DDQN) [39] is another technique to aid the stability of training. A second model, which is a copy of the original network, is kept offline (frozen) for a set number of training iterations. The target Q-values are then sampled from this target network and used to compute the Q-values for the online (unfrozen) network. One issue with DDQN is the tendency of the online model to move too aggressively towards optimal performance, which negatively impacts the stability of training. To address this, a soft update, commonly known as Polyak updating [40], is performed on the target network using the weights of the online model multiplied by a small value. This form of weight regularization allows the target model to improve slowly over time and prevents harsh updates to the learning.
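The target-network and Polyak-update ideas can be sketched as follows; the networks are assumed to be PyTorch modules like the QNetwork above, and the value of tau, the DDQN target helper, and the omission of terminal-state masking are simplifying assumptions for illustration.

```python
# Minimal sketch of a frozen target network with Polyak (soft) updates,
# plus the double-DQN target computation that reduces Q-value over-estimation.
import copy
import torch

def make_target(online_net: torch.nn.Module) -> torch.nn.Module:
    target_net = copy.deepcopy(online_net)
    for p in target_net.parameters():
        p.requires_grad_(False)                 # target stays offline (frozen)
    return target_net

@torch.no_grad()
def polyak_update(online_net, target_net, tau: float = 0.005):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)

@torch.no_grad()
def ddqn_target(online_net, target_net, reward, next_state, gamma: float = 0.99):
    # The online network selects the action, the target network evaluates it.
    # Terminal-state masking is omitted for brevity.
    best_action = online_net(next_state).argmax(dim=1, keepdim=True)
    next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * next_q
```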


An actor-critic RL architecture consists of an actor that acts out the current policy and estimates the current value function, and a critic that evaluates the current policy and estimates the policy gradient using a TD evaluation [41]. The system also encompasses a supervisor that controls the balance between exploration (performed by the actor) and exploitation (performed by the critic).
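A minimal actor-critic pair (without the supervisor component) might look like the sketch below: the critic's TD error serves both as its own regression target and as the advantage signal for the actor's policy-gradient update. The network sizes and the separate (non-shared) networks are assumptions for illustration.

```python
# Minimal sketch of an actor-critic pair trained with a TD evaluation.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, state):
        # The policy: a categorical distribution over discrete actions.
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)      # scalar state-value estimate

def actor_critic_losses(actor, critic, state, action, reward, next_state, gamma=0.99):
    td_target = reward + gamma * critic(next_state).detach()
    td_error = td_target - critic(state)        # TD evaluation of the current policy
    critic_loss = td_error.pow(2).mean()        # critic regresses toward the TD target
    actor_loss = -(actor(state).log_prob(action) * td_error.detach()).mean()
    return actor_loss, critic_loss
```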