bases

$$\text{If } x \text{ is } A_i \text{ then } y \text{ is } B_i$$

with antecedents $A_i$ and consequences $B_i$, $i = 1, \dots, n$ being fuzzy sets.

Mamdani fuzzy systems were introduced in [2], with the goal of developing fuzzy rule-based control systems. Mamdani fuzzy systems defined by the above rule base evaluate a fuzzy output as

$$\mu_{B'}(y) = \bigvee_{i=1}^{n} \mu_{A_i}(x) \wedge \mu_{B_i}(y).$$
The output of the system is then calculated using a defuzzification method, for example the centroid

$$y(B') = \frac{\int \mu_{B'}(y) \cdot y \, dy}{\int \mu_{B'}(y) \, dy}.$$
               To model more complex systems, one can define systems with multiple antecedents and connect them with an
               aggregation operator.
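
For concreteness, the following sketch is one possible way to evaluate a single-antecedent Mamdani system numerically. The triangular membership functions, the two-rule base, and the discretized output universe are illustrative assumptions, not the systems used in this paper.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b (illustrative shape)."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Hypothetical two-rule Mamdani base: "If x is A_i then y is B_i"
y_grid = np.linspace(0.0, 10.0, 501)           # discretized output universe
antecedents = [lambda x: tri(x, 0, 2, 4),       # mu_A1(x)
               lambda x: tri(x, 3, 6, 9)]       # mu_A2(x)
consequents = [tri(y_grid, 0, 3, 6),            # mu_B1(y) sampled on the grid
               tri(y_grid, 4, 7, 10)]           # mu_B2(y) sampled on the grid

def mamdani_output(x):
    # mu_B'(y) = max_i [ mu_Ai(x) AND mu_Bi(y) ], using min as the t-norm
    clipped = [np.minimum(mu_A(x), mu_B) for mu_A, mu_B in zip(antecedents, consequents)]
    mu_out = np.maximum.reduce(clipped)
    # discrete centroid defuzzification: sum(mu * y) / sum(mu) on the uniform grid
    return np.sum(mu_out * y_grid) / np.sum(mu_out)

print(mamdani_output(2.5))
```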
TSK fuzzy systems have been introduced in [3,32] and are based on rules with a fuzzy antecedent and a singleton consequence

$$\text{If } x \text{ is } A_i \text{ then } y = c_i.$$

In this paper, we are using a rule with a singleton consequence; however, TSK systems can also be defined with linear or higher-order consequences. The output of a TSK fuzzy system can be calculated as

$$y(x) = \frac{\sum_{i=1}^{n} \mu_{A_i}(x) \cdot c_i}{\sum_{i=1}^{n} \mu_{A_i}(x)}.$$
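
As a small illustration of this weighted-average evaluation, the sketch below assumes hypothetical Gaussian antecedents and singleton consequences $c_i$; none of these values come from the paper.

```python
import numpy as np

def gauss(x, center, sigma):
    """Gaussian membership function (illustrative antecedent shape)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

# Hypothetical rule base: "If x is A_i then y = c_i"
centers = np.array([0.0, 5.0, 10.0])   # antecedent centers
sigmas  = np.array([2.0, 2.0, 2.0])    # antecedent widths
c       = np.array([-1.0, 0.0, 1.0])   # singleton consequences

def tsk_output(x):
    w = gauss(x, centers, sigmas)       # firing strengths mu_Ai(x)
    return np.sum(w * c) / np.sum(w)    # weighted average of the consequences

print(tsk_output(3.0))
```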
In the case of multiple antecedents, the rule base is modified, for example, as

$$\text{If } x_1 \text{ is } A_i \text{ and } x_2 \text{ is } B_i \text{ then } y = c_i,$$

with the evaluation of the output being

$$y(x_1, x_2) = \frac{\sum_{i=1}^{n} \mu_{A_i}(x_1) \cdot \mu_{B_i}(x_2) \cdot c_i}{\sum_{i=1}^{n} \mu_{A_i}(x_1) \cdot \mu_{B_i}(x_2)}.$$
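
The two-antecedent case differs only in that each rule's firing strength is the product of the two memberships. A minimal sketch, under the same hypothetical assumptions as above:

```python
import numpy as np

def gauss(x, center, sigma=2.0):
    """Gaussian membership function (illustrative antecedent shape)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

# Hypothetical two-antecedent TSK base: "If x1 is A_i and x2 is B_i then y = c_i"
a_centers = np.array([0.0, 5.0])   # centers of A_1, A_2
b_centers = np.array([1.0, 4.0])   # centers of B_1, B_2
c = np.array([0.0, 1.0])           # singleton consequences

def tsk_output_2d(x1, x2):
    # firing strength of each rule: mu_Ai(x1) * mu_Bi(x2)
    w = gauss(x1, a_centers) * gauss(x2, b_centers)
    return np.sum(w * c) / np.sum(w)

print(tsk_output_2d(2.0, 3.0))
```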
               In the present paper, we use TSK fuzzy systems in the context of RL.

               2.2. Reinforcement learning
               2.2.1. Bellman’s Equation
The Bellman equation, named after Richard E. Bellman for his work in [33], is an equation for controlling systems based on states, rewards, and actions, which altogether allow us to compute the value of a state, i.e., the total expected reward in a given state, given by

$$V(s) = \max_{a} \left( R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \right).$$
$V(s)$ is called the value of a state $s$. This function gets updated as the agent learns more information about the state, especially which actions ($a$) yield the highest cumulative reward. The reward $R$ is given for a state-action pair and is not bound by any restrictions. $\gamma$ is a discount factor, typically specified between 0 and 1, with a common value of 0.99. $\gamma$ is tuned to balance "near" and "far" rewards, with lower values prioritizing immediate rewards and higher values valuing future rewards more. $P$ is the probability of ending in state $s'$ (the next state) by taking action $a$ when starting from state $s$. $V(s')$ is the value of the next state. This recursive definition allows policies to learn which actions to take in a given state to maximize the cumulative reward of future states.
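
As a rough sketch of how this equation is applied iteratively (value iteration), the example below uses a small, randomly generated tabular MDP; the state and action counts and the random transition model are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical tabular MDP: 3 states, 2 actions
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'], rows sum to 1
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # R(s, a)
gamma = 0.99                                                      # discount factor

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman update: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s, a, s') V(s') ]
    Q = R + gamma * (P @ V)          # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)
```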