Figure 7. Snapshots of states for the CartPole [51] environment. (a) Example of the starting position; (b) example of a state that would terminate the episode.

Figure 8. Flowchart for the ANFIS model. The NN output undergoes fuzzy rule evaluation and normalization. Separately, the network output is transformed according to learned 1st-order parameters and multiplied by the normalized rule outputs. These values are summed and given as output.

The NN output is expanded in dimensions to match the number of rules. This is then passed through the fuzzy rules to calculate their firing levels prior to normalization along each output dimension. Then, in the case of a 1st-order ANFIS, the input is multiplied by a parameter and a bias is added; if the 0th-order ANFIS is used, the input is passed along to the next layer unchanged. Next, the normalized firing levels are multiplied by the inputs to form the rule evaluation. Finally, the output is summed along each rule base to form the final output with the correct dimension.
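To make this data flow concrete, below is a minimal NumPy sketch of the forward pass just described. The Gaussian membership functions, the product T-norm, and all parameter names and shapes are illustrative assumptions; the text and Figure 8 fix only the order of operations (expand, fire, normalize, transform, weight, sum).

    import numpy as np

    def anfis_forward(x, centers, widths, weights, biases, first_order=True):
        # x: (d,) input vector; centers, widths: (r, d) Gaussian membership
        # parameters (assumed form); weights: (r, d) and biases: (r,) are the
        # learned 1st-order consequent parameters. Returns a (d,) output.
        r, d = centers.shape

        # 1. Expand the input in dimensions to match the number of rules.
        x_rules = np.broadcast_to(x, (r, d))

        # 2. Fuzzy rule evaluation: per-rule firing levels (product T-norm
        #    over Gaussian memberships; an assumed membership form).
        mu = np.exp(-0.5 * ((x_rules - centers) / widths) ** 2)
        firing = mu.prod(axis=1)                      # (r,)

        # 3. Normalize the firing levels over the rule base.
        w_norm = firing / (firing.sum() + 1e-12)      # (r,)

        # 4. Consequents: 1st order multiplies the input by a parameter and
        #    adds a bias; 0th order passes the input along unchanged.
        f = weights * x_rules + biases[:, None] if first_order else x_rules

        # 5. Weight each rule's output by its normalized firing level and
        #    sum along the rule base to recover the output dimension.
        return (w_norm[:, None] * f).sum(axis=0)      # (d,)

The small constant added to the denominator is only a numerical guard against the degenerate case where no rule fires.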



               3.3.2. CartPole-v1
The CartPole-v1 environment has four input variables: the position of the cart, the velocity of the cart, the angle of the pole, and the angular velocity of the pole. The expected output has two actions: move left and move right. The job of the models is to learn and maximize the Q-values for each state to achieve the maximum possible reward of 500, where, in each frame, the agent receives a reward of 1 if the cart and pole are within the minimum and maximum values for each respective variable. A reward of 500 means that the agent is able to balance the pole for 500 frames. To perform an action in the environment, we take the action with the maximum Q-value.
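As an illustration of this interaction loop, here is a minimal sketch using the Gymnasium API (the original experiments may use the older gym interface); q_values is a hypothetical placeholder for the trained model, returning one Q-value per action.

    import gymnasium as gym
    import numpy as np

    rng = np.random.default_rng(0)

    def q_values(obs):
        # Hypothetical stand-in for the trained model (NN or ANFIS);
        # returns one Q-value per action. Random here purely for illustration.
        return rng.random(2)

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)
    total_reward, done = 0.0, False
    while not done:
        # Greedy policy: perform the action with the maximum Q-value.
        action = int(np.argmax(q_values(obs)))
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward          # +1 per frame within the allowed bounds
        done = terminated or truncated  # pole/cart out of range, or 500-step cap
    env.close()
    print(total_reward)                 # 500 corresponds to a fully balanced episode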