bases

$$\text{If } x \text{ is } A_i \text{ then } y \text{ is } B_i$$

with antecedents $A_i$ and consequences $B_i$, $i = 1, \dots, n$ being fuzzy sets.

Mamdani fuzzy systems were introduced in [2], with the goal of developing fuzzy rule-based control systems. Mamdani fuzzy systems defined by the above rule base evaluate a fuzzy output as

$$\mu_{B'}(y) = \bigvee_{i=1}^{n} \mu_{A_i}(x) \wedge \mu_{B_i}(y).$$
The output of the system is then calculated using a defuzzification method, for example the centroid

$$y(B') = \frac{\int \mu_{B'}(y) \cdot y \, dy}{\int \mu_{B'}(y) \, dy}.$$
               To model more complex systems, one can define systems with multiple antecedents and connect them with an
               aggregation operator.
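
For concreteness, the following sketch is one possible way to evaluate a single-antecedent Mamdani system numerically. The triangular membership functions, the two-rule base, and the discretized output universe are illustrative assumptions, not the systems used in this paper.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b (illustrative shape)."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Hypothetical two-rule Mamdani base: "If x is A_i then y is B_i"
y_grid = np.linspace(0.0, 10.0, 501)           # discretized output universe
antecedents = [lambda x: tri(x, 0, 2, 4),       # mu_A1(x)
               lambda x: tri(x, 3, 6, 9)]       # mu_A2(x)
consequents = [tri(y_grid, 0, 3, 6),            # mu_B1(y) sampled on the grid
               tri(y_grid, 4, 7, 10)]           # mu_B2(y) sampled on the grid

def mamdani_output(x):
    # mu_B'(y) = max_i [ mu_Ai(x) AND mu_Bi(y) ], using min as the t-norm
    clipped = [np.minimum(mu_A(x), mu_B) for mu_A, mu_B in zip(antecedents, consequents)]
    mu_out = np.maximum.reduce(clipped)
    # discrete centroid defuzzification: sum(mu * y) / sum(mu) on the uniform grid
    return np.sum(mu_out * y_grid) / np.sum(mu_out)

print(mamdani_output(2.5))
```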
TSK fuzzy systems have been introduced in [3,32] and are based on rules with a fuzzy antecedent and a singleton consequence

$$\text{If } x \text{ is } A_i \text{ then } y = c_i.$$

In this paper, we are using a rule with a singleton consequence; however, TSK systems can also be defined with linear or higher-order consequences. The output of a TSK fuzzy system can be calculated as

$$y(x) = \frac{\sum_{i=1}^{n} \mu_{A_i}(x) \cdot c_i}{\sum_{i=1}^{n} \mu_{A_i}(x)}.$$
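
As a small illustration of this weighted-average evaluation, the sketch below assumes hypothetical Gaussian antecedents and singleton consequences $c_i$; none of these values come from the paper.

```python
import numpy as np

def gauss(x, center, sigma):
    """Gaussian membership function (illustrative antecedent shape)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

# Hypothetical rule base: "If x is A_i then y = c_i"
centers = np.array([0.0, 5.0, 10.0])   # antecedent centers
sigmas  = np.array([2.0, 2.0, 2.0])    # antecedent widths
c       = np.array([-1.0, 0.0, 1.0])   # singleton consequences

def tsk_output(x):
    w = gauss(x, centers, sigmas)       # firing strengths mu_Ai(x)
    return np.sum(w * c) / np.sum(w)    # weighted average of the consequences

print(tsk_output(3.0))
```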
In the case of multiple antecedents, the rule base is modified, for example, as

$$\text{If } x_1 \text{ is } A_i \text{ and } x_2 \text{ is } B_i \text{ then } y = c_i,$$

with the evaluation of the output being

$$y(x_1, x_2) = \frac{\sum_{i=1}^{n} \mu_{A_i}(x_1) \cdot \mu_{B_i}(x_2) \cdot c_i}{\sum_{i=1}^{n} \mu_{A_i}(x_1) \cdot \mu_{B_i}(x_2)}.$$
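
The two-antecedent case differs only in that each rule's firing strength is the product of the two memberships. A minimal sketch, under the same hypothetical assumptions as above:

```python
import numpy as np

def gauss(x, center, sigma=2.0):
    """Gaussian membership function (illustrative antecedent shape)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

# Hypothetical two-antecedent TSK base: "If x1 is A_i and x2 is B_i then y = c_i"
a_centers = np.array([0.0, 5.0])   # centers of A_1, A_2
b_centers = np.array([1.0, 4.0])   # centers of B_1, B_2
c = np.array([0.0, 1.0])           # singleton consequences

def tsk_output_2d(x1, x2):
    # firing strength of each rule: mu_Ai(x1) * mu_Bi(x2)
    w = gauss(x1, a_centers) * gauss(x2, b_centers)
    return np.sum(w * c) / np.sum(w)

print(tsk_output_2d(2.0, 3.0))
```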
               In the present paper, we use TSK fuzzy systems in the context of RL.

               2.2. Reinforcement learning
               2.2.1. Bellman’s Equation
The Bellman equation, named after Richard E. Bellman for his work in [33], is an equation for controlling systems based on states, rewards, and actions, which altogether allow us to compute the value of a state, i.e., the total expected reward in a given state, given by

$$V(s) = \max_{a} \left( R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \right).$$
$V(s)$ is called the value of a state $s$. This function gets updated as the agent learns more information about the state, especially which actions ($a$) yield the highest cumulative reward. The reward $R$ is given for a state-action pair and is not bound by any restrictions. $\gamma$ is a discount factor, typically specified between 0 and 1, with a common value of 0.99. $\gamma$ is tuned to balance "near" and "far" rewards, with lower values prioritizing immediate rewards and higher values valuing future rewards more. $P$ is the probability of ending in state $s'$ (the next state) by taking action $a$ when starting from state $s$. $V(s')$ is the value of the next state. This recursive definition allows policies to learn which actions to take in a given state to maximize the cumulative reward of future states.
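
As a rough sketch of how this equation is applied iteratively (value iteration), the example below uses a small, randomly generated tabular MDP; the state and action counts and the random transition model are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical tabular MDP: 3 states, 2 actions
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'], rows sum to 1
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # R(s, a)
gamma = 0.99                                                      # discount factor

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman update: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s, a, s') V(s') ]
    Q = R + gamma * (P @ V)          # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)
```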