Basic Concepts#
Law of Total Probability#
Discrete Random Variables#
$$P(A)=\sum_{i = 1}^{n}P(B_i)P(A|B_i),\quad E(A)=\sum_{i = 1}^{n}P(B_i)E(A|B_i)$$
Continuous Random Variables#
Given conditions:
$X$ is a continuous random variable with probability density function $f(x)$.
$Y$ is another random variable with values $y$.
The conditional probability density function of $X$ given $Y = y$ is $f_{X|Y}(x|y)$.
The marginal probability density function of $Y$ is $f_Y(y)$.
Formula:
$$f_X(x)=\int_{-\infty}^{\infty}f_Y(y)\cdot f_{X|Y}(x|y)\,dy,\quad E(X)=\int_{-\infty}^{\infty}f_Y(y)\cdot E(X|Y = y)\,dy$$
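As a quick numerical sanity check (not part of the original notes), the sketch below verifies the discrete law of total probability and total expectation on a hypothetical two-component mixture; the partition probabilities and conditional distributions are made up for illustration.

```python
import numpy as np

# Hypothetical partition: B1 with probability 0.3, B2 with probability 0.7.
p_B = np.array([0.3, 0.7])

# Conditional distribution of a discrete random variable X given each B_i.
x_values = np.array([0.0, 1.0, 2.0])
p_X_given_B = np.array([[0.5, 0.3, 0.2],   # P(X = x | B1)
                        [0.1, 0.4, 0.5]])  # P(X = x | B2)

# Law of total probability: P(X = x) = sum_i P(B_i) P(X = x | B_i)
p_X = p_B @ p_X_given_B
# Law of total expectation: E(X) = sum_i P(B_i) E(X | B_i)
E_X_total = p_B @ (p_X_given_B @ x_values)

# Compare with the expectation computed directly from the marginal distribution.
E_X_direct = p_X @ x_values
print(p_X, E_X_total, E_X_direct)   # the two expectations coincide
```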
State and State space#
A state $s$ can include multi-dimensional quantities such as velocity, level, temperature, and other variables.
State: The status of the agent with respect to the environment.
State space: The set of all states, denoted as $\mathcal{S} = \{s_i\}_{i = 1}^{9}$.
Action and Action space#
Action: for each state, there are five possible actions: $a_1,\cdots,a_5$.
$a_1$: move upwards; $a_2$: move rightwards;
$a_3$: move downwards; $a_4$: move leftwards;
$a_5$: stay unchanged.
Action space: the set of all possible actions of a state, $\mathcal{A}(s_i)=\{a_i\}_{i = 1}^{5}$.
State Transition Probability#
State transition defines the interaction with the environment.
State Transition: When an agent takes an action and moves from one state to another, this process is called state transition.
If action $a_2$ (move rightwards) is chosen at state $s_1$, the next state is $s_2$, denoted as $s_1\xrightarrow{a_2}s_2$.
If action $a_1$ (move upwards) is chosen at state $s_1$, the next state remains $s_1$, denoted as $s_1\xrightarrow{a_1}s_1$.
State Transition Probability: Use probability to describe state transition.
$p(s_2|s_1, a_2) = 1$, which means the probability of transitioning from state $s_1$ to state $s_2$ when action $a_2$ is taken is $1$.
$p(s_i|s_1, a_2) = 0$ for all $i\neq2$; that is, the probability of transitioning from state $s_1$ to any state other than $s_2$ when taking action $a_2$ is $0$.
This is a deterministic case. State transition can also be stochastic, for example, $p(s_1|s_1, a_2) = 0.5$ and $p(s_5|s_1, a_2) = 0.5$.
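A minimal sketch of how these transition probabilities might be stored in code, assuming a simple dictionary representation; the state and action names follow the grid-world example above, and the stochastic entry mirrors the $p(s_1|s_1,a_2)=0.5$, $p(s_5|s_1,a_2)=0.5$ case.

```python
# Transition model p(s' | s, a) as a dict: (state, action) -> {next_state: probability}.
P = {
    ("s1", "a2"): {"s2": 1.0},   # deterministic: moving rightwards from s1 leads to s2
    ("s1", "a1"): {"s1": 1.0},   # moving upwards from s1 keeps the agent at s1
}

# A stochastic variant of the same transition (as in the example above):
P_stochastic = {
    ("s1", "a2"): {"s1": 0.5, "s5": 0.5},
}

def p(s_next, s, a, model=P):
    """Return p(s'|s, a); next states that are not listed have probability 0."""
    return model.get((s, a), {}).get(s_next, 0.0)

print(p("s2", "s1", "a2"))                 # 1.0
print(p("s5", "s1", "a2", P_stochastic))   # 0.5
```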
Policy#
Definition: a policy tells the agent which actions to take at a given state.
Intuitive Representation: arrows in a grid-like illustration can be used to demonstrate a policy. For example, the arrows in the provided grid show the recommended actions at different states.
Mathematical Representation: use conditional probability (take state $s_1$ as an example). The left grid corresponds to a deterministic policy and the right grid to a stochastic policy.
Left: $\pi(a_1|s_1) = 0,~\pi(a_2|s_1) = 1,~\pi(a_3|s_1) = 0,~\pi(a_4|s_1) = 0,~\pi(a_5|s_1) = 0$.
Right: $\pi(a_1|s_1) = 0,~\pi(a_2|s_1) = 0.5,~\pi(a_3|s_1) = 0.5,~\pi(a_4|s_1) = 0,~\pi(a_5|s_1) = 0$.
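The two policies at $s_1$ can be written as probability tables and sampled from; this is only an illustrative sketch, with action names taken from the list above.

```python
import random

# pi(a | s1) for the deterministic (left) and stochastic (right) policies above.
pi_deterministic = {"a1": 0.0, "a2": 1.0, "a3": 0.0, "a4": 0.0, "a5": 0.0}
pi_stochastic    = {"a1": 0.0, "a2": 0.5, "a3": 0.5, "a4": 0.0, "a5": 0.0}

def sample_action(pi):
    """Sample an action a with probability pi(a|s)."""
    actions, probs = zip(*pi.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi_deterministic))  # always 'a2'
print(sample_action(pi_stochastic))     # 'a2' or 'a3', each with probability 0.5
```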
Reward#
Definition: a reward is a real number obtained after an agent takes an action.
A positive reward encourages taking such actions, while a negative reward punishes taking them.
Zero reward implies no punishment. In some cases, a positive value can even represent punishment.
Representation Method: use conditional probability for the mathematical description.
Example at state $s_1$: $p(r = - 1|s_1,a_1)=1$ and $p(r\neq - 1|s_1,a_1)=0$.
Role of Reward: reward can be regarded as a human-machine interface. It is used to guide the agent to behave as expected.
Trajectory and Return#
Trajectory: a trajectory is a state-action-reward chain. It shows the path an agent takes: the sequence of states it visits, the actions it takes, and the rewards it gets.
Return: Also called total rewards or cumulative rewards. It’s the sum of all the rewards collected along a trajectory.
Example (left): starting from $s_1$,
the trajectory is
$$s_1\xrightarrow[a_2]{r = 0}s_2\xrightarrow[a_3]{r = 0}s_5\xrightarrow[a_3]{r = 0}s_8\xrightarrow[a_2]{r = 1}s_9.$$
Return $= 0 + 0 + 0 + 1 = 1$.
Example (right): starting from $s_1$,
the trajectory is
$$s_1\xrightarrow[a_3]{r = 0}s_4\xrightarrow[a_3]{r=-1}s_7\xrightarrow[a_2]{r = 0}s_8\xrightarrow[a_2]{r = 1}s_9.$$
Return $= 0 - 1 + 0 + 1 = 0$.
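A small helper, sketched here only for illustration, reproduces the two returns above by summing the rewards along each trajectory.

```python
def undiscounted_return(rewards):
    """Return of a finite trajectory: the plain sum of the collected rewards."""
    return sum(rewards)

# Rewards along the two example trajectories starting from s1.
left_trajectory_rewards  = [0, 0, 0, 1]    # s1 -> s2 -> s5 -> s8 -> s9
right_trajectory_rewards = [0, -1, 0, 1]   # s1 -> s4 -> s7 -> s8 -> s9

print(undiscounted_return(left_trajectory_rewards))   # 1
print(undiscounted_return(right_trajectory_rewards))  # 0
```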
Infinite Trajectories and Divergence Problem#
Infinite trajectory: suppose a policy generates an infinitely long trajectory such as
$$s_1\xrightarrow[a_2]{r = 0}s_2\xrightarrow[a_3]{r = 0}s_5\xrightarrow[a_3]{r = 0}s_8\xrightarrow[a_2]{r = 1}s_9\xrightarrow[a_5]{r = 1}s_9\xrightarrow[a_5]{r = 1}s_9\cdots$$
If we calculate the return as the direct sum of rewards, we get $0 + 0 + 0 + 1 + 1 + 1 + \cdots = \infty$.
This makes it impossible to properly evaluate the policy using this simple sum.
To deal with infinitely long trajectories, we introduce the concept of discounted return.
$$\text{Discounted return} = 0+\gamma\cdot 0+\gamma^{2}\cdot 0+\gamma^{3}\cdot 1+\gamma^{4}\cdot 1+\gamma^{5}\cdot 1+\cdots=\gamma^{3}\frac{1}{1 - \gamma},$$
where $\gamma\in(0, 1)$ is the discount rate.
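A quick sketch of this computation: the sum is truncated at a long (arbitrarily chosen) horizon and compared with the closed form $\gamma^3/(1-\gamma)$ for one value of $\gamma$.

```python
gamma = 0.9
horizon = 1000   # long enough that the truncation error is negligible for gamma = 0.9

# Rewards of the infinite trajectory above: 0, 0, 0, then 1 forever (staying at s9).
rewards = [0, 0, 0] + [1] * (horizon - 3)

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
closed_form = gamma**3 / (1 - gamma)
print(discounted_return, closed_form)   # both approximately 7.29
```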
If $\gamma$ is close to $0$, the value of the discounted return is mainly determined by the rewards received in the near future. If $\gamma$ is close to $1$, the value of the discounted return is mainly determined by the rewards received in the far future.
Markov decision process#
Key elements of MDP:
Sets:
State: the set of states $\mathcal{S}$.
Action: the set of actions $\mathcal{A}(s)$ associated with each state $s\in\mathcal{S}$.
Reward: the set of rewards $\mathcal{R}(s, a)$.
Probability distribution:
State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s, a)$.
Reward probability: at state $s$, taking action $a$, the probability of getting reward $r$ is $p(r|s, a)$.
Policy: at state $s$, the probability of choosing action $a$ is $\pi(a|s)$.
Markov property: memoryless property
$$p(s_{t + 1}|a_{t + 1}, s_t, \dots, a_1, s_0)=p(s_{t + 1}|a_{t + 1}, s_t)$$
$$p(r_{t + 1}|a_{t + 1}, s_t, \dots, a_1, s_0)=p(r_{t + 1}|a_{t + 1}, s_t)$$
All the concepts introduced in this lecture can be put into the framework of an MDP.
State Values and Bellman Equation#
Why are returns important?#
In fact, returns play a fundamental role in reinforcement learning since they can be used to evaluate whether a policy is good or not.
Following the first policy, the trajectory is $s_1\rightarrow s_3\rightarrow s_4\rightarrow s_4\cdots$. The corresponding discounted return is
$$\text{return}_1=0 + \gamma\cdot 1+\gamma^{2}\cdot 1+\cdots=\gamma(1 + \gamma+\gamma^{2}+\cdots)=\frac{\gamma}{1 - \gamma}.$$
Following the second policy, the trajectory is $s_1\rightarrow s_2\rightarrow s_4\rightarrow s_4\cdots$. The discounted return is
$$\text{return}_2=-1+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots=-1+\gamma(1 + \gamma+\gamma^{2}+\cdots)=-1+\frac{\gamma}{1 - \gamma}.$$
Following the third policy, two trajectories can possibly be obtained. One is $s_1\rightarrow s_3\rightarrow s_4\rightarrow s_4\cdots$, and the other is $s_1\rightarrow s_2\rightarrow s_4\rightarrow s_4\cdots$. The probability of either of the two trajectories is $0.5$. Then, the average return obtained starting from $s_1$ is
$$\text{return}_3=0.5\left(-1+\frac{\gamma}{1 - \gamma}\right)+0.5\left(\frac{\gamma}{1 - \gamma}\right)=-0.5+\frac{\gamma}{1 - \gamma}.$$
By comparing the returns of the three policies, we notice that
$$\text{return}_1>\text{return}_3>\text{return}_2.$$
How to calculate returns?#
A return equals the discounted sum of all the rewards collected along a trajectory.
By definition#
Let $v_i$ denote the return obtained starting from $s_i$ ($i = 1,2,3,4$):
$$\begin{gathered} v_1 = r_1+\gamma r_2+\gamma^{2}r_3+\cdots;\\ v_2 = r_2+\gamma r_3+\gamma^{2}r_4+\cdots;\\ v_3 = r_3+\gamma r_4+\gamma^{2}r_1+\cdots;\\ v_4 = r_4+\gamma r_1+\gamma^{2}r_2+\cdots. \end{gathered}$$
By substitution#
$$\begin{gathered} v_1 = r_1+\gamma(r_2+\gamma r_3+\cdots)=r_1+\gamma v_2;\\ v_2 = r_2+\gamma(r_3+\gamma r_4+\cdots)=r_2+\gamma v_3;\\ v_3 = r_3+\gamma(r_4+\gamma r_1+\cdots)=r_3+\gamma v_4;\\ v_4 = r_4+\gamma(r_1+\gamma r_2+\cdots)=r_4+\gamma v_1. \end{gathered}$$
Write these in the following matrix-vector form:
$$\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix} + \begin{bmatrix} \gamma v_2 \\ \gamma v_3 \\ \gamma v_4 \\ \gamma v_1 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix} + \gamma \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}$$
This can be rewritten as $\mathbf{v}=\mathbf{r}+\gamma\mathbf{P}\mathbf{v}$. We can solve for $\mathbf{v}$ as $\mathbf{v}=(I - \gamma \mathbf{P})^{-1}\mathbf{r}$.
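The closed-form solution $\mathbf{v}=(I-\gamma\mathbf{P})^{-1}\mathbf{r}$ can be checked numerically. The sketch below uses the cyclic transition matrix above together with hypothetical reward values $r_1,\dots,r_4$, chosen only to illustrate the linear solve.

```python
import numpy as np

gamma = 0.9
# Cyclic transitions s1 -> s2 -> s3 -> s4 -> s1, as in the matrix above.
P = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
r = np.array([0.0, 1.0, 1.0, 1.0])   # hypothetical immediate rewards r1..r4

# v = (I - gamma * P)^{-1} r, computed by solving the linear system.
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)
```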
State value#
Single step Process#
In a single step, the agent takes an action $A_t$ at state $S_t$, receives a reward $R_{t+1}$, and moves to the next state $S_{t+1}$.
Multi step Trajectory#
Chaining such steps yields a trajectory, and the discounted sum of the collected rewards is the return.
State value Function#
The state value $v_{\pi}(s)$ is the expected return obtained when starting from state $s$ and following policy $\pi$ (formally, $v_{\pi}(s)=\mathbb{E}[G_t|S_t = s]$, as used below). For the three policies compared above, the state values of $s_1$ are
$$\begin{aligned} &v_{\pi_1}(s_1)=0+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots=\gamma(1 + \gamma+\gamma^{2}+\cdots)=\frac{\gamma}{1 - \gamma}\\ &v_{\pi_2}(s_1)=-1+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots=-1+\gamma(1 + \gamma+\gamma^{2}+\cdots)=-1+\frac{\gamma}{1 - \gamma}\\ &v_{\pi_3}(s_1)=0.5\left(-1+\frac{\gamma}{1 - \gamma}\right)+0.5\left(\frac{\gamma}{1 - \gamma}\right)=-0.5+\frac{\gamma}{1 - \gamma} \end{aligned}$$
Bellman equation#
Consider a random trajectory
$$S_t\xrightarrow{A_t}R_{t + 1},S_{t + 1}\xrightarrow{A_{t + 1}}R_{t + 2},S_{t + 2}\xrightarrow{A_{t + 2}}R_{t + 3},\cdots$$
The return $G_t$ can be written as
$$\begin{aligned} G_t&=R_{t + 1}+\gamma R_{t + 2}+\gamma^{2}R_{t + 3}+\cdots\\ &=R_{t + 1}+\gamma(R_{t + 2}+\gamma R_{t + 3}+\cdots)\\ &=R_{t + 1}+\gamma G_{t + 1} \end{aligned}$$
Then, it follows from the definition of the state value that
$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[G_t|S_t = s]\\ &=\mathbb{E}[R_{t + 1}+\gamma G_{t + 1}|S_t = s]\\ &=\mathbb{E}[R_{t + 1}|S_t = s]+\gamma\mathbb{E}[G_{t + 1}|S_t = s] \end{aligned}$$
First, calculate the first term $\mathbb{E}[R_{t + 1}|S_t = s]$:
$$\begin{aligned} \mathbb{E}[R_{t + 1}|S_t = s]&=\sum_{a}\pi(a|s)\mathbb{E}[R_{t + 1}|S_t = s,A_t = a]~~\text{(the policy's probability of choosing each action at the current state)}\\ &=\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)r~~\text{(the probability of each reward given the current state and action)} \end{aligned}$$
Second, calculate the second term $\mathbb{E}[G_{t + 1}|S_t = s]$:
$$\begin{aligned} \mathbb{E}[G_{t + 1}|S_t = s]&=\sum_{s'}\mathbb{E}[G_{t + 1}|S_t = s,S_{t + 1}=s']p(s'|s)~~\text{(the probability of reaching each next state from the current state)}\\ &=\sum_{s'}\mathbb{E}[G_{t + 1}|S_{t + 1}=s']p(s'|s)~~\text{(Markov property)}\\ &=\sum_{s'}v_{\pi}(s')p(s'|s)\\ &=\sum_{s'}v_{\pi}(s')\sum_{a}p(s'|s,a)\pi(a|s)~~\text{(decompose the transition probability over the policy's actions)} \end{aligned}$$
Therefore, we have
$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[R_{t + 1}|S_t = s]+\gamma\mathbb{E}[G_{t + 1}|S_t = s]\\ &=\underbrace{\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)r}_{\text{mean of immediate rewards}}+\gamma\underbrace{\sum_{a}\pi(a|s)\sum_{s'}p(s'|s,a)v_{\pi}(s')}_{\text{mean of future rewards}}\\ &=\sum_{a}\pi(a|s)\left[\sum_{r}p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v_{\pi}(s')\right],\quad\forall s\in\mathcal{S}. \end{aligned}$$
Every state has an equation like this!
Example 1
Consider the state value of $s_1$:
$\pi(a = a_3|s_1)=1$ and $\pi(a\neq a_3|s_1)=0$.
$p(s' = s_3|s_1,a_3)=1$ and $p(s'\neq s_3|s_1,a_3)=0$.
$p(r = 0|s_1,a_3)=1$ and $p(r\neq 0|s_1,a_3)=0$.
Substituting them into the Bellman equation, and treating the other states similarly, we have
$$\begin{gathered} v_{\pi}(s_1)=0+\gamma v_{\pi}(s_3),~~v_{\pi}(s_2)=1+\gamma v_{\pi}(s_4),\\ v_{\pi}(s_3)=1+\gamma v_{\pi}(s_4),~~v_{\pi}(s_4)=1+\gamma v_{\pi}(s_4). \end{gathered}$$
Solve the above equations one by one from the last to the first:
$$\begin{gathered} v_{\pi}(s_4)=\frac{1}{1 - \gamma},~~v_{\pi}(s_3)=\frac{1}{1 - \gamma},\\ v_{\pi}(s_2)=\frac{1}{1 - \gamma},~~v_{\pi}(s_1)=\frac{\gamma}{1 - \gamma}. \end{gathered}$$
Substituting $\gamma = 0.9$ yields
$v_{\pi}(s_4)=10,~v_{\pi}(s_3)=10,~v_{\pi}(s_2)=10,~v_{\pi}(s_1)=9$.
Example 2
We have
$$\begin{aligned} &v_{\pi}(s_1)=0.5[0+\gamma v_{\pi}(s_3)] + 0.5[-1+\gamma v_{\pi}(s_2)],~~v_{\pi}(s_2)=1+\gamma v_{\pi}(s_4),\\ &v_{\pi}(s_3)=1+\gamma v_{\pi}(s_4),~~v_{\pi}(s_4)=1+\gamma v_{\pi}(s_4). \end{aligned}$$
Solve the above equations one by one from the last to the first:
$$\begin{gathered} v_{\pi}(s_4)=\frac{1}{1 - \gamma},~v_{\pi}(s_3)=\frac{1}{1 - \gamma},~v_{\pi}(s_2)=\frac{1}{1 - \gamma},\\ v_{\pi}(s_1)=0.5[0+\gamma v_{\pi}(s_3)] + 0.5[-1+\gamma v_{\pi}(s_2)]=-0.5+\frac{\gamma}{1 - \gamma}. \end{gathered}$$
Substituting $\gamma = 0.9$ yields
$v_{\pi}(s_4)=10$, $v_{\pi}(s_3)=10$, $v_{\pi}(s_2)=10$, $v_{\pi}(s_1)=-0.5 + 9=8.5$.
Recall that:
$$v_{\pi}(s)=\sum_{a}\pi(a|s)\left[\sum_{r}p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v_{\pi}(s')\right]$$
Rewrite the Bellman equation as
$$v_{\pi}(s)=r_{\pi}(s)+\gamma\sum_{s'}p_{\pi}(s'|s)v_{\pi}(s')$$
where
$$r_{\pi}(s)\triangleq\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)r,\quad p_{\pi}(s'|s)\triangleq\sum_{a}\pi(a|s)p(s'|s,a)$$
Suppose the states are indexed as $s_i$ ($i = 1,\cdots,n$).
For state $s_i$, the Bellman equation is
$$v_{\pi}(s_i)=r_{\pi}(s_i)+\gamma\sum_{s_j}p_{\pi}(s_j|s_i)v_{\pi}(s_j)$$
Putting these equations for all the states together and rewriting them in matrix-vector form gives
$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$$
where
$v_{\pi}=[v_{\pi}(s_1),\cdots,v_{\pi}(s_n)]^{T}\in\mathbb{R}^{n}$
$r_{\pi}=[r_{\pi}(s_1),\cdots,r_{\pi}(s_n)]^{T}\in\mathbb{R}^{n}$
$P_{\pi}\in\mathbb{R}^{n\times n}$, where $[P_{\pi}]_{ij}=p_{\pi}(s_j|s_i)$, is the state transition matrix.
If there are four states, $v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$ can be written out as
$$\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix} = \begin{bmatrix} r_{\pi}(s_1)\\ r_{\pi}(s_2)\\ r_{\pi}(s_3)\\ r_{\pi}(s_4) \end{bmatrix} +\gamma \begin{bmatrix} p_{\pi}(s_1|s_1)&p_{\pi}(s_2|s_1)&p_{\pi}(s_3|s_1)&p_{\pi}(s_4|s_1)\\ p_{\pi}(s_1|s_2)&p_{\pi}(s_2|s_2)&p_{\pi}(s_3|s_2)&p_{\pi}(s_4|s_2)\\ p_{\pi}(s_1|s_3)&p_{\pi}(s_2|s_3)&p_{\pi}(s_3|s_3)&p_{\pi}(s_4|s_3)\\ p_{\pi}(s_1|s_4)&p_{\pi}(s_2|s_4)&p_{\pi}(s_3|s_4)&p_{\pi}(s_4|s_4) \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}$$
Example
For this specific example:
$$\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix} = \begin{bmatrix} 0\\ 1\\ 1\\ 1 \end{bmatrix} +\gamma \begin{bmatrix} 0&0&1&0\\ 0&0&0&1\\ 0&0&0&1\\ 0&0&0&1 \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}$$
Example
For this specific example:
$$\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix} = \begin{bmatrix} 0.5(0)+0.5(-1)\\ 1\\ 1\\ 1 \end{bmatrix} +\gamma \begin{bmatrix} 0&0.5&0.5&0\\ 0&0&0&1\\ 0&0&0&1\\ 0&0&0&1 \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}$$
Solve state values#
The Bellman equation in matrix-vector form is
$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}\Longrightarrow v_{\pi}=(I - \gamma P_{\pi})^{-1}r_{\pi}$$
In practice, we still need to use numerical tools to calculate the matrix inverse.
Can we avoid the matrix inverse operation? Yes, by iterative algorithms.
An iterative solution is:
$$v_{k + 1}=r_{\pi}+\gamma P_{\pi}v_{k}.$$
This algorithm leads to a sequence $\{v_0, v_1, v_2,\cdots\}$. We can show that
$$v_{k}\rightarrow v_{\pi}=(I - \gamma P_{\pi})^{-1}r_{\pi},\quad k\rightarrow\infty.$$
Proof. Define the error as $\delta_{k}=v_{k}-v_{\pi}$. We only need to show $\delta_{k}\rightarrow0$. Substituting $v_{k + 1}=\delta_{k + 1}+v_{\pi}$ and $v_{k}=\delta_{k}+v_{\pi}$ into $v_{k + 1}=r_{\pi}+\gamma P_{\pi}v_{k}$ gives
$$\delta_{k + 1}+v_{\pi}=r_{\pi}+\gamma P_{\pi}(\delta_{k}+v_{\pi}),$$
which can be rewritten as
$$\delta_{k + 1}=-v_{\pi}+r_{\pi}+\gamma P_{\pi}\delta_{k}+\gamma P_{\pi}v_{\pi}=\gamma P_{\pi}\delta_{k}.$$
As a result,
$$\delta_{k + 1}=\gamma P_{\pi}\delta_{k}=\gamma^{2}P_{\pi}^{2}\delta_{k - 1}=\cdots=\gamma^{k + 1}P_{\pi}^{k + 1}\delta_{0}.$$
Note that $0\leq P_{\pi}^{k}\leq1$, which means every entry of $P_{\pi}^{k}$ is no greater than $1$ for any $k = 0,1,2,\cdots$.
That is because $P_{\pi}^{k}\mathbf{1}=\mathbf{1}$, where $\mathbf{1}=[1,\cdots,1]^{T}$. On the other hand, since $\gamma < 1$, we know $\gamma^{k}\rightarrow0$ and hence $\delta_{k + 1}=\gamma^{k + 1}P_{\pi}^{k + 1}\delta_{0}\rightarrow0$ as $k\rightarrow\infty$.
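A sketch of the iterative algorithm $v_{k+1}=r_{\pi}+\gamma P_{\pi}v_k$, using the $r_{\pi}$ and $P_{\pi}$ from the first four-state example above; it confirms that the iterates approach the closed-form solution.

```python
import numpy as np

gamma = 0.9
# P_pi and r_pi from the first four-state example above (s1 -> s3, s2/s3/s4 -> s4).
P_pi = np.array([[0, 0, 1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1]], dtype=float)
r_pi = np.array([0.0, 1.0, 1.0, 1.0])

v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

v = np.zeros(4)                       # arbitrary initial guess v_0
for _ in range(200):                  # iterate v_{k+1} = r_pi + gamma * P_pi @ v_k
    v = r_pi + gamma * P_pi @ v

print(v)          # approximately [9, 10, 10, 10]
print(v_closed)   # closed-form solution, matching Example 1 with gamma = 0.9
```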
Action Value#
The action value of a state-action pair $(s, a)$ is defined as
$$q_{\pi}(s,a)\doteq\mathbb{E}[G_t|S_t = s,A_t = a].$$
It represents the expected return obtained after taking action $a$ at state $s$.
It depends on the state-action pair $(s, a)$ rather than just an action.
The relationship between action values and state values
First, it follows from the properties of conditional expectation that
$$v_{\pi}(s)=\underbrace{\mathbb{E}[G_t|S_t = s]}_{v_{\pi}(s)}=\sum_{a\in\mathcal{A}}\underbrace{\mathbb{E}[G_t|S_t = s,A_t = a]}_{q_{\pi}(s,a)}\pi(a|s).$$
It then follows that $v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)q_{\pi}(s,a)$. As a result, a state value is the expectation of the action values associated with that state.
Second, since the state value is given by
$$v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')\right],$$
comparing it with the expression above leads to
$$q_{\pi}(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s').$$
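Given state values, action values follow directly from the formula above. The sketch below evaluates $q_{\pi}(s,a)$ for a hypothetical deterministic $(s,a)$ pair, using the state values from Example 1 above purely for illustration.

```python
gamma = 0.9

def action_value(reward_probs, next_state_probs, v, gamma=gamma):
    """q_pi(s, a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v_pi(s')."""
    expected_reward = sum(prob * r for r, prob in reward_probs.items())
    expected_next_value = sum(prob * v[s_next] for s_next, prob in next_state_probs.items())
    return expected_reward + gamma * expected_next_value

# Hypothetical model for (s1, a3): reward 0 with probability 1, next state s3.
v_pi = {"s1": 9.0, "s3": 10.0}          # state values taken from Example 1 above
q_s1_a3 = action_value({0: 1.0}, {"s3": 1.0}, v_pi)
print(q_s1_a3)   # 0 + 0.9 * 10 = 9.0
```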
Example
$$\begin{aligned} q_{\pi}(s_1,a_1)&=-1+\gamma v_{\pi}(s_1),\\ q_{\pi}(s_1,a_3)&=0+\gamma v_{\pi}(s_3),\\ q_{\pi}(s_1,a_4)&=-1+\gamma v_{\pi}(s_1),\\ q_{\pi}(s_1,a_5)&=0+\gamma v_{\pi}(s_1). \end{aligned}$$
Bellman optimality equation#
Optimal#
The state value could be used to evaluate if a policy is good or not: if
$$v_{\pi_1}(s)\geq v_{\pi_2}(s)\quad\text{for all }s\in\mathcal{S},$$
then $\pi_1$ is “better” than $\pi_2$.
A policy $\pi^*$ is optimal if $v_{\pi^*}(s)\geq v_{\pi}(s)$ for all $s$ and for any other policy $\pi$.
For every $s\in\mathcal{S}$, the elementwise expression of the Bellman optimality equation (BOE) is
$$\begin{aligned} v(s)&=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\left(\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v(s')\right)\\ &=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)q(s,a), \end{aligned}$$
where $v(s)$ and $v(s')$ are unknown variables to be solved and
$$q(s,a)\doteq\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v(s')$$
Bellman optimality equation (matrix-vector form):
$$v = \max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$$
where the elements corresponding to $s$ or $s'$ are
$$[r_{\pi}]_{s}\triangleq\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)r,$$
$$[P_{\pi}]_{s,s'}=p(s'|s)\triangleq\sum_{a}\pi(a|s)p(s'|s,a).$$
Here $\max_{\pi}$ is performed elementwise.
Example: Consider two variables $x,a\in\mathbb{R}$. Suppose they satisfy
$$x = \max_{a}(2x - 1 - a^{2})$$
This equation has two unknowns. To solve them, first consider the right-hand side: regardless of the value of $x$, $\max_{a}(2x - 1 - a^{2})=2x - 1$, where the maximum is achieved when $a = 0$. Second, when $a = 0$, the equation becomes $x = 2x - 1$, which leads to $x = 1$. Therefore, $a = 0$ and $x = 1$ are the solution of the equation.
Example (how to solve $\max_{\pi}\sum_{a}\pi(a|s)q(s,a)$): suppose $q_1,q_2,q_3\in\mathbb{R}$ are given. Find $c_1^*,c_2^*,c_3^*$ solving
$$\max_{c_1,c_2,c_3}c_1q_1 + c_2q_2 + c_3q_3$$
where $c_1 + c_2 + c_3 = 1$ and $c_1,c_2,c_3\geq0$.
Without loss of generality, suppose $q_3\geq q_1,q_2$. Then, the optimal solution is $c_3^* = 1$ and $c_1^* = c_2^* = 0$. That is because for any $c_1,c_2,c_3$,
$$q_3=(c_1 + c_2 + c_3)q_3=c_1q_3 + c_2q_3 + c_3q_3\geq c_1q_1 + c_2q_2 + c_3q_3.$$
Maximization on the right-hand side of the BOE#
Fix $v(s')$ first and solve for $\pi$:
$$\begin{aligned} v(s)&=\max_{\pi}\sum_{a}\pi(a|s)\left(\sum_{r}p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v(s')\right),\quad\forall s\in\mathcal{S}\\ &=\max_{\pi}\sum_{a}\pi(a|s)q(s,a) \end{aligned}$$
Inspired by the example above, and considering that $\sum_{a}\pi(a|s)=1$, we have
$$\max_{\pi}\sum_{a}\pi(a|s)q(s,a)=\max_{a\in\mathcal{A}(s)}q(s,a)$$
where the optimality is achieved when
$$\pi(a|s)= \begin{cases} 1 & a = a^*\\ 0 & a\neq a^* \end{cases}$$
where $a^*=\arg\max_{a}q(s,a)$.
The BOE refers to a set of equations defined for all states. If we combine these equations, we obtain a concise matrix-vector form, which will be extensively used in this chapter.
The matrix-vector form of the BOE is
$$v=\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v),$$
where $v\in\mathbb{R}^{|\mathcal{S}|}$ and $\max_{\pi}$ is performed in an elementwise manner. The structures of $r_{\pi}$ and $P_{\pi}$ are the same as those in the matrix-vector form of the normal Bellman equation:
$$[r_{\pi}]_{s}\doteq\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r,\quad[P_{\pi}]_{s,s'}=p(s'|s)\doteq\sum_{a\in\mathcal{A}}\pi(a|s)p(s'|s,a)$$
Since the optimal value of $\pi$ is determined by $v$, the right-hand side of the BOE is a function of $v$, denoted as
$$f(v)\doteq\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v)$$
Then, the BOE can be expressed in a concise form as
$$v = f(v)$$
In the remainder of this section, we show how to solve this nonlinear equation.
Contraction property of the BOE#
Theorem: The function
$$f(v)=\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v)$$
on the right-hand side of the BOE is a contraction mapping. In particular, for any $v_1,v_2\in\mathbb{R}^{|\mathcal{S}|}$, it holds that
$$\|f(v_1)-f(v_2)\|_{\infty}\leq\gamma\|v_1 - v_2\|_{\infty},$$
where $\gamma\in(0,1)$ is the discount rate and $\|\cdot\|_{\infty}$ is the maximum norm, which is the maximum absolute value of the elements of a vector.
Proof. Consider any two vectors $v_1,v_2\in\mathbb{R}^{|\mathcal{S}|}$, and suppose that
$$\pi_1^*\doteq\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_1)\quad\text{and}\quad\pi_2^*\doteq\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_2).$$
Then,
$$\begin{aligned} f(v_1)&=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_1)=r_{\pi_1^*}+\gamma P_{\pi_1^*}v_1\geq r_{\pi_2^*}+\gamma P_{\pi_2^*}v_1,\\ f(v_2)&=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_2)=r_{\pi_2^*}+\gamma P_{\pi_2^*}v_2\geq r_{\pi_1^*}+\gamma P_{\pi_1^*}v_2. \end{aligned}$$
As a result,
$$\begin{aligned} f(v_1)-f(v_2)&=r_{\pi_1^*}+\gamma P_{\pi_1^*}v_1-(r_{\pi_2^*}+\gamma P_{\pi_2^*}v_2)\\ &\leq r_{\pi_1^*}+\gamma P_{\pi_1^*}v_1-(r_{\pi_1^*}+\gamma P_{\pi_1^*}v_2)\\ &=\gamma P_{\pi_1^*}(v_1 - v_2). \end{aligned}$$
Similarly, it can be shown that $f(v_2)-f(v_1)\leq\gamma P_{\pi_2^*}(v_2 - v_1)$. Therefore,
$$\gamma P_{\pi_2^*}(v_1 - v_2)\leq f(v_1)-f(v_2)\leq\gamma P_{\pi_1^*}(v_1 - v_2).$$
Define
$$z\doteq\max\{|\gamma P_{\pi_2^*}(v_1 - v_2)|,\,|\gamma P_{\pi_1^*}(v_1 - v_2)|\}\in\mathbb{R}^{|\mathcal{S}|},$$
where $\max(\cdot)$, $|\cdot|$, and $\geq$ are all elementwise operators. By definition, $z\geq0$. On one hand, it is easy to see that
$$-z\leq\gamma P_{\pi_2^*}(v_1 - v_2)\leq f(v_1)-f(v_2)\leq\gamma P_{\pi_1^*}(v_1 - v_2)\leq z,$$
which implies
$$|f(v_1)-f(v_2)|\leq z.$$
It then follows that
$$\|f(v_1)-f(v_2)\|_{\infty}\leq\|z\|_{\infty},$$
where $\|\cdot\|_{\infty}$ is the maximum norm.
On the other hand, suppose that $z_i$ is the $i$th entry of $z$, and $p_i^T$ and $q_i^T$ are the $i$th rows of $P_{\pi_1^*}$ and $P_{\pi_2^*}$, respectively. Then,
$$z_i=\max\{\gamma|p_i^T(v_1 - v_2)|,\,\gamma|q_i^T(v_1 - v_2)|\}.$$
Since $p_i$ is a vector with all nonnegative elements and the sum of the elements is equal to one, it follows that
$$|p_i^T(v_1 - v_2)|\leq p_i^T\cdot|v_1 - v_2|\leq\|v_1 - v_2\|_{\infty}.$$
Similarly, we have $|q_i^T(v_1 - v_2)|\leq\|v_1 - v_2\|_{\infty}$. Therefore, $z_i\leq\gamma\|v_1 - v_2\|_{\infty}$ and hence
$$\|z\|_{\infty}=\max_{i}|z_i|\leq\gamma\|v_1 - v_2\|_{\infty}.$$
Substituting this inequality into $\|f(v_1)-f(v_2)\|_{\infty}\leq\|z\|_{\infty}$ gives
$$\|f(v_1)-f(v_2)\|_{\infty}\leq\gamma\|v_1 - v_2\|_{\infty},$$
which concludes the proof of the contraction property of $f(v)$.
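The contraction property can also be checked numerically on a small MDP. The sketch below builds random transition kernels and rewards (all hypothetical) and verifies $\|f(v_1)-f(v_2)\|_{\infty}\leq\gamma\|v_1-v_2\|_{\infty}$ on random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Hypothetical random model: P[a, s, :] is p(.|s, a), r[a, s] is the expected reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # shape (A, S, S)
r = rng.normal(size=(n_actions, n_states))                        # shape (A, S)

def f(v):
    """f(v)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]."""
    q = r + gamma * np.einsum("asn,n->as", P, v)   # q(s, a) for every (a, s)
    return q.max(axis=0)                           # maximize over actions

v1, v2 = rng.normal(size=n_states), rng.normal(size=n_states)
lhs = np.max(np.abs(f(v1) - f(v2)))
rhs = gamma * np.max(np.abs(v1 - v2))
print(lhs, rhs, lhs <= rhs + 1e-12)   # the contraction bound holds
```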
Theorem (Existence, Uniqueness, and Algorithm)
For the BOE $v = f(v)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$, there always exists a solution $v^*$, and the solution is unique. The solution can be computed iteratively by
$$v_{k + 1}=f(v_k)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_k)$$
This sequence $\{v_k\}$ converges to $v^*$ exponentially fast given any initial guess $v_0$. The convergence rate is determined by $\gamma$.
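The fixed-point iteration in this theorem is exactly value iteration. A minimal sketch on a hypothetical random model (same setup as the contraction check above) shows the iterates converging to the unique solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # p(s'|s, a)
r = rng.normal(size=(n_actions, n_states))                        # expected reward r(s, a)

v = np.zeros(n_states)                    # arbitrary initial guess v_0
for k in range(1000):
    q = r + gamma * np.einsum("asn,n->as", P, v)   # q_k(s, a)
    v_new = q.max(axis=0)                          # v_{k+1} = f(v_k)
    if np.max(np.abs(v_new - v)) < 1e-10:          # stop when the update is negligible
        v = v_new
        break
    v = v_new

print(k, v)   # v is numerically the unique fixed point v* of the BOE
```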
Optimal policy#
Suppose $v^*$ is the solution to the Bellman optimality equation. It satisfies
$$v^*=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$$
Suppose
$$\pi^*=\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$$
Then
$$v^*=r_{\pi^*}+\gamma P_{\pi^*}v^*$$
Therefore, $\pi^*$ is a policy and $v^* = v_{\pi^*}$ is the corresponding state value.
Theorem (Optimality of $v^*$ and $\pi^*$)
The solution $v^*$ is the optimal state value, and $\pi^*$ is an optimal policy. That is, for any policy $\pi$, it holds that
$$v^* = v_{\pi^*}\geq v_{\pi},$$
where $v_{\pi}$ is the state value of $\pi$, and $\geq$ is an elementwise comparison.
Proof. For any policy $\pi$, it holds that
$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}.$$
Since
$$v^*=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)=r_{\pi^*}+\gamma P_{\pi^*}v^*\geq r_{\pi}+\gamma P_{\pi}v^*,$$
we have
$$v^*-v_{\pi}\geq(r_{\pi}+\gamma P_{\pi}v^*)-(r_{\pi}+\gamma P_{\pi}v_{\pi})=\gamma P_{\pi}(v^*-v_{\pi}).$$
Repeatedly applying the above inequality gives
$$v^*-v_{\pi}\geq\gamma P_{\pi}(v^*-v_{\pi})\geq\gamma^{2}P_{\pi}^{2}(v^*-v_{\pi})\geq\cdots\geq\gamma^{n}P_{\pi}^{n}(v^*-v_{\pi}).$$
It follows that
$$v^*-v_{\pi}\geq\lim_{n\rightarrow\infty}\gamma^{n}P_{\pi}^{n}(v^*-v_{\pi}) = 0,$$
where the last equality is true because $\gamma<1$ and $P_{\pi}^{n}$ is a nonnegative matrix with all its elements less than or equal to $1$ (because $P_{\pi}^{n}\mathbf{1}=\mathbf{1}$). Therefore, $v^*\geq v_{\pi}$ for any $\pi$.
Theorem (Greedy optimal policy)
For any $s\in\mathcal{S}$, the deterministic greedy policy
$$\pi^*(a|s)= \begin{cases} 1, & a = a^*(s)\\ 0, & a\neq a^*(s) \end{cases}$$
is an optimal policy for solving the BOE. Here,
$$a^*(s)=\arg\max_{a}q^*(s,a),$$
where
$$q^*(s,a)\doteq\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v^*(s')$$
Proof.
While the matrix-vector form of the optimal policy is $\pi^*=\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$, its elementwise form is
$$\pi^*(s)=\arg\max_{\pi\in\Pi}\sum_{a\in\mathcal{A}}\pi(a|s)\underbrace{\left(\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v^*(s')\right)}_{q^*(s,a)},\quad s\in\mathcal{S}.$$
It is clear that $\sum_{a\in\mathcal{A}}\pi(a|s)q^*(s,a)$ is maximized if $\pi(s)$ selects the action with the greatest $q^*(s,a)$.
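Once $v^*$ is available, the greedy policy in this theorem is obtained by an argmax over $q^*(s,a)$. The sketch below continues the same hypothetical random-model setup; it first runs value iteration for $v^*$ and then extracts one greedy action per state.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # hypothetical p(s'|s, a)
r = rng.normal(size=(n_actions, n_states))                         # hypothetical r(s, a)

# Compute v* by value iteration (as in the sketch above).
v_star = np.zeros(n_states)
for _ in range(1000):
    v_star = (r + gamma * np.einsum("asn,n->as", P, v_star)).max(axis=0)

q_star = r + gamma * np.einsum("asn,n->as", P, v_star)   # q*(s, a), shape (A, S)
greedy_actions = q_star.argmax(axis=0)                   # a*(s) = argmax_a q*(s, a)
print(greedy_actions)    # one deterministic optimal action index per state
```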
Factors that influence optimal policies#
The BOE is a powerful tool for analyzing optimal policies. We next apply the BOE to study what factors can influence optimal policies. This question can be easily answered by observing the elementwise expression of the BOE:
$$v(s)=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a | s)\left(\sum_{r\in\mathcal{R}}p(r | s, a)r+\gamma\sum_{s'\in\mathcal{S}}p(s' | s, a)v(s')\right),\quad s\in\mathcal{S}.$$
The optimal state value and optimal policy are determined by the following parameters:
the immediate reward $r$,
the discount rate $\gamma$,
the system model $p(s' | s, a)$ and $p(r | s, a)$.
With the system model fixed, we next discuss how the optimal policy varies when we change the values of $r$ and $\gamma$.
Impact of the discount rate#
Impact of the reward values#
Theorem (Optimal policy invariance)
Consider a Markov decision process with $v^*\in\mathbb{R}^{|\mathcal{S}|}$ as the optimal state value satisfying
$$v^* = \max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v^*).$$
If every reward $r\in\mathcal{R}$ is changed by an affine transformation to $\alpha r+\beta$, where $\alpha,\beta\in\mathbb{R}$ and $\alpha > 0$, then the corresponding optimal state value $v'$ is also an affine transformation of $v^*$:
$$v'=\alpha v^*+\frac{\beta}{1 - \gamma}\mathbf{1},$$
where $\gamma\in(0,1)$ is the discount rate and $\mathbf{1} = [1,\ldots,1]^T$.
Consequently, the optimal policy derived from $v'$ is invariant to the affine transformation of the reward values.
Proof. For any policy $\pi$, define $r_{\pi}=[\ldots,r_{\pi}(s),\ldots]^T$, where
$$r_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r,\quad s\in\mathcal{S}.$$
If $r\to\alpha r+\beta$, then $r_{\pi}(s)\to\alpha r_{\pi}(s)+\beta$ and hence $r_{\pi}\to\alpha r_{\pi}+\beta\mathbf{1}$, where $\mathbf{1}=[1,\ldots,1]^T$. In this case, the BOE becomes
$$v'=\max_{\pi\in\Pi}(\alpha r_{\pi}+\beta\mathbf{1}+\gamma P_{\pi}v').$$
We next solve the new BOE by showing that $v'=\alpha v^* + c\mathbf{1}$ with $c = \beta/(1 - \gamma)$ is a solution. In particular, substituting $v'=\alpha v^* + c\mathbf{1}$ gives
$$\alpha v^*+c\mathbf{1}=\max_{\pi\in\Pi}(\alpha r_{\pi}+\beta\mathbf{1}+\gamma P_{\pi}(\alpha v^* + c\mathbf{1}))=\max_{\pi\in\Pi}(\alpha r_{\pi}+\beta\mathbf{1}+\alpha\gamma P_{\pi}v^*+\gamma c\mathbf{1}),$$
where the last equality is due to the fact that $P_{\pi}\mathbf{1}=\mathbf{1}$. The above equation can be reorganized as
$$\alpha v^*=\max_{\pi\in\Pi}(\alpha r_{\pi}+\alpha\gamma P_{\pi}v^*)+\beta\mathbf{1}+\gamma c\mathbf{1}-c\mathbf{1},$$
which is equivalent to
$$\beta\mathbf{1}+\gamma c\mathbf{1}-c\mathbf{1}=0.$$
Since $c = \beta/(1 - \gamma)$, the above equation is valid and hence $v'=\alpha v^*+c\mathbf{1}$ is the solution.
Since the new equation is itself a BOE, $v'$ is also its unique solution. Finally, since $v'$ is an affine transformation of $v^*$, the relative relationships between the action values remain the same.
Hence, the greedy optimal policy derived from $v'$ is the same as that derived from $v^*$: $\arg\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v')$ is the same as $\arg\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v^*)$.
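The invariance result can be checked numerically: scaling and shifting all rewards changes $v^*$ by the predicted affine amount while leaving the greedy policy unchanged. The random model below is hypothetical and used only for this check.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # hypothetical p(s'|s, a)
r = rng.normal(size=(n_actions, n_states))                         # hypothetical r(s, a)
alpha, beta = 2.0, 5.0    # affine transformation r -> alpha * r + beta

def solve_boe(rewards, iters=2000):
    """Value iteration for v*, plus the corresponding greedy policy."""
    v = np.zeros(n_states)
    for _ in range(iters):
        v = (rewards + gamma * np.einsum("asn,n->as", P, v)).max(axis=0)
    q = rewards + gamma * np.einsum("asn,n->as", P, v)
    return v, q.argmax(axis=0)

v1, pi1 = solve_boe(r)
v2, pi2 = solve_boe(alpha * r + beta)

print(np.allclose(v2, alpha * v1 + beta / (1 - gamma)))  # True: v' = alpha v* + beta/(1-gamma) 1
print(np.array_equal(pi1, pi2))                          # True: the greedy policy is unchanged
```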