Reinforcement Learning

Basic Concepts#

Law of Total Probability#

Discrete Random Variables#

  • Partition Conditions:

    • Events $B_1,B_2,\cdots,B_n$ form a partition of the sample space $\Omega$.

    • $B_i\cap B_j=\varnothing$ for $i\neq j$ ($i,j=1,2,\cdots,n$) (pairwise mutual exclusivity).

    • $\bigcup_{i=1}^{n}B_i=\Omega$ (the union covers the whole sample space).

  • Formula: For any event $A\subseteq\Omega$ and any random variable $X$,

$$P(A)=\sum_{i=1}^{n}P(B_i)\,P(A\mid B_i),\qquad E(X)=\sum_{i=1}^{n}P(B_i)\,E(X\mid B_i)$$
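
These identities are easy to verify numerically. A minimal sketch (the partition probabilities and conditional quantities below are made-up numbers, not from the text):

```python
import numpy as np

# Hypothetical partition B1, B2, B3 of the sample space.
P_B = np.array([0.2, 0.5, 0.3])          # P(B_i), sums to 1
P_A_given_B = np.array([0.1, 0.4, 0.8])  # P(A | B_i)
E_X_given_B = np.array([1.0, 2.0, 5.0])  # E(X | B_i)

P_A = np.sum(P_B * P_A_given_B)  # law of total probability  -> 0.46
E_X = np.sum(P_B * E_X_given_B)  # law of total expectation  -> 2.7
print(P_A, E_X)
```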

Continuous Random Variables#

  • Given Conditions:

    • $X$ is a continuous random variable with probability density function $f_X(x)$.

    • $Y$ is another random variable taking values $y$.

    • The conditional probability density function of $X$ given $Y=y$ is $f_{X|Y}(x|y)$.

    • The marginal probability density function of $Y$ is $f_Y(y)$.

  • Formula:

$$f_X(x)=\int_{-\infty}^{\infty}f_Y(y)\,f_{X|Y}(x|y)\,dy,\qquad E(X)=\int_{-\infty}^{\infty}f_Y(y)\,E(X\mid Y=y)\,dy$$

State and State space#

$s$ can include multi-dimensional information, such as velocity, level, temperature, and other attributes of the agent.

  • State: The status of the agent with respect to the environment.

  • State space: The set of all states, denoted as $\mathcal{S}=\{s_i\}_{i=1}^{9}$.


Action and Action space#

  • Action: For each state, there are five possible actions, $a_1,\cdots,a_5$:

    • $a_1$: move upwards; $a_2$: move rightwards;

    • $a_3$: move downwards; $a_4$: move leftwards;

    • $a_5$: stay unchanged.

  • Action space: the set of all possible actions at a state, $\mathcal{A}(s_i)=\{a_i\}_{i=1}^{5}$.

State Transition Probability#

State transition defines the interaction with the environment.

  • State Transition: When an agent takes an action and moves from one state to another, this process is called state transition.

  • If action $a_2$ (move rightwards) is chosen at state $s_1$, the next state is $s_2$, denoted as $s_1\xrightarrow{a_2}s_2$.

  • If action $a_1$ (move upwards) is chosen at state $s_1$, the next state remains $s_1$, denoted as $s_1\xrightarrow{a_1}s_1$.

  • State Transition Probability: use probability to describe state transition.

    • $p(s_2|s_1,a_2)=1$: the probability of transitioning from state $s_1$ to state $s_2$ when action $a_2$ is taken is $1$.

    • $p(s_i|s_1,a_2)=0$ for all $i\neq 2$: the probability of transitioning from $s_1$ to any state other than $s_2$ when taking action $a_2$ is $0$.

  • This is a deterministic case. State transition can also be stochastic, for example $p(s_1|s_1,a_2)=0.5$ and $p(s_5|s_1,a_2)=0.5$.
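
In code, the transition probabilities $p(s'|s,a)$ can be stored as a simple lookup table. The sketch below is illustrative (the state and action names, and the stochastic entry, are assumptions, not part of the original text):

```python
# p[(s, a)] maps each possible next state s' to p(s'|s, a).
p = {
    ("s1", "a1"): {"s1": 1.0},             # deterministic: bumping the boundary keeps the agent at s1
    ("s1", "a2"): {"s2": 1.0},             # deterministic: p(s2|s1, a2) = 1
    ("s4", "a2"): {"s4": 0.5, "s5": 0.5},  # a stochastic transition (illustrative numbers)
}

def transition_prob(s, a, s_next):
    """Return p(s'|s, a); transitions not listed have probability 0."""
    return p.get((s, a), {}).get(s_next, 0.0)

print(transition_prob("s1", "a2", "s2"))  # 1.0
print(transition_prob("s1", "a2", "s3"))  # 0.0
```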

Policy#

  • Definition: a policy tells the agent which actions to take in a given state.

  • Intuitive Representation: arrows in a grid-like illustration can be used to demonstrate a policy, with the arrows showing the recommended action in each state.

  • Mathematical Representation: use conditional probability (take state $s_1$ as an example). The illustration compares a deterministic policy (left) with a stochastic policy (right).

Left: $\pi(a_1|s_1)=0,~~\pi(a_2|s_1)=1,~~\pi(a_3|s_1)=0,~~\pi(a_4|s_1)=0,~~\pi(a_5|s_1)=0.$

Right: $\pi(a_1|s_1)=0,~~\pi(a_2|s_1)=0.5,~~\pi(a_3|s_1)=0.5,~~\pi(a_4|s_1)=0,~~\pi(a_5|s_1)=0.$
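
A policy can be stored the same way, as a table of probabilities $\pi(a|s)$, and actions can be sampled from it. A minimal sketch using the stochastic policy at $s_1$ above:

```python
import random

# pi[s][a] = pi(a|s); the stochastic policy at s1 from the example above.
pi = {
    "s1": {"a1": 0.0, "a2": 0.5, "a3": 0.5, "a4": 0.0, "a5": 0.0},
}

def sample_action(s):
    """Draw an action with probability pi(a|s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s1"))  # "a2" or "a3", each with probability 0.5
```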

Reward#

  • Definition: It is a real number obtained after an agent takes an action.

    • A positive reward encourages taking such actions, while a negative reward punishes taking them.

    • Zero reward implies no punishment. In some cases, a positive value can even represent punishment.

  • Representation Method: use conditional probability for the mathematical description.

  • Example at state $s_1$: $p(r=-1|s_1,a_1)=1$ and $p(r\neq-1|s_1,a_1)=0$.

  • Role of Reward: reward can be regarded as a human-machine interface. It is used to guide the agent to behave as expected.

    • Grid-world example rewards:

      • If the agent tries to exit the boundary, $r_{\text{bound}}=-1$.

      • If the agent attempts to enter a forbidden cell, $r_{\text{forbid}}=-1$.

      • When the agent reaches the target cell, $r_{\text{target}}=+1$.

      • Otherwise, the agent gets a reward of $r=0$.

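
The reward rules above translate directly into a small function. The helper predicates (`exits_boundary`, `is_forbidden`, `is_target`) are assumptions standing in for the concrete grid layout:

```python
R_BOUND, R_FORBID, R_TARGET, R_DEFAULT = -1.0, -1.0, 1.0, 0.0

def reward(s, a, s_next, exits_boundary, is_forbidden, is_target):
    """Reward for taking action a in state s and landing in s_next.

    The three predicate functions are assumed to be supplied by the grid layout.
    """
    if exits_boundary(s, a):
        return R_BOUND       # tried to leave the grid
    if is_forbidden(s_next):
        return R_FORBID      # entered a forbidden cell
    if is_target(s_next):
        return R_TARGET      # reached the target cell
    return R_DEFAULT
```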

Trajectory and Return#

  • Trajectory: A trajectory is a state-action-reward chain. It shows the path an agent takes: the sequence of states it visits, actions it takes, and rewards it gets.

  • Return: Also called total rewards or cumulative rewards. It’s the sum of all the rewards collected along a trajectory.


  • Example (left): starting from $s_1$,

    • The trajectory: $s_1\xrightarrow[a_2]{r=0}s_2\xrightarrow[a_3]{r=0}s_5\xrightarrow[a_3]{r=0}s_8\xrightarrow[a_2]{r=1}s_9$.

    • Return $=0+0+0+1=1$.

  • Example (right): starting from $s_1$,

    • The trajectory: $s_1\xrightarrow[a_3]{r=0}s_4\xrightarrow[a_3]{r=-1}s_7\xrightarrow[a_2]{r=0}s_8\xrightarrow[a_2]{r=1}s_9$.

    • Return $=0-1+0+1=0$.

Infinite Trajectories and Divergence Problem#

  • Infinite Trajectory: suppose a policy generates an infinitely long trajectory such as
$$s_1\xrightarrow[a_2]{r=0}s_2\xrightarrow[a_3]{r=0}s_5\xrightarrow[a_3]{r=0}s_8\xrightarrow[a_2]{r=1}s_9\xrightarrow[a_5]{r=1}s_9\xrightarrow[a_5]{r=1}s_9\cdots$$

If we calculate the return as the direct sum of the rewards, we get $0+0+0+1+1+1+\cdots=\infty$.

This makes it impossible to properly evaluate the policy using this simple sum.

To deal with infinitely long trajectories, we introduce the concept of discounted return.

$$\text{Discounted return}=0+\gamma\cdot 0+\gamma^{2}\cdot 0+\gamma^{3}\cdot 1+\gamma^{4}\cdot 1+\gamma^{5}\cdot 1+\cdots=\gamma^{3}\frac{1}{1-\gamma},$$

where $\gamma\in(0,1)$ is the discount rate.

  • If $\gamma$ is close to $0$, the discounted return is mainly determined by the rewards received in the near future.
  • If $\gamma$ is close to $1$, the discounted return is mainly determined by the rewards received in the far future.
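
A short sketch for computing discounted returns; truncating the infinite trajectory above after enough steps approaches the closed-form value $\gamma^{3}\frac{1}{1-\gamma}$:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * rewards[t]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
rewards = [0, 0, 0] + [1] * 997   # the trajectory above, truncated to 1000 steps
print(discounted_return(rewards, gamma))  # ~7.29
print(gamma ** 3 / (1 - gamma))           # 7.29, the closed-form value
```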

Markov decision process#

Key elements of MDP:

  • Sets:

    • State: the set of states $\mathcal{S}$.

    • Action: the set of actions $\mathcal{A}(s)$ associated with each state $s\in\mathcal{S}$.

    • Reward: the set of rewards $\mathcal{R}(s,a)$.

  • Probability distribution:

    • State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s,a)$.

    • Reward probability: at state $s$, taking action $a$, the probability of getting reward $r$ is $p(r|s,a)$.

  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a|s)$.

  • Markov property: memoryless property

    • $p(s_{t+1}|a_{t+1},s_t,\dots,a_1,s_0)=p(s_{t+1}|a_{t+1},s_t)$

    • $p(r_{t+1}|a_{t+1},s_t,\dots,a_1,s_0)=p(r_{t+1}|a_{t+1},s_t)$

All the concepts introduced in this lecture can be put into the framework of an MDP.

State Values and Bellman Equation#

Why are returns important?#

In fact, returns play a fundamental role in reinforcement learning since they can evaluate whether a policy is good or not. Consider three policies that behave differently at state $s_1$.

  • Following the first policy, the trajectory is $s_1\rightarrow s_3\rightarrow s_4\rightarrow s_4\cdots$. The corresponding discounted return is
$$\begin{aligned} \text{return}_1&=0+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots\\ &=\gamma(1+\gamma+\gamma^{2}+\cdots)\\ &=\frac{\gamma}{1-\gamma}. \end{aligned}$$
  • Following the second policy, the trajectory is $s_1\rightarrow s_2\rightarrow s_4\rightarrow s_4\cdots$. The discounted return is
$$\begin{aligned} \text{return}_2&=-1+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots\\ &=-1+\gamma(1+\gamma+\gamma^{2}+\cdots)\\ &=-1+\frac{\gamma}{1-\gamma}. \end{aligned}$$
  • Following the third policy, two trajectories can possibly be obtained. One is $s_1\rightarrow s_3\rightarrow s_4\rightarrow s_4\cdots$, and the other is $s_1\rightarrow s_2\rightarrow s_4\rightarrow s_4\cdots$. Each of the two trajectories has probability $0.5$. Then, the average return obtained starting from $s_1$ is
$$\begin{aligned} \text{return}_3&=0.5\left(-1+\frac{\gamma}{1-\gamma}\right)+0.5\left(\frac{\gamma}{1-\gamma}\right)\\ &=-0.5+\frac{\gamma}{1-\gamma}. \end{aligned}$$

By comparing the returns of the three policies, we notice that

$$\text{return}_1>\text{return}_3>\text{return}_2.$$
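
As a quick numerical check, evaluating the three closed-form returns at, say, $\gamma=0.9$ reproduces this ordering:

```python
gamma = 0.9
return_1 = gamma / (1 - gamma)                # 9.0
return_2 = -1 + gamma / (1 - gamma)           # 8.0
return_3 = 0.5 * return_2 + 0.5 * return_1    # 8.5
print(return_1 > return_3 > return_2)         # True
```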

How to calculate returns?#

A return equals the discounted sum of all the rewards collected along a trajectory. Let $v_i$ denote the return obtained by starting from $s_i$ for $i=1,2,3,4$.

By definition#

Let $v_i$ denote the return obtained starting from $s_i$ ($i=1,2,3,4$):

$$\begin{aligned} v_1 &= r_1+\gamma r_2+\gamma^{2}r_3+\cdots;\\ v_2 &= r_2+\gamma r_3+\gamma^{2}r_4+\cdots;\\ v_3 &= r_3+\gamma r_4+\gamma^{2}r_1+\cdots;\\ v_4 &= r_4+\gamma r_1+\gamma^{2}r_2+\cdots. \end{aligned}$$

By substitution#

$$\begin{aligned} v_1 &= r_1+\gamma(r_2+\gamma r_3+\cdots)=r_1+\gamma v_2;\\ v_2 &= r_2+\gamma(r_3+\gamma r_4+\cdots)=r_2+\gamma v_3;\\ v_3 &= r_3+\gamma(r_4+\gamma r_1+\cdots)=r_3+\gamma v_4;\\ v_4 &= r_4+\gamma(r_1+\gamma r_2+\cdots)=r_4+\gamma v_1. \end{aligned}$$

Write these in the following matrix-vector form:

$$\begin{bmatrix} v_1\\ v_2\\ v_3\\ v_4 \end{bmatrix} = \begin{bmatrix} r_1\\ r_2\\ r_3\\ r_4 \end{bmatrix} + \begin{bmatrix} \gamma v_2\\ \gamma v_3\\ \gamma v_4\\ \gamma v_1 \end{bmatrix} = \begin{bmatrix} r_1\\ r_2\\ r_3\\ r_4 \end{bmatrix} + \gamma \begin{bmatrix} 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} v_1\\ v_2\\ v_3\\ v_4 \end{bmatrix}$$

which can be rewritten as $\mathbf{v}=\mathbf{r}+\gamma\mathbf{P}\mathbf{v}$. We can solve for $\mathbf{v}$ as $\mathbf{v}=(I-\gamma\mathbf{P})^{-1}\mathbf{r}$.
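
With NumPy, this closed-form solution is a one-liner. The cyclic $\mathbf{P}$ is taken from the example above; the reward values are placeholders, since $r_1,\dots,r_4$ are left symbolic in the text:

```python
import numpy as np

gamma = 0.9
P = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)   # the cyclic transition structure above
r = np.array([0.0, 0.0, 0.0, 1.0])          # placeholder values for r1..r4

v = np.linalg.solve(np.eye(4) - gamma * P, r)   # v = (I - gamma P)^{-1} r
print(v)
```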

State value#

Single step Process#

  • Elements

    • $t, t+1$: discrete time instants.

    • $S_t$: the state at time $t$.

    • $A_t$: the action taken at state $S_t$.

    • $R_{t+1}$: the reward obtained after taking $A_t$.

    • $S_{t+1}$: the state transitioned to after taking $A_t$.

    • Note that $S_t$, $A_t$, $R_{t+1}$ are all random variables.

  • Probability Distributions

    • $S_t\rightarrow A_t$ is governed by $\pi(A_t=a|S_t=s)$.

    • $S_t,A_t\rightarrow R_{t+1}$ is governed by $p(R_{t+1}=r|S_t=s,A_t=a)$.

    • $S_t,A_t\rightarrow S_{t+1}$ is governed by $p(S_{t+1}=s'|S_t=s,A_t=a)$.

    • Assume we know the model (i.e., the probability distributions).

Multi step Trajectory#

  • Discounted Return ($G_t$)

    • Formula: $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\cdots$, where $\gamma\in[0,1)$ is the discount rate.

    • $G_t$ is a random variable since $R_{t+1},R_{t+2},\cdots$ are random variables.

State value Function#

  • Definition

    • $v_{\pi}(s)=\mathbb{E}[G_t|S_t=s]$, which is the expectation (expected value or mean) of $G_t$.
  • Remarks

    • It is a function of $s$: a conditional expectation given that the state starts from $s$.

    • It depends on the policy $\pi$. Different policies may result in different state values.

    • It represents the “value” of a state. A larger state value indicates a better policy, as it implies that greater cumulative rewards can be obtained.

  • Relationship between Return and State Value

    • The state value is the mean of all possible returns starting from a state. If $\pi(a|s)$, $p(r|s,a)$, and $p(s'|s,a)$ are deterministic, then the state value is the same as the return.
  • Example

$$\begin{aligned} v_{\pi_1}(s_1)&=0+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots=\gamma(1+\gamma+\gamma^{2}+\cdots)=\frac{\gamma}{1-\gamma}\\ v_{\pi_2}(s_1)&=-1+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots=-1+\gamma(1+\gamma+\gamma^{2}+\cdots)=-1+\frac{\gamma}{1-\gamma}\\ v_{\pi_3}(s_1)&=0.5\left(-1+\frac{\gamma}{1-\gamma}\right)+0.5\left(\frac{\gamma}{1-\gamma}\right)=-0.5+\frac{\gamma}{1-\gamma} \end{aligned}$$

Bellman equation#

Consider a random trajectory $S_t\xrightarrow{A_t}R_{t+1},\ S_{t+1}\xrightarrow{A_{t+1}}R_{t+2},\ S_{t+2}\xrightarrow{A_{t+2}}R_{t+3},\cdots$

The return $G_t$ can be written as

$$\begin{aligned} G_t&=R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\cdots\\ &=R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+\cdots)\\ &=R_{t+1}+\gamma G_{t+1} \end{aligned}$$

Then, it follows from the definition of the state value that

$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[G_t|S_t=s]\\ &=\mathbb{E}[R_{t+1}+\gamma G_{t+1}|S_t=s]\\ &=\mathbb{E}[R_{t+1}|S_t=s]+\gamma\,\mathbb{E}[G_{t+1}|S_t=s] \end{aligned}$$

First, calculate the first term $\mathbb{E}[R_{t+1}|S_t=s]$:

$$\begin{aligned} \mathbb{E}[R_{t+1}|S_t=s]&=\sum_{a}\pi(a|s)\,\mathbb{E}[R_{t+1}|S_t=s,A_t=a]&&\text{(condition on each action the policy may choose at the current state)}\\ &=\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r&&\text{(probability of each reward given the current state and action)} \end{aligned}$$

Second, calculate the second term $\mathbb{E}[G_{t+1}|S_t=s]$:

$$\begin{aligned} \mathbb{E}[G_{t+1}|S_t=s]&=\sum_{s'}\mathbb{E}[G_{t+1}|S_t=s,S_{t+1}=s']\,p(s'|s)&&\text{(condition on each possible next state)}\\ &=\sum_{s'}\mathbb{E}[G_{t+1}|S_{t+1}=s']\,p(s'|s)&&\text{(Markov property)}\\ &=\sum_{s'}v_{\pi}(s')\,p(s'|s)\\ &=\sum_{s'}v_{\pi}(s')\sum_{a}p(s'|s,a)\,\pi(a|s)&&\text{(expand }p(s'|s)\text{ over the actions chosen by the policy)} \end{aligned}$$

Therefore, we have

$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[R_{t+1}|S_t=s]+\gamma\,\mathbb{E}[G_{t+1}|S_t=s]\\ &=\underbrace{\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r}_{\text{mean of immediate rewards}}+\gamma\underbrace{\sum_{a}\pi(a|s)\sum_{s'}p(s'|s,a)\,v_{\pi}(s')}_{\text{mean of future rewards}}\\ &=\sum_{a}\pi(a|s)\left[\sum_{r}p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v_{\pi}(s')\right],\quad\forall s\in\mathcal{S}. \end{aligned}$$

Every state has an equation of this form!

Example 1

Consider the state value of $s_1$:

  • $\pi(a=a_3|s_1)=1$ and $\pi(a\neq a_3|s_1)=0$.

  • $p(s'=s_3|s_1,a_3)=1$ and $p(s'\neq s_3|s_1,a_3)=0$.

  • $p(r=0|s_1,a_3)=1$ and $p(r\neq 0|s_1,a_3)=0$.

Substituting them into the Bellman equation, and proceeding similarly for the other states, we have

$$\begin{aligned} v_{\pi}(s_1)&=0+\gamma v_{\pi}(s_3),\quad v_{\pi}(s_2)=1+\gamma v_{\pi}(s_4),\\ v_{\pi}(s_3)&=1+\gamma v_{\pi}(s_4),\quad v_{\pi}(s_4)=1+\gamma v_{\pi}(s_4). \end{aligned}$$

Solve the above equations one by one from the last to the first:

$$\begin{aligned} v_{\pi}(s_4)&=\frac{1}{1-\gamma},\quad v_{\pi}(s_3)=\frac{1}{1-\gamma},\\ v_{\pi}(s_2)&=\frac{1}{1-\gamma},\quad v_{\pi}(s_1)=\frac{\gamma}{1-\gamma}. \end{aligned}$$

Substituting $\gamma=0.9$ yields

$$v_{\pi}(s_4)=10,~v_{\pi}(s_3)=10,~v_{\pi}(s_2)=10,~v_{\pi}(s_1)=9.$$

Example 2

We have

$$\begin{aligned} v_{\pi}(s_1)&=0.5[0+\gamma v_{\pi}(s_3)]+0.5[-1+\gamma v_{\pi}(s_2)],\quad v_{\pi}(s_2)=1+\gamma v_{\pi}(s_4),\\ v_{\pi}(s_3)&=1+\gamma v_{\pi}(s_4),\quad v_{\pi}(s_4)=1+\gamma v_{\pi}(s_4). \end{aligned}$$

Solve the above equations one by one from the last to the first:

$$\begin{aligned} v_{\pi}(s_4)&=\frac{1}{1-\gamma},\quad v_{\pi}(s_3)=\frac{1}{1-\gamma},\quad v_{\pi}(s_2)=\frac{1}{1-\gamma},\\ v_{\pi}(s_1)&=0.5[0+\gamma v_{\pi}(s_3)]+0.5[-1+\gamma v_{\pi}(s_2)]=-0.5+\frac{\gamma}{1-\gamma}. \end{aligned}$$

Substituting $\gamma=0.9$ yields

$$v_{\pi}(s_4)=10,~v_{\pi}(s_3)=10,~v_{\pi}(s_2)=10,~v_{\pi}(s_1)=-0.5+9=8.5.$$

Matrix vector form#

Recall that:

$$v_{\pi}(s)=\sum_{a}\pi(a|s)\left[\sum_{r}p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v_{\pi}(s')\right]$$

Rewrite the Bellman equation as

$$v_{\pi}(s)=r_{\pi}(s)+\gamma\sum_{s'}p_{\pi}(s'|s)\,v_{\pi}(s')$$

where

$$r_{\pi}(s)\triangleq\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r,\qquad p_{\pi}(s'|s)\triangleq\sum_{a}\pi(a|s)\,p(s'|s,a)$$

Suppose the states are indexed as $s_i$ ($i=1,\cdots,n$).

For state $s_i$, the Bellman equation is

$$v_{\pi}(s_i)=r_{\pi}(s_i)+\gamma\sum_{s_j}p_{\pi}(s_j|s_i)\,v_{\pi}(s_j)$$

Putting the equations for all states together and rewriting them in matrix-vector form gives

$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$$

where

  • $v_{\pi}=[v_{\pi}(s_1),\cdots,v_{\pi}(s_n)]^{T}\in\mathbb{R}^{n}$

  • $r_{\pi}=[r_{\pi}(s_1),\cdots,r_{\pi}(s_n)]^{T}\in\mathbb{R}^{n}$

  • $P_{\pi}\in\mathbb{R}^{n\times n}$, where $[P_{\pi}]_{ij}=p_{\pi}(s_j|s_i)$, is the state transition matrix

If there are four states, $v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$ can be written out as

$$\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix} = \begin{bmatrix} r_{\pi}(s_1)\\ r_{\pi}(s_2)\\ r_{\pi}(s_3)\\ r_{\pi}(s_4) \end{bmatrix} +\gamma \begin{bmatrix} p_{\pi}(s_1|s_1)&p_{\pi}(s_2|s_1)&p_{\pi}(s_3|s_1)&p_{\pi}(s_4|s_1)\\ p_{\pi}(s_1|s_2)&p_{\pi}(s_2|s_2)&p_{\pi}(s_3|s_2)&p_{\pi}(s_4|s_2)\\ p_{\pi}(s_1|s_3)&p_{\pi}(s_2|s_3)&p_{\pi}(s_3|s_3)&p_{\pi}(s_4|s_3)\\ p_{\pi}(s_1|s_4)&p_{\pi}(s_2|s_4)&p_{\pi}(s_3|s_4)&p_{\pi}(s_4|s_4) \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}$$

Example: for the policy from Example 1 above,

$$\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix} = \begin{bmatrix} 0\\ 1\\ 1\\ 1 \end{bmatrix} +\gamma \begin{bmatrix} 0&0&1&0\\ 0&0&0&1\\ 0&0&0&1\\ 0&0&0&1 \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}$$

Example: for the policy from Example 2 above,

$$\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix} = \begin{bmatrix} 0.5(0)+0.5(-1)\\ 1\\ 1\\ 1 \end{bmatrix} +\gamma \begin{bmatrix} 0&0.5&0.5&0\\ 0&0&0&1\\ 0&0&0&1\\ 0&0&0&1 \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}$$

Solve state values#

The Bellman equation in matrix vector form is

$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}\ \Longrightarrow\ v_{\pi}=(I-\gamma P_{\pi})^{-1}r_{\pi}$$

In practice, we still need to use numerical tools to calculate the matrix inverse.

Can we avoid the matrix inverse operation? Yes, by iterative algorithms.

An iterative solution is:

$$v_{k+1}=r_{\pi}+\gamma P_{\pi}v_{k}.$$

This algorithm generates a sequence $\{v_0,v_1,v_2,\cdots\}$. We can show that

$$v_{k}\rightarrow v_{\pi}=(I-\gamma P_{\pi})^{-1}r_{\pi},\quad k\rightarrow\infty.$$

Proof. Define the error as $\delta_{k}=v_{k}-v_{\pi}$. We only need to show $\delta_{k}\rightarrow 0$. Substituting $v_{k+1}=\delta_{k+1}+v_{\pi}$ and $v_{k}=\delta_{k}+v_{\pi}$ into $v_{k+1}=r_{\pi}+\gamma P_{\pi}v_{k}$ gives

$$\delta_{k+1}+v_{\pi}=r_{\pi}+\gamma P_{\pi}(\delta_{k}+v_{\pi}),$$

which can be rewritten as

$$\delta_{k+1}=-v_{\pi}+r_{\pi}+\gamma P_{\pi}\delta_{k}+\gamma P_{\pi}v_{\pi}=\gamma P_{\pi}\delta_{k}.$$

As a result,

$$\delta_{k+1}=\gamma P_{\pi}\delta_{k}=\gamma^{2}P_{\pi}^{2}\delta_{k-1}=\cdots=\gamma^{k+1}P_{\pi}^{k+1}\delta_{0}.$$

Note that $0\leq P_{\pi}^{k}\leq 1$ elementwise, which means every entry of $P_{\pi}^{k}$ is no greater than $1$ for any $k=0,1,2,\cdots$.

That is because $P_{\pi}^{k}\mathbf{1}=\mathbf{1}$, where $\mathbf{1}=[1,\cdots,1]^{T}$. On the other hand, since $\gamma<1$, we know $\gamma^{k}\rightarrow 0$, and hence $\delta_{k+1}=\gamma^{k+1}P_{\pi}^{k+1}\delta_{0}\rightarrow 0$ as $k\rightarrow\infty$.
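
Both solution methods can be checked on the first matrix-vector example above, where $r_\pi=[0,1,1,1]^T$; with $\gamma=0.9$ they both recover $v_\pi=[9,10,10,10]^T$:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0, 1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1]], dtype=float)

# Closed form: v = (I - gamma * P)^{-1} r
v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Iterative: v_{k+1} = r + gamma * P v_k
v_iter = np.zeros(4)
for _ in range(200):
    v_iter = r_pi + gamma * P_pi @ v_iter

print(v_closed)  # [ 9. 10. 10. 10.]
print(v_iter)    # essentially the same values
```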

Action Value#

The action value of a state-action pair $(s,a)$ is defined as

$$q_{\pi}(s,a)\doteq\mathbb{E}[G_t|S_t=s,A_t=a].$$

It represents the expected return obtained after taking action $a$ at state $s$.

It depends on the state-action pair $(s,a)$ rather than just an action.

The relationship between action values and state values:

First, it follows from the properties of conditional expectation that

$$v_{\pi}(s)=\mathbb{E}[G_t|S_t=s]=\sum_{a\in\mathcal{A}}\underbrace{\mathbb{E}[G_t|S_t=s,A_t=a]}_{q_{\pi}(s,a)}\,\pi(a|s).$$

It then follows that $v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\,q_{\pi}(s,a)$. As a result, a state value is the expectation of the action values associated with that state.

Second, since the state value is given by

$$v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v_{\pi}(s')\right],$$

comparing the two expressions leads to

$$q_{\pi}(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v_{\pi}(s').$$

Example

$$\begin{aligned} q_{\pi}(s_1,a_1)&=-1+\gamma v_{\pi}(s_1),\\ q_{\pi}(s_1,a_3)&=0+\gamma v_{\pi}(s_3),\\ q_{\pi}(s_1,a_4)&=-1+\gamma v_{\pi}(s_1),\\ q_{\pi}(s_1,a_5)&=0+\gamma v_{\pi}(s_1). \end{aligned}$$
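
Given the model and a state-value vector, the action values follow directly from this formula. A small sketch (the array layout `rewards[s, a]` and `P[s, a, s']` is an assumption for illustration):

```python
import numpy as np

def action_values(s, v, rewards, P, gamma):
    """q(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v(s').

    rewards[s, a] : expected immediate reward of the pair (s, a)
    P[s, a, s']   : transition probabilities p(s'|s, a)
    """
    return rewards[s, :] + gamma * P[s, :, :] @ v
```

Calling `action_values` for every state gives the full q-table, from which $v_{\pi}(s)=\sum_a\pi(a|s)\,q_{\pi}(s,a)$ can be recovered.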

Bellman optimality equation#

Optimal#

The state value could be used to evaluate if a policy is good or not: if

$$v_{\pi_1}(s)\geq v_{\pi_2}(s)\quad\text{for all }s\in\mathcal{S},$$

then $\pi_1$ is “better” than $\pi_2$.

A policy $\pi^*$ is optimal if $v_{\pi^*}(s)\geq v_{\pi}(s)$ for all $s$ and for any other policy $\pi$.

For every $s\in\mathcal{S}$, the elementwise expression of the Bellman optimality equation (BOE) is

$$\begin{aligned} v(s)&=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\left(\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s')\right)\\ &=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\,q(s,a), \end{aligned}$$

where $v(s)$ and $v(s')$ are unknown variables to be solved and

$$q(s,a)\doteq\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s')$$

Bellman optimality equation (matrix-vector form):

$$v=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$$

where the elements corresponding to $s$ or $s'$ are

$$[r_{\pi}]_{s}\triangleq\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r,$$

$$[P_{\pi}]_{s,s'}=p(s'|s)\triangleq\sum_{a}\pi(a|s)\,p(s'|s,a).$$

Here $\max_{\pi}$ is performed elementwise.

Example: Consider two variables $x,a\in\mathbb{R}$. Suppose they satisfy

$$x=\max_{a}(2x-1-a^{2}).$$

This equation has two unknowns. To solve them, first consider the right-hand side: regardless of the value of $x$, $\max_{a}(2x-1-a^{2})=2x-1$, where the maximum is achieved at $a=0$. Second, when $a=0$, the equation becomes $x=2x-1$, which gives $x=1$. Therefore, $a=0$ and $x=1$ are the solution of the equation.

Example (how to solve $\max_{\pi}\sum_{a}\pi(a|s)q(s,a)$): Suppose $q_1,q_2,q_3\in\mathbb{R}$ are given. Find $c_1^*,c_2^*,c_3^*$ solving

$$\max_{c_1,c_2,c_3}\ c_1q_1+c_2q_2+c_3q_3,$$

where $c_1+c_2+c_3=1$ and $c_1,c_2,c_3\geq 0$.

Without loss of generality, suppose $q_3\geq q_1,q_2$. Then the optimal solution is $c_3^*=1$ and $c_1^*=c_2^*=0$. That is because for any $c_1,c_2,c_3$,

$$q_3=(c_1+c_2+c_3)q_3=c_1q_3+c_2q_3+c_3q_3\geq c_1q_1+c_2q_2+c_3q_3.$$

Maximization on the right hand side of BOE#

Fix $v(s')$ first and solve for $\pi$:

$$\begin{aligned} v(s)&=\max_{\pi}\sum_{a}\pi(a|s)\left(\sum_{r}p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v(s')\right),\quad\forall s\in\mathcal{S}\\ &=\max_{\pi}\sum_{a}\pi(a|s)\,q(s,a) \end{aligned}$$

Inspired by the example above, and considering that $\sum_{a}\pi(a|s)=1$, we have

$$\max_{\pi}\sum_{a}\pi(a|s)\,q(s,a)=\max_{a\in\mathcal{A}(s)}q(s,a),$$

where the optimality is achieved when

$$\pi(a|s)=\begin{cases}1, & a=a^*\\ 0, & a\neq a^*\end{cases}$$

where $a^*=\arg\max_{a}q(s,a)$.

Matrix vector form of the BOE#

The BOE refers to a set of equations defined for all states. If we combine these equations, we obtain a concise matrix-vector form, which will be used extensively in what follows.

The matrix-vector form of the BOE is

$$v=\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v),$$

where $v\in\mathbb{R}^{|\mathcal{S}|}$ and $\max_{\pi}$ is performed in an elementwise manner. The structures of $r_{\pi}$ and $P_{\pi}$ are the same as those in the matrix-vector form of the normal Bellman equation:

$$[r_{\pi}]_{s}\doteq\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)\,r,\qquad[P_{\pi}]_{s,s'}=p(s'|s)\doteq\sum_{a\in\mathcal{A}}\pi(a|s)\,p(s'|s,a)$$

Since the optimal value of $\pi$ is determined by $v$, the right-hand side of the equation above is a function of $v$, denoted as

$$f(v)\doteq\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v)$$

Then, the BOE can be expressed in the concise form

$$v=f(v)$$

In the remainder of this section, we show how to solve this nonlinear equation.

Contraction property of the BOE#

Theorem: The function

$$f(v)=\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v)$$

on the right-hand side of the BOE is a contraction mapping. In particular, for any $v_1,v_2\in\mathbb{R}^{|\mathcal{S}|}$, it holds that

$$\|f(v_1)-f(v_2)\|_{\infty}\leq\gamma\|v_1-v_2\|_{\infty},$$

where $\gamma\in(0,1)$ is the discount rate, and $\|\cdot\|_{\infty}$ is the maximum norm, i.e., the maximum absolute value of the elements of a vector.

Proof. Consider any two vectors $v_1,v_2\in\mathbb{R}^{|\mathcal{S}|}$, and suppose that

$$\pi_1^*\doteq\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_1)$$

and $\pi_2^*\doteq\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_2)$. Then,

$$\begin{aligned} f(v_1)&=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_1)=r_{\pi_1^*}+\gamma P_{\pi_1^*}v_1\geq r_{\pi_2^*}+\gamma P_{\pi_2^*}v_1,\\ f(v_2)&=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_2)=r_{\pi_2^*}+\gamma P_{\pi_2^*}v_2\geq r_{\pi_1^*}+\gamma P_{\pi_1^*}v_2. \end{aligned}$$

As a result,

$$\begin{aligned} f(v_1)-f(v_2)&=r_{\pi_1^*}+\gamma P_{\pi_1^*}v_1-(r_{\pi_2^*}+\gamma P_{\pi_2^*}v_2)\\ &\leq r_{\pi_1^*}+\gamma P_{\pi_1^*}v_1-(r_{\pi_1^*}+\gamma P_{\pi_1^*}v_2)\\ &=\gamma P_{\pi_1^*}(v_1-v_2). \end{aligned}$$

Similarly, it can be shown that $f(v_2)-f(v_1)\leq\gamma P_{\pi_2^*}(v_2-v_1)$. Therefore,

$$\gamma P_{\pi_2^*}(v_1-v_2)\leq f(v_1)-f(v_2)\leq\gamma P_{\pi_1^*}(v_1-v_2).$$

Define

$$z\doteq\max\{|\gamma P_{\pi_2^*}(v_1-v_2)|,\ |\gamma P_{\pi_1^*}(v_1-v_2)|\}\in\mathbb{R}^{|\mathcal{S}|},$$

where $\max(\cdot)$, $|\cdot|$, and $\geq$ are all elementwise operators. By definition, $z\geq 0$. On one hand, it is easy to see that

$$-z\leq\gamma P_{\pi_2^*}(v_1-v_2)\leq f(v_1)-f(v_2)\leq\gamma P_{\pi_1^*}(v_1-v_2)\leq z,$$

which implies

$$|f(v_1)-f(v_2)|\leq z.$$

It then follows that

$$\|f(v_1)-f(v_2)\|_{\infty}\leq\|z\|_{\infty},$$

where $\|\cdot\|_{\infty}$ is the maximum norm.

On the other hand, suppose that $z_i$ is the $i$th entry of $z$, and $p_i^T$ and $q_i^T$ are the $i$th rows of $P_{\pi_1^*}$ and $P_{\pi_2^*}$, respectively. Then,

$$z_i=\max\{\gamma|p_i^T(v_1-v_2)|,\ \gamma|q_i^T(v_1-v_2)|\}.$$

Since $p_i$ is a vector with all nonnegative elements whose sum equals one, it follows that

$$|p_i^T(v_1-v_2)|\leq p_i^T|v_1-v_2|\leq\|v_1-v_2\|_{\infty}.$$

Similarly, we have $|q_i^T(v_1-v_2)|\leq\|v_1-v_2\|_{\infty}$. Therefore, $z_i\leq\gamma\|v_1-v_2\|_{\infty}$, and hence

$$\|z\|_{\infty}=\max_{i}|z_i|\leq\gamma\|v_1-v_2\|_{\infty}.$$

Substituting this inequality into the bound above gives

$$\|f(v_1)-f(v_2)\|_{\infty}\leq\gamma\|v_1-v_2\|_{\infty},$$

which concludes the proof of the contraction property of $f(v)$.

Theorem (Existence, Uniqueness, and Algorithm)

For the BOE $v=f(v)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$, there always exists a solution $v^*$, and the solution is unique. The solution can be computed iteratively by

$$v_{k+1}=f(v_k)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_k)$$

This sequence $\{v_k\}$ converges to $v^*$ exponentially fast for any initial guess $v_0$. The convergence rate is determined by $\gamma$.
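
The iteration in this theorem is the value iteration algorithm. A minimal sketch under the usual assumptions (the expected-reward array `rewards[s, a]` and transition array `P[s, a, s']` are hypothetical inputs, not from the text):

```python
import numpy as np

def value_iteration(rewards, P, gamma, tol=1e-8):
    """Iterate v_{k+1} = max_pi (r_pi + gamma * P_pi v_k) until convergence.

    rewards[s, a] : expected immediate reward for the pair (s, a)
    P[s, a, s']   : transition probabilities p(s'|s, a)
    Returns the optimal state values and a greedy deterministic policy.
    """
    n_states, n_actions = rewards.shape
    v = np.zeros(n_states)
    while True:
        q = rewards + gamma * P @ v      # q[s, a] for the current v
        v_new = q.max(axis=1)            # elementwise maximization over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new
```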

Optimal policy#

Suppose $v^*$ is the solution to the Bellman optimality equation. It satisfies

$$v^*=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$$

Suppose

$$\pi^*=\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$$

Then

$$v^*=r_{\pi^*}+\gamma P_{\pi^*}v^*$$

Therefore, $\pi^*$ is a policy and $v^*=v_{\pi^*}$ is its corresponding state value.

Theorem (Optimality of $v^*$ and $\pi^*$)

The solution $v^*$ is the optimal state value, and $\pi^*$ is an optimal policy. That is, for any policy $\pi$, it holds that

$$v^*=v_{\pi^*}\geq v_{\pi},$$

where $v_{\pi}$ is the state value of $\pi$, and $\geq$ is an elementwise comparison.

Proof. For any policy $\pi$, it holds that

$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}.$$

Since

$$v^*=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)=r_{\pi^*}+\gamma P_{\pi^*}v^*\geq r_{\pi}+\gamma P_{\pi}v^*,$$

we have

$$v^*-v_{\pi}\geq(r_{\pi}+\gamma P_{\pi}v^*)-(r_{\pi}+\gamma P_{\pi}v_{\pi})=\gamma P_{\pi}(v^*-v_{\pi}).$$

Repeatedly applying the above inequality gives

$$v^*-v_{\pi}\geq\gamma P_{\pi}(v^*-v_{\pi})\geq\gamma^{2}P_{\pi}^{2}(v^*-v_{\pi})\geq\cdots\geq\gamma^{n}P_{\pi}^{n}(v^*-v_{\pi}).$$

It follows that

$$v^*-v_{\pi}\geq\lim_{n\rightarrow\infty}\gamma^{n}P_{\pi}^{n}(v^*-v_{\pi})=0,$$

where the last equality holds because $\gamma<1$ and $P_{\pi}^{n}$ is a nonnegative matrix with all its elements less than or equal to $1$ (because $P_{\pi}^{n}\mathbf{1}=\mathbf{1}$). Therefore, $v^*\geq v_{\pi}$ for any $\pi$.

Theorem (Greedy optimal policy)

For any $s\in\mathcal{S}$, the deterministic greedy policy

$$\pi^*(a|s)=\begin{cases}1, & a=a^*(s)\\ 0, & a\neq a^*(s)\end{cases}$$

is an optimal policy for solving the BOE. Here,

$$a^*(s)=\arg\max_{a}q^*(s,a),$$

where

$$q^*(s,a)\doteq\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v^*(s')$$

Proof.

While the matrix-vector form of the optimal policy is $\pi^*=\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$, its elementwise form is

$$\pi^*(s)=\arg\max_{\pi\in\Pi}\sum_{a\in\mathcal{A}}\pi(a|s)\underbrace{\left(\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v^*(s')\right)}_{q^*(s,a)},\quad s\in\mathcal{S}.$$

It is clear that $\sum_{a\in\mathcal{A}}\pi(a|s)\,q^*(s,a)$ is maximized if $\pi(s)$ selects the action with the greatest $q^*(s,a)$.
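
In code, extracting this greedy policy from a table of optimal action values is just an argmax per state; a tiny sketch with a hypothetical q-table:

```python
import numpy as np

q_star = np.array([[0.5, 2.0, 1.0],   # hypothetical q*(s, a): 2 states, 3 actions
                   [1.5, 0.2, 1.4]])

a_star = q_star.argmax(axis=1)        # a*(s) = argmax_a q*(s, a)
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(a_star)), a_star] = 1.0   # deterministic one-hot policy
print(pi_star)
# [[0. 1. 0.]
#  [1. 0. 0.]]
```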

Factors that influence optimal policies#

The BOE is a powerful tool for analyzing optimal policies. We next apply the BOE to study what factors can influence optimal policies. This question can be easily answered by observing the elementwise expression of the BOE:

$$v(s)=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\left(\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s')\right),\quad s\in\mathcal{S}.$$

The optimal state value and optimal policy are determined by the following parameters:

  1. the immediate reward $r$,

  2. the discount rate $\gamma$,

  3. the system model $p(s'|s,a)$, $p(r|s,a)$.

While the system model is fixed, we next discuss how the optimal policy varies when we change the values of $r$ and $\gamma$.

Impact of the discount rate#

(Figures: grid-world optimal policies under different discount rates. A larger $\gamma$ makes the optimal policy more far-sighted, while a smaller $\gamma$ makes it more short-sighted, caring mainly about near-future rewards.)

Impact of the reward values#

Theorem (Optimal policy invariance)

Consider a Markov decision process with $v^*\in\mathbb{R}^{|\mathcal{S}|}$ as the optimal state value satisfying

$$v^*=\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v^*).$$

If every reward $r\in\mathcal{R}$ is changed by an affine transformation to $\alpha r+\beta$, where $\alpha,\beta\in\mathbb{R}$ and $\alpha>0$, then the corresponding optimal state value $v'$ is also an affine transformation of $v^*$:

$$v'=\alpha v^*+\frac{\beta}{1-\gamma}\mathbf{1},$$

where $\gamma\in(0,1)$ is the discount rate and $\mathbf{1}=[1,\ldots,1]^T$.

Consequently, the optimal policy derived from $v'$ is invariant to the affine transformation of the reward values.

Proof. For any policy $\pi$, define $r_{\pi}=[\ldots,r_{\pi}(s),\ldots]^T$, where $r_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)\,r$ for $s\in\mathcal{S}$. If $r\to\alpha r+\beta$, then $r_{\pi}(s)\to\alpha r_{\pi}(s)+\beta$ and hence $r_{\pi}\to\alpha r_{\pi}+\beta\mathbf{1}$, where $\mathbf{1}=[1,\ldots,1]^T$. In this case, the BOE becomes

$$v'=\max_{\pi\in\Pi}(\alpha r_{\pi}+\beta\mathbf{1}+\gamma P_{\pi}v').$$

We next solve the new BOE by showing that $v'=\alpha v^*+c\mathbf{1}$ with $c=\beta/(1-\gamma)$ is a solution. In particular, substituting $v'=\alpha v^*+c\mathbf{1}$ gives

$$\alpha v^*+c\mathbf{1}=\max_{\pi\in\Pi}(\alpha r_{\pi}+\beta\mathbf{1}+\gamma P_{\pi}(\alpha v^*+c\mathbf{1}))=\max_{\pi\in\Pi}(\alpha r_{\pi}+\beta\mathbf{1}+\alpha\gamma P_{\pi}v^*+\gamma c\mathbf{1}),$$

where the last equality is due to the fact that $P_{\pi}\mathbf{1}=\mathbf{1}$. The above equation can be reorganized as $\alpha v^*=\max_{\pi\in\Pi}(\alpha r_{\pi}+\alpha\gamma P_{\pi}v^*)+\beta\mathbf{1}+\gamma c\mathbf{1}-c\mathbf{1}$, which is equivalent to

$$\beta\mathbf{1}+\gamma c\mathbf{1}-c\mathbf{1}=0.$$

Since $c=\beta/(1-\gamma)$, the above equation holds, and hence $v'=\alpha v^*+c\mathbf{1}$ is the solution.

Since the new equation is itself a BOE, $v'$ is also its unique solution. Finally, since $v'$ is an affine transformation of $v^*$, the relative relationships between the action values remain the same.

Hence, the greedy optimal policy derived from $v'$ is the same as that from $v^*$: $\arg\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v')$ is the same as $\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$.
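
This invariance can be checked numerically: scale and shift every reward, recompute the optimal values by value iteration, and the result matches $\alpha v^*+\frac{\beta}{1-\gamma}\mathbf{1}$ while the greedy policy stays the same. The tiny two-state model below is a made-up example, not from the text:

```python
import numpy as np

gamma = 0.9
rewards = np.array([[0.0, 1.0],           # hypothetical rewards[s, a]
                    [2.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # P[s, a, s']
              [[0.5, 0.5], [1.0, 0.0]]])

def optimal_values(r):
    v = np.zeros(2)
    for _ in range(2000):                 # plenty of iterations for gamma = 0.9
        v = (r + gamma * P @ v).max(axis=1)
    return v

alpha, beta = 2.0, 3.0
v_star = optimal_values(rewards)
v_prime = optimal_values(alpha * rewards + beta)

print(np.allclose(v_prime, alpha * v_star + beta / (1 - gamma)))   # True
greedy = lambda r, v: (r + gamma * P @ v).argmax(axis=1)
print(np.array_equal(greedy(rewards, v_star),
                     greedy(alpha * rewards + beta, v_prime)))     # True: same policy
```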
