Reinforcement Learning

Basic Concepts#

Law of Total Probability#

Discrete Random Variables#

  • Partition Conditions:

    • Events $B_1,B_2,\cdots,B_n$ form a partition of the sample space $\Omega$.

    • $B_i\cap B_j=\varnothing$ for $i\neq j$ ($i,j=1,2,\cdots,n$) (pairwise mutual exclusivity).

    • $\bigcup_{i=1}^{n}B_i=\Omega$ (the union covers the whole sample space).

  • Formula: For any event $A\subseteq\Omega$ and any random variable $X$,

$$P(A)=\sum_{i=1}^{n}P(B_i)\,P(A\mid B_i),\qquad E(X)=\sum_{i=1}^{n}P(B_i)\,E(X\mid B_i)$$
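
These identities are easy to verify numerically. A minimal sketch (the partition probabilities and conditional quantities below are made-up numbers, not from the text):

```python
import numpy as np

# Hypothetical partition B1, B2, B3 of the sample space.
P_B = np.array([0.2, 0.5, 0.3])          # P(B_i), sums to 1
P_A_given_B = np.array([0.1, 0.4, 0.8])  # P(A | B_i)
E_X_given_B = np.array([1.0, 2.0, 5.0])  # E(X | B_i)

P_A = np.sum(P_B * P_A_given_B)  # law of total probability  -> 0.46
E_X = np.sum(P_B * E_X_given_B)  # law of total expectation  -> 2.7
print(P_A, E_X)
```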

Continuous Random Variables#

  • Given Conditions:

    • $X$ is a continuous random variable with probability density function $f_X(x)$.

    • $Y$ is another random variable taking values $y$.

    • The conditional probability density function of $X$ given $Y=y$ is $f_{X|Y}(x|y)$.

    • The marginal probability density function of $Y$ is $f_Y(y)$.

  • Formula:

$$f_X(x)=\int_{-\infty}^{\infty}f_Y(y)\,f_{X|Y}(x|y)\,dy,\qquad E(X)=\int_{-\infty}^{\infty}f_Y(y)\,E(X\mid Y=y)\,dy$$

State and State space#

$s$ can include multi-dimensional information, such as velocity, level, temperature, and other attributes of the agent.

  • State: The status of the agent with respect to the environment.

  • State space: The set of all states, denoted as $\mathcal{S}=\{s_i\}_{i=1}^{9}$.


Action and Action space#

  • Action: For each state, there are five possible actions, $a_1,\cdots,a_5$:

    • $a_1$: move upwards; $a_2$: move rightwards;

    • $a_3$: move downwards; $a_4$: move leftwards;

    • $a_5$: stay unchanged.

  • Action space: the set of all possible actions at a state, $\mathcal{A}(s_i)=\{a_i\}_{i=1}^{5}$.

State Transition Probability#

State transition defines the interaction with the environment.

  • State Transition: When an agent takes an action and moves from one state to another, this process is called state transition.

  • If action $a_2$ (move rightwards) is chosen at state $s_1$, the next state is $s_2$, denoted as $s_1\xrightarrow{a_2}s_2$.

  • If action $a_1$ (move upwards) is chosen at state $s_1$, the next state remains $s_1$, denoted as $s_1\xrightarrow{a_1}s_1$.

  • State Transition Probability: use probability to describe state transition.

    • $p(s_2|s_1,a_2)=1$: the probability of transitioning from state $s_1$ to state $s_2$ when action $a_2$ is taken is $1$.

    • $p(s_i|s_1,a_2)=0$ for all $i\neq 2$: the probability of transitioning from $s_1$ to any state other than $s_2$ when taking action $a_2$ is $0$.

  • This is a deterministic case. State transition can also be stochastic, for example $p(s_1|s_1,a_2)=0.5$ and $p(s_5|s_1,a_2)=0.5$.
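
In code, the transition probabilities $p(s'|s,a)$ can be stored as a simple lookup table. The sketch below is illustrative (the state and action names, and the stochastic entry, are assumptions, not part of the original text):

```python
# p[(s, a)] maps each possible next state s' to p(s'|s, a).
p = {
    ("s1", "a1"): {"s1": 1.0},             # deterministic: bumping the boundary keeps the agent at s1
    ("s1", "a2"): {"s2": 1.0},             # deterministic: p(s2|s1, a2) = 1
    ("s4", "a2"): {"s4": 0.5, "s5": 0.5},  # a stochastic transition (illustrative numbers)
}

def transition_prob(s, a, s_next):
    """Return p(s'|s, a); transitions not listed have probability 0."""
    return p.get((s, a), {}).get(s_next, 0.0)

print(transition_prob("s1", "a2", "s2"))  # 1.0
print(transition_prob("s1", "a2", "s3"))  # 0.0
```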

Policy#

  • Definition: a policy tells the agent which actions to take in a given state.

  • Intuitive Representation: arrows in a grid-like illustration can be used to demonstrate a policy, with the arrows showing the recommended action in each state.

  • Mathematical Representation: use conditional probability (take state $s_1$ as an example). The illustration compares a deterministic policy (left) with a stochastic policy (right).

Left: $\pi(a_1|s_1)=0,~~\pi(a_2|s_1)=1,~~\pi(a_3|s_1)=0,~~\pi(a_4|s_1)=0,~~\pi(a_5|s_1)=0.$

Right: $\pi(a_1|s_1)=0,~~\pi(a_2|s_1)=0.5,~~\pi(a_3|s_1)=0.5,~~\pi(a_4|s_1)=0,~~\pi(a_5|s_1)=0.$
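
A policy can be stored the same way, as a table of probabilities $\pi(a|s)$, and actions can be sampled from it. A minimal sketch using the stochastic policy at $s_1$ above:

```python
import random

# pi[s][a] = pi(a|s); the stochastic policy at s1 from the example above.
pi = {
    "s1": {"a1": 0.0, "a2": 0.5, "a3": 0.5, "a4": 0.0, "a5": 0.0},
}

def sample_action(s):
    """Draw an action with probability pi(a|s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s1"))  # "a2" or "a3", each with probability 0.5
```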

Reward#

  • Definition: It is a real number obtained after an agent takes an action.

    • A positive reward encourages taking such actions, while a negative reward punishes taking them.

    • Zero reward implies no punishment. In some cases, a positive value can even represent punishment.

  • Representation Method: use conditional probability for the mathematical description.

  • Example at state $s_1$: $p(r=-1|s_1,a_1)=1$ and $p(r\neq-1|s_1,a_1)=0$.

  • Role of Reward: reward can be regarded as a human-machine interface. It is used to guide the agent to behave as expected.

    • Grid-world example rewards:

      • If the agent tries to exit the boundary, $r_{\text{bound}}=-1$.

      • If the agent attempts to enter a forbidden cell, $r_{\text{forbid}}=-1$.

      • When the agent reaches the target cell, $r_{\text{target}}=+1$.

      • Otherwise, the agent gets a reward of $r=0$.

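
The reward rules above translate directly into a small function. The helper predicates (`exits_boundary`, `is_forbidden`, `is_target`) are assumptions standing in for the concrete grid layout:

```python
R_BOUND, R_FORBID, R_TARGET, R_DEFAULT = -1.0, -1.0, 1.0, 0.0

def reward(s, a, s_next, exits_boundary, is_forbidden, is_target):
    """Reward for taking action a in state s and landing in s_next.

    The three predicate functions are assumed to be supplied by the grid layout.
    """
    if exits_boundary(s, a):
        return R_BOUND       # tried to leave the grid
    if is_forbidden(s_next):
        return R_FORBID      # entered a forbidden cell
    if is_target(s_next):
        return R_TARGET      # reached the target cell
    return R_DEFAULT
```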

Trajectory and Return#

  • Trajectory: A trajectory is a state-action-reward chain. It shows the path an agent takes: the sequence of states it visits, actions it takes, and rewards it gets.

  • Return: Also called total rewards or cumulative rewards. It’s the sum of all the rewards collected along a trajectory.


  • Example (left): starting from $s_1$,

    • The trajectory: $s_1\xrightarrow[a_2]{r=0}s_2\xrightarrow[a_3]{r=0}s_5\xrightarrow[a_3]{r=0}s_8\xrightarrow[a_2]{r=1}s_9$.

    • Return $=0+0+0+1=1$.

  • Example (right): starting from $s_1$,

    • The trajectory: $s_1\xrightarrow[a_3]{r=0}s_4\xrightarrow[a_3]{r=-1}s_7\xrightarrow[a_2]{r=0}s_8\xrightarrow[a_2]{r=1}s_9$.

    • Return $=0-1+0+1=0$.

Infinite Trajectories and Divergence Problem#

  • Infinite Trajectory: suppose a policy generates an infinitely long trajectory such as
$$s_1\xrightarrow[a_2]{r=0}s_2\xrightarrow[a_3]{r=0}s_5\xrightarrow[a_3]{r=0}s_8\xrightarrow[a_2]{r=1}s_9\xrightarrow[a_5]{r=1}s_9\xrightarrow[a_5]{r=1}s_9\cdots$$

If we calculate the return as the direct sum of the rewards, we get $0+0+0+1+1+1+\cdots=\infty$.

This makes it impossible to properly evaluate the policy using this simple sum.

To deal with infinitely long trajectories, we introduce the concept of discounted return.

$$\text{Discounted return}=0+\gamma\cdot 0+\gamma^{2}\cdot 0+\gamma^{3}\cdot 1+\gamma^{4}\cdot 1+\gamma^{5}\cdot 1+\cdots=\gamma^{3}\frac{1}{1-\gamma},$$

where $\gamma\in(0,1)$ is the discount rate.

  • If $\gamma$ is close to $0$, the discounted return is mainly determined by the rewards received in the near future.
  • If $\gamma$ is close to $1$, the discounted return is mainly determined by the rewards received in the far future.
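
A short sketch for computing discounted returns; truncating the infinite trajectory above after enough steps approaches the closed-form value $\gamma^{3}\frac{1}{1-\gamma}$:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * rewards[t]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
rewards = [0, 0, 0] + [1] * 997   # the trajectory above, truncated to 1000 steps
print(discounted_return(rewards, gamma))  # ~7.29
print(gamma ** 3 / (1 - gamma))           # 7.29, the closed-form value
```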

Markov decision process#

Key elements of MDP:

  • Sets:

    • State: the set of states $\mathcal{S}$.

    • Action: the set of actions $\mathcal{A}(s)$ associated with each state $s\in\mathcal{S}$.

    • Reward: the set of rewards $\mathcal{R}(s,a)$.

  • Probability distribution:

    • State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s,a)$.

    • Reward probability: at state $s$, taking action $a$, the probability of getting reward $r$ is $p(r|s,a)$.

  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a|s)$.

  • Markov property: memoryless property

    • $p(s_{t+1}|a_{t+1},s_t,\dots,a_1,s_0)=p(s_{t+1}|a_{t+1},s_t)$

    • $p(r_{t+1}|a_{t+1},s_t,\dots,a_1,s_0)=p(r_{t+1}|a_{t+1},s_t)$

All the concepts introduced in this lecture can be put into the framework of an MDP.

State Values and Bellman Equation#

Why are returns important?#

In fact, returns play a fundamental role in reinforcement learning since they can evaluate whether a policy is good or not. Consider three policies that behave differently at state $s_1$.

  • Following the first policy, the trajectory is $s_1\rightarrow s_3\rightarrow s_4\rightarrow s_4\cdots$. The corresponding discounted return is
$$\begin{aligned} \text{return}_1&=0+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots\\ &=\gamma(1+\gamma+\gamma^{2}+\cdots)\\ &=\frac{\gamma}{1-\gamma}. \end{aligned}$$
  • Following the second policy, the trajectory is $s_1\rightarrow s_2\rightarrow s_4\rightarrow s_4\cdots$. The discounted return is
$$\begin{aligned} \text{return}_2&=-1+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots\\ &=-1+\gamma(1+\gamma+\gamma^{2}+\cdots)\\ &=-1+\frac{\gamma}{1-\gamma}. \end{aligned}$$
  • Following the third policy, two trajectories can possibly be obtained. One is $s_1\rightarrow s_3\rightarrow s_4\rightarrow s_4\cdots$, and the other is $s_1\rightarrow s_2\rightarrow s_4\rightarrow s_4\cdots$. Each of the two trajectories has probability $0.5$. Then, the average return obtained starting from $s_1$ is
$$\begin{aligned} \text{return}_3&=0.5\left(-1+\frac{\gamma}{1-\gamma}\right)+0.5\left(\frac{\gamma}{1-\gamma}\right)\\ &=-0.5+\frac{\gamma}{1-\gamma}. \end{aligned}$$

By comparing the returns of the three policies, we notice that

$$\text{return}_1>\text{return}_3>\text{return}_2.$$
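
As a quick numerical check, evaluating the three closed-form returns at, say, $\gamma=0.9$ reproduces this ordering:

```python
gamma = 0.9
return_1 = gamma / (1 - gamma)                # 9.0
return_2 = -1 + gamma / (1 - gamma)           # 8.0
return_3 = 0.5 * return_2 + 0.5 * return_1    # 8.5
print(return_1 > return_3 > return_2)         # True
```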

How to calculate returns?#

A return equals the discounted sum of all the rewards collected along a trajectory. Let $v_i$ denote the return obtained by starting from $s_i$ for $i=1,2,3,4$.

By definition#

Let $v_i$ denote the return obtained starting from $s_i$ ($i=1,2,3,4$):

$$\begin{aligned} v_1 &= r_1+\gamma r_2+\gamma^{2}r_3+\cdots;\\ v_2 &= r_2+\gamma r_3+\gamma^{2}r_4+\cdots;\\ v_3 &= r_3+\gamma r_4+\gamma^{2}r_1+\cdots;\\ v_4 &= r_4+\gamma r_1+\gamma^{2}r_2+\cdots. \end{aligned}$$

By substitution#

$$\begin{aligned} v_1 &= r_1+\gamma(r_2+\gamma r_3+\cdots)=r_1+\gamma v_2;\\ v_2 &= r_2+\gamma(r_3+\gamma r_4+\cdots)=r_2+\gamma v_3;\\ v_3 &= r_3+\gamma(r_4+\gamma r_1+\cdots)=r_3+\gamma v_4;\\ v_4 &= r_4+\gamma(r_1+\gamma r_2+\cdots)=r_4+\gamma v_1. \end{aligned}$$

Write these in the following matrix-vector form:

$$\begin{bmatrix} v_1\\ v_2\\ v_3\\ v_4 \end{bmatrix} = \begin{bmatrix} r_1\\ r_2\\ r_3\\ r_4 \end{bmatrix} + \begin{bmatrix} \gamma v_2\\ \gamma v_3\\ \gamma v_4\\ \gamma v_1 \end{bmatrix} = \begin{bmatrix} r_1\\ r_2\\ r_3\\ r_4 \end{bmatrix} + \gamma \begin{bmatrix} 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} v_1\\ v_2\\ v_3\\ v_4 \end{bmatrix}$$

which can be rewritten as $\mathbf{v}=\mathbf{r}+\gamma\mathbf{P}\mathbf{v}$. We can solve for $\mathbf{v}$ as $\mathbf{v}=(I-\gamma\mathbf{P})^{-1}\mathbf{r}$.
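
With NumPy, this closed-form solution is a one-liner. The cyclic $\mathbf{P}$ is taken from the example above; the reward values are placeholders, since $r_1,\dots,r_4$ are left symbolic in the text:

```python
import numpy as np

gamma = 0.9
P = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)   # the cyclic transition structure above
r = np.array([0.0, 0.0, 0.0, 1.0])          # placeholder values for r1..r4

v = np.linalg.solve(np.eye(4) - gamma * P, r)   # v = (I - gamma P)^{-1} r
print(v)
```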

State value#

Single step Process#

  • Elements

    • $t, t+1$: discrete time instants.

    • $S_t$: the state at time $t$.

    • $A_t$: the action taken at state $S_t$.

    • $R_{t+1}$: the reward obtained after taking $A_t$.

    • $S_{t+1}$: the state transitioned to after taking $A_t$.

    • Note that $S_t$, $A_t$, $R_{t+1}$ are all random variables.

  • Probability Distributions

    • $S_t\rightarrow A_t$ is governed by $\pi(A_t=a|S_t=s)$.

    • $S_t,A_t\rightarrow R_{t+1}$ is governed by $p(R_{t+1}=r|S_t=s,A_t=a)$.

    • $S_t,A_t\rightarrow S_{t+1}$ is governed by $p(S_{t+1}=s'|S_t=s,A_t=a)$.

    • Assume we know the model (i.e., the probability distributions).

Multi step Trajectory#

  • Discounted Return ($G_t$)

    • Formula: $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\cdots$, where $\gamma\in[0,1)$ is the discount rate.

    • $G_t$ is a random variable since $R_{t+1},R_{t+2},\cdots$ are random variables.

State value Function#

  • Definition

    • $v_{\pi}(s)=\mathbb{E}[G_t|S_t=s]$, which is the expectation (expected value or mean) of $G_t$.
  • Remarks

    • It is a function of $s$: a conditional expectation given that the state starts from $s$.

    • It depends on the policy $\pi$. Different policies may result in different state values.

    • It represents the “value” of a state. A larger state value indicates a better policy, as it implies that greater cumulative rewards can be obtained.

  • Relationship between Return and State Value

    • The state value is the mean of all possible returns starting from a state. If $\pi(a|s)$, $p(r|s,a)$, and $p(s'|s,a)$ are deterministic, then the state value is the same as the return.
  • Example

$$\begin{aligned} v_{\pi_1}(s_1)&=0+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots=\gamma(1+\gamma+\gamma^{2}+\cdots)=\frac{\gamma}{1-\gamma}\\ v_{\pi_2}(s_1)&=-1+\gamma\cdot 1+\gamma^{2}\cdot 1+\cdots=-1+\gamma(1+\gamma+\gamma^{2}+\cdots)=-1+\frac{\gamma}{1-\gamma}\\ v_{\pi_3}(s_1)&=0.5\left(-1+\frac{\gamma}{1-\gamma}\right)+0.5\left(\frac{\gamma}{1-\gamma}\right)=-0.5+\frac{\gamma}{1-\gamma} \end{aligned}$$

Bellman equation#

Consider a random trajectory $S_t\xrightarrow{A_t}R_{t+1},\ S_{t+1}\xrightarrow{A_{t+1}}R_{t+2},\ S_{t+2}\xrightarrow{A_{t+2}}R_{t+3},\cdots$

The return $G_t$ can be written as

$$\begin{aligned} G_t&=R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\cdots\\ &=R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+\cdots)\\ &=R_{t+1}+\gamma G_{t+1} \end{aligned}$$

Then, it follows from the definition of the state value that

$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[G_t|S_t=s]\\ &=\mathbb{E}[R_{t+1}+\gamma G_{t+1}|S_t=s]\\ &=\mathbb{E}[R_{t+1}|S_t=s]+\gamma\,\mathbb{E}[G_{t+1}|S_t=s] \end{aligned}$$

First, calculate the first term $\mathbb{E}[R_{t+1}|S_t=s]$:

$$\begin{aligned} \mathbb{E}[R_{t+1}|S_t=s]&=\sum_{a}\pi(a|s)\,\mathbb{E}[R_{t+1}|S_t=s,A_t=a]&&\text{(condition on each action the policy may choose at the current state)}\\ &=\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r&&\text{(probability of each reward given the current state and action)} \end{aligned}$$

Second, calculate the second term $\mathbb{E}[G_{t+1}|S_t=s]$:

$$\begin{aligned} \mathbb{E}[G_{t+1}|S_t=s]&=\sum_{s'}\mathbb{E}[G_{t+1}|S_t=s,S_{t+1}=s']\,p(s'|s)&&\text{(condition on each possible next state)}\\ &=\sum_{s'}\mathbb{E}[G_{t+1}|S_{t+1}=s']\,p(s'|s)&&\text{(Markov property)}\\ &=\sum_{s'}v_{\pi}(s')\,p(s'|s)\\ &=\sum_{s'}v_{\pi}(s')\sum_{a}p(s'|s,a)\,\pi(a|s)&&\text{(expand }p(s'|s)\text{ over the actions chosen by the policy)} \end{aligned}$$

Therefore, we have

$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[R_{t+1}|S_t=s]+\gamma\,\mathbb{E}[G_{t+1}|S_t=s]\\ &=\underbrace{\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r}_{\text{mean of immediate rewards}}+\gamma\underbrace{\sum_{a}\pi(a|s)\sum_{s'}p(s'|s,a)\,v_{\pi}(s')}_{\text{mean of future rewards}}\\ &=\sum_{a}\pi(a|s)\left[\sum_{r}p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v_{\pi}(s')\right],\quad\forall s\in\mathcal{S}. \end{aligned}$$

Every state has an equation of this form!

Example 1

Consider the state value of $s_1$:

  • $\pi(a=a_3|s_1)=1$ and $\pi(a\neq a_3|s_1)=0$.

  • $p(s'=s_3|s_1,a_3)=1$ and $p(s'\neq s_3|s_1,a_3)=0$.

  • $p(r=0|s_1,a_3)=1$ and $p(r\neq 0|s_1,a_3)=0$.

Substituting them into the Bellman equation, and proceeding similarly for the other states, we have

$$\begin{aligned} v_{\pi}(s_1)&=0+\gamma v_{\pi}(s_3),\quad v_{\pi}(s_2)=1+\gamma v_{\pi}(s_4),\\ v_{\pi}(s_3)&=1+\gamma v_{\pi}(s_4),\quad v_{\pi}(s_4)=1+\gamma v_{\pi}(s_4). \end{aligned}$$

Solve the above equations one by one from the last to the first:

$$\begin{aligned} v_{\pi}(s_4)&=\frac{1}{1-\gamma},\quad v_{\pi}(s_3)=\frac{1}{1-\gamma},\\ v_{\pi}(s_2)&=\frac{1}{1-\gamma},\quad v_{\pi}(s_1)=\frac{\gamma}{1-\gamma}. \end{aligned}$$

Substituting $\gamma=0.9$ yields

$$v_{\pi}(s_4)=10,~v_{\pi}(s_3)=10,~v_{\pi}(s_2)=10,~v_{\pi}(s_1)=9.$$

Example 2

We have

$$\begin{aligned} v_{\pi}(s_1)&=0.5[0+\gamma v_{\pi}(s_3)]+0.5[-1+\gamma v_{\pi}(s_2)],\quad v_{\pi}(s_2)=1+\gamma v_{\pi}(s_4),\\ v_{\pi}(s_3)&=1+\gamma v_{\pi}(s_4),\quad v_{\pi}(s_4)=1+\gamma v_{\pi}(s_4). \end{aligned}$$

Solve the above equations one by one from the last to the first:

$$\begin{aligned} v_{\pi}(s_4)&=\frac{1}{1-\gamma},\quad v_{\pi}(s_3)=\frac{1}{1-\gamma},\quad v_{\pi}(s_2)=\frac{1}{1-\gamma},\\ v_{\pi}(s_1)&=0.5[0+\gamma v_{\pi}(s_3)]+0.5[-1+\gamma v_{\pi}(s_2)]=-0.5+\frac{\gamma}{1-\gamma}. \end{aligned}$$

Substituting $\gamma=0.9$ yields

$$v_{\pi}(s_4)=10,~v_{\pi}(s_3)=10,~v_{\pi}(s_2)=10,~v_{\pi}(s_1)=-0.5+9=8.5.$$

Matrix vector form#

Recall that:

$$v_{\pi}(s)=\sum_{a}\pi(a|s)\left[\sum_{r}p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v_{\pi}(s')\right]$$

Rewrite the Bellman equation as

$$v_{\pi}(s)=r_{\pi}(s)+\gamma\sum_{s'}p_{\pi}(s'|s)\,v_{\pi}(s')$$

where

$$r_{\pi}(s)\triangleq\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r,\qquad p_{\pi}(s'|s)\triangleq\sum_{a}\pi(a|s)\,p(s'|s,a)$$

Suppose the states are indexed as $s_i$ ($i=1,\cdots,n$).

For state $s_i$, the Bellman equation is

$$v_{\pi}(s_i)=r_{\pi}(s_i)+\gamma\sum_{s_j}p_{\pi}(s_j|s_i)\,v_{\pi}(s_j)$$

Putting the equations for all states together and rewriting them in matrix-vector form gives

$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$$

where

  • $v_{\pi}=[v_{\pi}(s_1),\cdots,v_{\pi}(s_n)]^{T}\in\mathbb{R}^{n}$

  • $r_{\pi}=[r_{\pi}(s_1),\cdots,r_{\pi}(s_n)]^{T}\in\mathbb{R}^{n}$

  • $P_{\pi}\in\mathbb{R}^{n\times n}$, where $[P_{\pi}]_{ij}=p_{\pi}(s_j|s_i)$, is the state transition matrix

If there are four states, $v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$ can be written out as

$$\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix} = \begin{bmatrix} r_{\pi}(s_1)\\ r_{\pi}(s_2)\\ r_{\pi}(s_3)\\ r_{\pi}(s_4) \end{bmatrix} +\gamma \begin{bmatrix} p_{\pi}(s_1|s_1)&p_{\pi}(s_2|s_1)&p_{\pi}(s_3|s_1)&p_{\pi}(s_4|s_1)\\ p_{\pi}(s_1|s_2)&p_{\pi}(s_2|s_2)&p_{\pi}(s_3|s_2)&p_{\pi}(s_4|s_2)\\ p_{\pi}(s_1|s_3)&p_{\pi}(s_2|s_3)&p_{\pi}(s_3|s_3)&p_{\pi}(s_4|s_3)\\ p_{\pi}(s_1|s_4)&p_{\pi}(s_2|s_4)&p_{\pi}(s_3|s_4)&p_{\pi}(s_4|s_4) \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}$$

Example: for the policy from Example 1 above,

$$\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix} = \begin{bmatrix} 0\\ 1\\ 1\\ 1 \end{bmatrix} +\gamma \begin{bmatrix} 0&0&1&0\\ 0&0&0&1\\ 0&0&0&1\\ 0&0&0&1 \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}$$

Example: for the policy from Example 2 above,

$$\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix} = \begin{bmatrix} 0.5(0)+0.5(-1)\\ 1\\ 1\\ 1 \end{bmatrix} +\gamma \begin{bmatrix} 0&0.5&0.5&0\\ 0&0&0&1\\ 0&0&0&1\\ 0&0&0&1 \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}$$

Solve state values#

The Bellman equation in matrix vector form is

$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}\ \Longrightarrow\ v_{\pi}=(I-\gamma P_{\pi})^{-1}r_{\pi}$$

In practice, we still need to use numerical tools to calculate the matrix inverse.

Can we avoid the matrix inverse operation? Yes, by iterative algorithms.

An iterative solution is:

$$v_{k+1}=r_{\pi}+\gamma P_{\pi}v_{k}.$$

This algorithm generates a sequence $\{v_0,v_1,v_2,\cdots\}$. We can show that

$$v_{k}\rightarrow v_{\pi}=(I-\gamma P_{\pi})^{-1}r_{\pi},\quad k\rightarrow\infty.$$

Proof. Define the error as $\delta_{k}=v_{k}-v_{\pi}$. We only need to show $\delta_{k}\rightarrow 0$. Substituting $v_{k+1}=\delta_{k+1}+v_{\pi}$ and $v_{k}=\delta_{k}+v_{\pi}$ into $v_{k+1}=r_{\pi}+\gamma P_{\pi}v_{k}$ gives

$$\delta_{k+1}+v_{\pi}=r_{\pi}+\gamma P_{\pi}(\delta_{k}+v_{\pi}),$$

which can be rewritten as

$$\delta_{k+1}=-v_{\pi}+r_{\pi}+\gamma P_{\pi}\delta_{k}+\gamma P_{\pi}v_{\pi}=\gamma P_{\pi}\delta_{k}.$$

As a result,

$$\delta_{k+1}=\gamma P_{\pi}\delta_{k}=\gamma^{2}P_{\pi}^{2}\delta_{k-1}=\cdots=\gamma^{k+1}P_{\pi}^{k+1}\delta_{0}.$$

Note that $0\leq P_{\pi}^{k}\leq 1$ elementwise, which means every entry of $P_{\pi}^{k}$ is no greater than $1$ for any $k=0,1,2,\cdots$.

That is because $P_{\pi}^{k}\mathbf{1}=\mathbf{1}$, where $\mathbf{1}=[1,\cdots,1]^{T}$. On the other hand, since $\gamma<1$, we know $\gamma^{k}\rightarrow 0$, and hence $\delta_{k+1}=\gamma^{k+1}P_{\pi}^{k+1}\delta_{0}\rightarrow 0$ as $k\rightarrow\infty$.
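
Both solution methods can be checked on the first matrix-vector example above, where $r_\pi=[0,1,1,1]^T$; with $\gamma=0.9$ they both recover $v_\pi=[9,10,10,10]^T$:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0, 1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1]], dtype=float)

# Closed form: v = (I - gamma * P)^{-1} r
v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Iterative: v_{k+1} = r + gamma * P v_k
v_iter = np.zeros(4)
for _ in range(200):
    v_iter = r_pi + gamma * P_pi @ v_iter

print(v_closed)  # [ 9. 10. 10. 10.]
print(v_iter)    # essentially the same values
```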

Action Value#

The action value of a state-action pair $(s,a)$ is defined as

$$q_{\pi}(s,a)\doteq\mathbb{E}[G_t|S_t=s,A_t=a].$$

It represents the expected return obtained after taking action $a$ at state $s$.

It depends on the state-action pair $(s,a)$ rather than just an action.

The relationship between action values and state values:

First, it follows from the properties of conditional expectation that

$$v_{\pi}(s)=\mathbb{E}[G_t|S_t=s]=\sum_{a\in\mathcal{A}}\underbrace{\mathbb{E}[G_t|S_t=s,A_t=a]}_{q_{\pi}(s,a)}\,\pi(a|s).$$

It then follows that $v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\,q_{\pi}(s,a)$. As a result, a state value is the expectation of the action values associated with that state.

Second, since the state value is given by

$$v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v_{\pi}(s')\right],$$

comparing the two expressions leads to

$$q_{\pi}(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v_{\pi}(s').$$

Example

$$\begin{aligned} q_{\pi}(s_1,a_1)&=-1+\gamma v_{\pi}(s_1),\\ q_{\pi}(s_1,a_3)&=0+\gamma v_{\pi}(s_3),\\ q_{\pi}(s_1,a_4)&=-1+\gamma v_{\pi}(s_1),\\ q_{\pi}(s_1,a_5)&=0+\gamma v_{\pi}(s_1). \end{aligned}$$
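
Given the model and a state-value vector, the action values follow directly from this formula. A small sketch (the array layout `rewards[s, a]` and `P[s, a, s']` is an assumption for illustration):

```python
import numpy as np

def action_values(s, v, rewards, P, gamma):
    """q(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v(s').

    rewards[s, a] : expected immediate reward of the pair (s, a)
    P[s, a, s']   : transition probabilities p(s'|s, a)
    """
    return rewards[s, :] + gamma * P[s, :, :] @ v
```

Calling `action_values` for every state gives the full q-table, from which $v_{\pi}(s)=\sum_a\pi(a|s)\,q_{\pi}(s,a)$ can be recovered.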

Bellman optimality equation#

Optimal#

The state value could be used to evaluate if a policy is good or not: if

$$v_{\pi_1}(s)\geq v_{\pi_2}(s)\quad\text{for all }s\in\mathcal{S},$$

then $\pi_1$ is “better” than $\pi_2$.

A policy $\pi^*$ is optimal if $v_{\pi^*}(s)\geq v_{\pi}(s)$ for all $s$ and for any other policy $\pi$.

For every $s\in\mathcal{S}$, the elementwise expression of the Bellman optimality equation (BOE) is

$$\begin{aligned} v(s)&=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\left(\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s')\right)\\ &=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\,q(s,a), \end{aligned}$$

where $v(s)$ and $v(s')$ are unknown variables to be solved and

$$q(s,a)\doteq\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s')$$

Bellman optimality equation (matrix-vector form):

$$v=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$$

where the elements corresponding to $s$ or $s'$ are

$$[r_{\pi}]_{s}\triangleq\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r,$$

$$[P_{\pi}]_{s,s'}=p(s'|s)\triangleq\sum_{a}\pi(a|s)\,p(s'|s,a).$$

Here $\max_{\pi}$ is performed elementwise.

Example: Consider two variables $x,a\in\mathbb{R}$. Suppose they satisfy

$$x=\max_{a}(2x-1-a^{2}).$$

This equation has two unknowns. To solve them, first consider the right-hand side: regardless of the value of $x$, $\max_{a}(2x-1-a^{2})=2x-1$, where the maximum is achieved at $a=0$. Second, when $a=0$, the equation becomes $x=2x-1$, which gives $x=1$. Therefore, $a=0$ and $x=1$ are the solution of the equation.

Example (how to solve $\max_{\pi}\sum_{a}\pi(a|s)q(s,a)$): Suppose $q_1,q_2,q_3\in\mathbb{R}$ are given. Find $c_1^*,c_2^*,c_3^*$ solving

$$\max_{c_1,c_2,c_3}\ c_1q_1+c_2q_2+c_3q_3,$$

where $c_1+c_2+c_3=1$ and $c_1,c_2,c_3\geq 0$.

Without loss of generality, suppose $q_3\geq q_1,q_2$. Then the optimal solution is $c_3^*=1$ and $c_1^*=c_2^*=0$. That is because for any $c_1,c_2,c_3$,

$$q_3=(c_1+c_2+c_3)q_3=c_1q_3+c_2q_3+c_3q_3\geq c_1q_1+c_2q_2+c_3q_3.$$

Maximization on the right hand side of BOE#

Fix $v(s')$ first and solve for $\pi$:

$$\begin{aligned} v(s)&=\max_{\pi}\sum_{a}\pi(a|s)\left(\sum_{r}p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v(s')\right),\quad\forall s\in\mathcal{S}\\ &=\max_{\pi}\sum_{a}\pi(a|s)\,q(s,a) \end{aligned}$$

Inspired by the example above, and considering that $\sum_{a}\pi(a|s)=1$, we have

$$\max_{\pi}\sum_{a}\pi(a|s)\,q(s,a)=\max_{a\in\mathcal{A}(s)}q(s,a),$$

where the optimality is achieved when

$$\pi(a|s)=\begin{cases}1, & a=a^*\\ 0, & a\neq a^*\end{cases}$$

where $a^*=\arg\max_{a}q(s,a)$.

Matrix vector form of the BOE#

The BOE refers to a set of equations defined for all states. If we combine these equations, we obtain a concise matrix-vector form, which will be used extensively in what follows.

The matrix-vector form of the BOE is

$$v=\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v),$$

where $v\in\mathbb{R}^{|\mathcal{S}|}$ and $\max_{\pi}$ is performed in an elementwise manner. The structures of $r_{\pi}$ and $P_{\pi}$ are the same as those in the matrix-vector form of the normal Bellman equation:

$$[r_{\pi}]_{s}\doteq\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)\,r,\qquad[P_{\pi}]_{s,s'}=p(s'|s)\doteq\sum_{a\in\mathcal{A}}\pi(a|s)\,p(s'|s,a)$$

Since the optimal value of $\pi$ is determined by $v$, the right-hand side of the equation above is a function of $v$, denoted as

$$f(v)\doteq\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v)$$

Then, the BOE can be expressed in the concise form

$$v=f(v)$$

In the remainder of this section, we show how to solve this nonlinear equation.

Contraction property of the BOE#

Theorem: The function

$$f(v)=\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v)$$

on the right-hand side of the BOE is a contraction mapping. In particular, for any $v_1,v_2\in\mathbb{R}^{|\mathcal{S}|}$, it holds that

$$\|f(v_1)-f(v_2)\|_{\infty}\leq\gamma\|v_1-v_2\|_{\infty},$$

where $\gamma\in(0,1)$ is the discount rate, and $\|\cdot\|_{\infty}$ is the maximum norm, i.e., the maximum absolute value of the elements of a vector.

Proof. Consider any two vectors $v_1,v_2\in\mathbb{R}^{|\mathcal{S}|}$, and suppose that

$$\pi_1^*\doteq\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_1)$$

and $\pi_2^*\doteq\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_2)$. Then,

$$\begin{aligned} f(v_1)&=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_1)=r_{\pi_1^*}+\gamma P_{\pi_1^*}v_1\geq r_{\pi_2^*}+\gamma P_{\pi_2^*}v_1,\\ f(v_2)&=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_2)=r_{\pi_2^*}+\gamma P_{\pi_2^*}v_2\geq r_{\pi_1^*}+\gamma P_{\pi_1^*}v_2. \end{aligned}$$

As a result,

$$\begin{aligned} f(v_1)-f(v_2)&=r_{\pi_1^*}+\gamma P_{\pi_1^*}v_1-(r_{\pi_2^*}+\gamma P_{\pi_2^*}v_2)\\ &\leq r_{\pi_1^*}+\gamma P_{\pi_1^*}v_1-(r_{\pi_1^*}+\gamma P_{\pi_1^*}v_2)\\ &=\gamma P_{\pi_1^*}(v_1-v_2). \end{aligned}$$

Similarly, it can be shown that $f(v_2)-f(v_1)\leq\gamma P_{\pi_2^*}(v_2-v_1)$. Therefore,

$$\gamma P_{\pi_2^*}(v_1-v_2)\leq f(v_1)-f(v_2)\leq\gamma P_{\pi_1^*}(v_1-v_2).$$

Define

$$z\doteq\max\{|\gamma P_{\pi_2^*}(v_1-v_2)|,\ |\gamma P_{\pi_1^*}(v_1-v_2)|\}\in\mathbb{R}^{|\mathcal{S}|},$$

where $\max(\cdot)$, $|\cdot|$, and $\geq$ are all elementwise operators. By definition, $z\geq 0$. On one hand, it is easy to see that

$$-z\leq\gamma P_{\pi_2^*}(v_1-v_2)\leq f(v_1)-f(v_2)\leq\gamma P_{\pi_1^*}(v_1-v_2)\leq z,$$

which implies

$$|f(v_1)-f(v_2)|\leq z.$$

It then follows that

$$\|f(v_1)-f(v_2)\|_{\infty}\leq\|z\|_{\infty},$$

where $\|\cdot\|_{\infty}$ is the maximum norm.

On the other hand, suppose that $z_i$ is the $i$th entry of $z$, and $p_i^T$ and $q_i^T$ are the $i$th rows of $P_{\pi_1^*}$ and $P_{\pi_2^*}$, respectively. Then,

$$z_i=\max\{\gamma|p_i^T(v_1-v_2)|,\ \gamma|q_i^T(v_1-v_2)|\}.$$

Since $p_i$ is a vector with all nonnegative elements whose sum equals one, it follows that

$$|p_i^T(v_1-v_2)|\leq p_i^T|v_1-v_2|\leq\|v_1-v_2\|_{\infty}.$$

Similarly, we have $|q_i^T(v_1-v_2)|\leq\|v_1-v_2\|_{\infty}$. Therefore, $z_i\leq\gamma\|v_1-v_2\|_{\infty}$, and hence

$$\|z\|_{\infty}=\max_{i}|z_i|\leq\gamma\|v_1-v_2\|_{\infty}.$$

Substituting this inequality into the bound above gives

$$\|f(v_1)-f(v_2)\|_{\infty}\leq\gamma\|v_1-v_2\|_{\infty},$$

which concludes the proof of the contraction property of $f(v)$.

Theorem (Existence, Uniqueness, and Algorithm)

For the BOE $v=f(v)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$, there always exists a solution $v^*$, and the solution is unique. The solution can be computed iteratively by

$$v_{k+1}=f(v_k)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_k)$$

This sequence $\{v_k\}$ converges to $v^*$ exponentially fast for any initial guess $v_0$. The convergence rate is determined by $\gamma$.
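
The iteration in this theorem is the value iteration algorithm. A minimal sketch under the usual assumptions (the expected-reward array `rewards[s, a]` and transition array `P[s, a, s']` are hypothetical inputs, not from the text):

```python
import numpy as np

def value_iteration(rewards, P, gamma, tol=1e-8):
    """Iterate v_{k+1} = max_pi (r_pi + gamma * P_pi v_k) until convergence.

    rewards[s, a] : expected immediate reward for the pair (s, a)
    P[s, a, s']   : transition probabilities p(s'|s, a)
    Returns the optimal state values and a greedy deterministic policy.
    """
    n_states, n_actions = rewards.shape
    v = np.zeros(n_states)
    while True:
        q = rewards + gamma * P @ v      # q[s, a] for the current v
        v_new = q.max(axis=1)            # elementwise maximization over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new
```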

Optimal policy#

Suppose $v^*$ is the solution to the Bellman optimality equation. It satisfies

$$v^*=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$$

Suppose

$$\pi^*=\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$$

Then

$$v^*=r_{\pi^*}+\gamma P_{\pi^*}v^*$$

Therefore, $\pi^*$ is a policy and $v^*=v_{\pi^*}$ is its corresponding state value.

Theorem (Optimality of $v^*$ and $\pi^*$)

The solution $v^*$ is the optimal state value, and $\pi^*$ is an optimal policy. That is, for any policy $\pi$, it holds that

$$v^*=v_{\pi^*}\geq v_{\pi},$$

where $v_{\pi}$ is the state value of $\pi$, and $\geq$ is an elementwise comparison.

Proof. For any policy $\pi$, it holds that

$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}.$$

Since

$$v^*=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)=r_{\pi^*}+\gamma P_{\pi^*}v^*\geq r_{\pi}+\gamma P_{\pi}v^*,$$

we have

$$v^*-v_{\pi}\geq(r_{\pi}+\gamma P_{\pi}v^*)-(r_{\pi}+\gamma P_{\pi}v_{\pi})=\gamma P_{\pi}(v^*-v_{\pi}).$$

Repeatedly applying the above inequality gives

$$v^*-v_{\pi}\geq\gamma P_{\pi}(v^*-v_{\pi})\geq\gamma^{2}P_{\pi}^{2}(v^*-v_{\pi})\geq\cdots\geq\gamma^{n}P_{\pi}^{n}(v^*-v_{\pi}).$$

It follows that

$$v^*-v_{\pi}\geq\lim_{n\rightarrow\infty}\gamma^{n}P_{\pi}^{n}(v^*-v_{\pi})=0,$$

where the last equality holds because $\gamma<1$ and $P_{\pi}^{n}$ is a nonnegative matrix with all its elements less than or equal to $1$ (because $P_{\pi}^{n}\mathbf{1}=\mathbf{1}$). Therefore, $v^*\geq v_{\pi}$ for any $\pi$.

Theorem (Greedy optimal policy)

For any $s\in\mathcal{S}$, the deterministic greedy policy

$$\pi^*(a|s)=\begin{cases}1, & a=a^*(s)\\ 0, & a\neq a^*(s)\end{cases}$$

is an optimal policy for solving the BOE. Here,

$$a^*(s)=\arg\max_{a}q^*(s,a),$$

where

$$q^*(s,a)\doteq\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v^*(s')$$

Proof.

While the matrix-vector form of the optimal policy is $\pi^*=\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$, its elementwise form is

$$\pi^*(s)=\arg\max_{\pi\in\Pi}\sum_{a\in\mathcal{A}}\pi(a|s)\underbrace{\left(\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v^*(s')\right)}_{q^*(s,a)},\quad s\in\mathcal{S}.$$

It is clear that $\sum_{a\in\mathcal{A}}\pi(a|s)\,q^*(s,a)$ is maximized if $\pi(s)$ selects the action with the greatest $q^*(s,a)$.
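
In code, extracting this greedy policy from a table of optimal action values is just an argmax per state; a tiny sketch with a hypothetical q-table:

```python
import numpy as np

q_star = np.array([[0.5, 2.0, 1.0],   # hypothetical q*(s, a): 2 states, 3 actions
                   [1.5, 0.2, 1.4]])

a_star = q_star.argmax(axis=1)        # a*(s) = argmax_a q*(s, a)
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(a_star)), a_star] = 1.0   # deterministic one-hot policy
print(pi_star)
# [[0. 1. 0.]
#  [1. 0. 0.]]
```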

Factors that influence optimal policies#

The BOE is a powerful tool for analyzing optimal policies. We next apply the BOE to study what factors can influence optimal policies. This question can be easily answered by observing the elementwise expression of the BOE:

$$v(s)=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\left(\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s')\right),\quad s\in\mathcal{S}.$$

The optimal state value and optimal policy are determined by the following parameters:

  1. the immediate reward $r$,

  2. the discount rate $\gamma$,

  3. the system model $p(s'|s,a)$, $p(r|s,a)$.

While the system model is fixed, we next discuss how the optimal policy varies when we change the values of $r$ and $\gamma$.

Impact of the discount rate#

(Figures: grid-world optimal policies under different discount rates. A larger $\gamma$ makes the optimal policy more far-sighted, while a smaller $\gamma$ makes it more short-sighted, caring mainly about near-future rewards.)

Impact of the reward values#

Theorem (Optimal policy invariance)

Consider a Markov decision process with $v^*\in\mathbb{R}^{|\mathcal{S}|}$ as the optimal state value satisfying

$$v^*=\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v^*).$$

If every reward $r\in\mathcal{R}$ is changed by an affine transformation to $\alpha r+\beta$, where $\alpha,\beta\in\mathbb{R}$ and $\alpha>0$, then the corresponding optimal state value $v'$ is also an affine transformation of $v^*$:

$$v'=\alpha v^*+\frac{\beta}{1-\gamma}\mathbf{1},$$

where $\gamma\in(0,1)$ is the discount rate and $\mathbf{1}=[1,\ldots,1]^T$.

Consequently, the optimal policy derived from $v'$ is invariant to the affine transformation of the reward values.

Proof. For any policy $\pi$, define $r_{\pi}=[\ldots,r_{\pi}(s),\ldots]^T$, where $r_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)\,r$ for $s\in\mathcal{S}$. If $r\to\alpha r+\beta$, then $r_{\pi}(s)\to\alpha r_{\pi}(s)+\beta$ and hence $r_{\pi}\to\alpha r_{\pi}+\beta\mathbf{1}$, where $\mathbf{1}=[1,\ldots,1]^T$. In this case, the BOE becomes

$$v'=\max_{\pi\in\Pi}(\alpha r_{\pi}+\beta\mathbf{1}+\gamma P_{\pi}v').$$

We next solve the new BOE by showing that $v'=\alpha v^*+c\mathbf{1}$ with $c=\beta/(1-\gamma)$ is a solution. In particular, substituting $v'=\alpha v^*+c\mathbf{1}$ gives

$$\alpha v^*+c\mathbf{1}=\max_{\pi\in\Pi}(\alpha r_{\pi}+\beta\mathbf{1}+\gamma P_{\pi}(\alpha v^*+c\mathbf{1}))=\max_{\pi\in\Pi}(\alpha r_{\pi}+\beta\mathbf{1}+\alpha\gamma P_{\pi}v^*+\gamma c\mathbf{1}),$$

where the last equality is due to the fact that $P_{\pi}\mathbf{1}=\mathbf{1}$. The above equation can be reorganized as $\alpha v^*=\max_{\pi\in\Pi}(\alpha r_{\pi}+\alpha\gamma P_{\pi}v^*)+\beta\mathbf{1}+\gamma c\mathbf{1}-c\mathbf{1}$, which is equivalent to

$$\beta\mathbf{1}+\gamma c\mathbf{1}-c\mathbf{1}=0.$$

Since $c=\beta/(1-\gamma)$, the above equation holds, and hence $v'=\alpha v^*+c\mathbf{1}$ is the solution.

Since the new equation is itself a BOE, $v'$ is also its unique solution. Finally, since $v'$ is an affine transformation of $v^*$, the relative relationships between the action values remain the same.

Hence, the greedy optimal policy derived from $v'$ is the same as that from $v^*$: $\arg\max_{\pi\in\Pi}(r_{\pi}+\gamma P_{\pi}v')$ is the same as $\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^*)$.
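
This invariance can be checked numerically: scale and shift every reward, recompute the optimal values by value iteration, and the result matches $\alpha v^*+\frac{\beta}{1-\gamma}\mathbf{1}$ while the greedy policy stays the same. The tiny two-state model below is a made-up example, not from the text:

```python
import numpy as np

gamma = 0.9
rewards = np.array([[0.0, 1.0],           # hypothetical rewards[s, a]
                    [2.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # P[s, a, s']
              [[0.5, 0.5], [1.0, 0.0]]])

def optimal_values(r):
    v = np.zeros(2)
    for _ in range(2000):                 # plenty of iterations for gamma = 0.9
        v = (r + gamma * P @ v).max(axis=1)
    return v

alpha, beta = 2.0, 3.0
v_star = optimal_values(rewards)
v_prime = optimal_values(alpha * rewards + beta)

print(np.allclose(v_prime, alpha * v_star + beta / (1 - gamma)))   # True
greedy = lambda r, v: (r + gamma * P @ v).argmax(axis=1)
print(np.array_equal(greedy(rewards, v_star),
                     greedy(alpha * rewards + beta, v_prime)))     # True: same policy
```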
