[Paper Note] Risk-sensitive Markov decision processes


Howard, R. A., & Matheson, J. E. (1972). Risk-sensitive Markov decision processes. Management Science, 18(7), 356-369.

This paper provides a formulation and an algorithm for maximizing the certain equivalent of the reward (utility) generated by an MDP.

  • The eigenvalue $\lambda$ of $Q$ may (?) be used to elicit the risk attitude $\gamma$

Definitions and notations

$\tilde{v}$ is the certain equivalent of a lottery with random outcome $v$: $u(\tilde{v})=E[u(v)]$

Let $u_i(n)=u(\tilde{v}_{i}(n))$ denote the utility of the reward process when it occupies state $i$ with $n$ periods remaining.

Assumption

Assume the delta property: when all prizes in a lottery increase by the same amount, the certain equivalent increases by that same amount.

  • To ensure this property, the utility function has to be either linear or exponential
\[u(v)=-(\operatorname{sgn}\gamma)e^{-\gamma v}\] \[u^{-1}(x)=-\frac{1}{\gamma}\ln(-(\operatorname{sgn}\gamma)x)\]

With exponential utility, a constant increase $\Delta$ in all prizes of the lottery becomes a multiplicative factor on the utility: \(u(v+\Delta)=e^{-\gamma \Delta} u(v)\)
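A minimal numerical check of this multiplicative form (the values of $\gamma$, $v$, and $\Delta$ below are arbitrary, not from the paper):

```python
import numpy as np

def u(v, gamma):
    """Exponential utility u(v) = -sgn(gamma) * exp(-gamma * v)."""
    return -np.sign(gamma) * np.exp(-gamma * v)

gamma = 0.5
v, delta = 2.0, 1.0

# A constant shift Delta multiplies the utility by exp(-gamma * Delta)
lhs = u(v + delta, gamma)
rhs = np.exp(-gamma * delta) * u(v, gamma)
assert np.isclose(lhs, rhs)
```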

Formulation

Reward process

\[\begin{equation} \begin{split} u(\tilde{v}_i(n+1))&=\sum\limits_{j\in S}p_{ij}(n+1) u(r_{ij}(n+1)+\tilde{v}_{j}(n)) \\ &= \sum\limits_{j\in S}p_{ij}(n+1)e^{-\gamma r_{ij}(n+1)} u(\tilde{v}_{j}(n)) \end{split} \end{equation}\]

Stationary MRP

Equivalently, we can use matrix notation, $Q_{ij}=p_{ij}e^{-\gamma r_{ij}}$

\[\text{u}(n)=Q^{n}\text{u}(0)\]

We can show that

\[\lim_{n\rightarrow \infty}\left[\tilde{v}_{i}(n)-n\left(- \frac{1}{\gamma} \ln\lambda\right)\right]=\tilde{v}_{i}+c\]

  • $\lambda$ is the dominant (largest) eigenvalue of $Q$
  • $\tilde{v}_{i}$ is the relative certain equivalent
  • $\tilde{g}=- \frac{1}{\gamma} \ln\lambda$ is the certain equivalent gain

We can rewrite the consistency condition as

\[\lambda u_{i}= \sum\limits_{j\in S} q_{ij}u_{j}\]
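A small sketch of these quantities for an assumed 2-state stationary MRP (the probabilities and rewards are made up for illustration): build $Q_{ij}=p_{ij}e^{-\gamma r_{ij}}$, take its dominant eigenvalue $\lambda$, and recover the certain equivalent gain $\tilde{g}=-\frac{1}{\gamma}\ln\lambda$.

```python
import numpy as np

gamma = 0.5
# hypothetical 2-state MRP: transition probabilities and rewards
p = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([[1.0, 2.0],
              [0.5, 1.5]])

Q = p * np.exp(-gamma * r)        # Q_ij = p_ij * exp(-gamma * r_ij)

eigvals, eigvecs = np.linalg.eig(Q)
k = np.argmax(eigvals.real)       # dominant eigenvalue (Perron root of Q)
lam = eigvals[k].real
u_vec = eigvecs[:, k].real

g_tilde = -np.log(lam) / gamma    # certain equivalent gain

# consistency condition: lam * u_i = sum_j q_ij * u_j
assert np.allclose(lam * u_vec, Q @ u_vec)
```

Since all rewards here are positive, every row sum of $Q$ is below 1, so $\lambda<1$ and the gain $\tilde{g}$ comes out positive.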

Decision process

\(u_{i}(n+1)=\max_{k}\sum\limits_{j} p_{ij}^{k}(n+1)e^{-\gamma r^{k}_{ij}(n+1)} u_j(n)\)

  • This equation allows us to compute the optimal policy and the optimal utility
  • We can also find $\tilde{v}_{i}(n)$, the certain equivalent of the lottery implied by being in state $i$ with $n$ stages left
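The backward recursion above can be sketched as risk-sensitive value iteration. The 2-state, 2-action data below is an assumed stationary example (not from the paper); terminal utilities are $u_i(0)=u(0)=-1$ for $\gamma>0$, and since $u$ is increasing, maximizing utility also maximizes the certain equivalent.

```python
import numpy as np

gamma = 0.5
# hypothetical MDP data, indexed p[k, i, j] and r[k, i, j]
p = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
r = np.array([[[1.0, 2.0], [0.5, 1.5]],
              [[2.0, 0.5], [1.0, 3.0]]])

n_stages = 20
u = -np.ones(2)               # u_i(0) = u(0) = -1: terminal certain equivalent is 0
for n in range(n_stages):
    # utility of action k in state i: sum_j p^k_ij exp(-gamma r^k_ij) u_j(n)
    q = np.einsum('kij,kij,j->ki', p, np.exp(-gamma * r), u)
    policy = q.argmax(axis=0)  # u is increasing, so max utility = max certain equivalent
    u = q.max(axis=0)

v_tilde = -np.log(-u) / gamma  # certain equivalents v~_i(n_stages)
```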

Algorithm for stationary policy

Policy evaluation: for the present policy, solve

\[e^{-\gamma (\tilde{g}+\tilde{v}_{i})} = \sum\limits_{j\in S} p_{ij} \cdot e^{-\gamma (r_{ij}+\tilde{v}_j)}\]

with $\tilde{v}_{N}=0$, for certain equivalent gain $\tilde{g}$ and the relative certain equivalents.

Policy improvement: for each state $i$, find the alternative $k$ that maximizes the certain equivalent

\[\tilde{v}_{i}^{k}=- \frac{1}{\gamma}\ln \left[\sum\limits_{j} p_{ij}^{k}e^{-\gamma (r_{ij}^{k} +\tilde{v}_{j})}\right]\]
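Both steps can be sketched together (assumed 2-state, 2-action data, not from the paper). Writing $u_i=e^{-\gamma\tilde{v}_i}$ and $\lambda=e^{-\gamma\tilde{g}}$, the evaluation equation is exactly the eigenproblem $\lambda u_i=\sum_j q_{ij}u_j$, so evaluation reduces to the dominant eigenpair of the policy's $Q$ matrix, normalized so $\tilde{v}_N=0$.

```python
import numpy as np

gamma = 0.5
# hypothetical MDP data, indexed p[k, i, j] and r[k, i, j]
p = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
r = np.array([[[1.0, 2.0], [0.5, 1.5]],
              [[2.0, 0.5], [1.0, 3.0]]])
n_states = 2

def evaluate(policy):
    """Dominant eigenpair of the policy's Q gives g~ and the relative v~_i."""
    Q = np.array([p[policy[i], i] * np.exp(-gamma * r[policy[i], i])
                  for i in range(n_states)])
    w, V = np.linalg.eig(Q)
    k = np.argmax(w.real)
    u = np.abs(V[:, k].real)        # Perron eigenvector, taken positive: u_i = e^{-gamma v~_i}
    g = -np.log(w[k].real) / gamma  # certain equivalent gain
    v = -np.log(u) / gamma
    return g, v - v[-1]             # normalize so v~_N = 0

def improve(v):
    """In each state, pick the alternative maximizing the certain equivalent."""
    ce = -np.log(np.einsum('kij,kij,j->ki', p, np.exp(-gamma * r),
                           np.exp(-gamma * v))) / gamma
    return ce.argmax(axis=0)

policy = np.zeros(n_states, dtype=int)
for _ in range(100):                 # policy iteration converges in a few steps
    g, v = evaluate(policy)
    new_policy = improve(v)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```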