[Paper Note] A utility criterion for Markov decision processes
Jaquette, S. C. (1976). A utility criterion for Markov decision processes. Management Science, 23(1), 43-49.
Utility criterion
The following utility criterion is used: $u(X)=-\exp\left(-\sum_{t}\lambda_{t}x_{t}\right)$, where $\lambda_{t}=\lambda\beta^{t}$, reflecting time discounting and constant risk aversion.
- The functional form is log-additive (Pollak, 1967)
- It also satisfies risk independence
- The formulation can also be viewed as a utility function of the total discounted reward, since $\sum_{t}\lambda_{t}x_{t}=\lambda\sum_{t}\beta^{t}x_{t}$
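
As a rough illustration of the criterion, the following Python sketch evaluates $u(X)$ for a finite reward sequence in two ways: directly with the per-period weights $\lambda_t=\lambda\beta^{t}$, and as an exponential utility of the total discounted reward. The function names, the sample rewards, and the choice to start indexing at $t=0$ are assumptions made for this example and are not taken from the paper.

```python
import math

def exp_utility(rewards, lam, beta):
    """u(X) = -exp(-sum_t lam * beta**t * x_t).

    Assumption: time is indexed from t = 0 and the sequence is finite.
    """
    weighted = sum(lam * beta**t * x for t, x in enumerate(rewards))
    return -math.exp(-weighted)

def discounted_total(rewards, beta):
    """Total reward discounted to the current period: sum_t beta**t * x_t."""
    return sum(beta**t * x for t, x in enumerate(rewards))

# Hypothetical reward stream and parameters, chosen only for illustration.
rewards = [1.0, 2.0, 0.5]
lam, beta = 0.5, 0.9

u_direct = exp_utility(rewards, lam, beta)
u_total = -math.exp(-lam * discounted_total(rewards, beta))

# Both forms agree up to floating-point rounding.
print(math.isclose(u_direct, u_total))
```

The agreement of the two values is just the identity $\sum_{t}\lambda\beta^{t}x_{t}=\lambda\sum_{t}\beta^{t}x_{t}$ checked numerically.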
Utility optimality
A policy $\pi=(f_1,\ldots,f_t,\ldots)$ is a sequence of decision vectors, where each $f_t$ maps states to actions, so the action taken in state $s_t$ at time $t$ is $\pi_t(s_t)=f_t(s_t)$.
- A policy $\pi^{*}$ is utility optimal with constant risk aversion $\lambda$ if $u_{\pi^{*}}(\lambda)\geq u_{\pi}(\lambda)$ for all $\pi$.
- A policy $\pi=(f_1,\ldots,f_t,\ldots)$ is ultimately stationary if $f_t=f$ for all $t$ larger than some finite integer.
Thm. For all $\lambda\geq 0$ and all $0\leq\beta<1$, there exists a utility optimal policy $\pi^{*}(\lambda,\beta)$ that is ultimately stationary. The policy $\pi^{*}$ can be chosen as a piecewise constant function of $\lambda$ and $\beta$, with the ultimate action vector $f$ being a piecewise constant function of $\beta$ only (and constant for all $\beta$ close to 1).
- Stationary policies are not utility optimal in general.
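
One way to see why stationarity can fail is that the effective risk aversion at time $t$ is $\lambda\beta^{t}$, which shrinks as $t$ grows, so early decisions can be more cautious than later ones. The sketch below is a toy illustration, not the paper's algorithm: it truncates the problem to a finite horizon (an assumption, since the theorem concerns the infinite-horizon case) and runs backward induction on $E[\exp(-\sum_{t}\lambda\beta^{t}x_{t})]$, which is minimized because $u$ is its negative. The array layout `P[a][s][s2]`, `R[a][s][s2]` and the two-action example are hypothetical.

```python
import math

def backward_induction(P, R, lam, beta, horizon):
    """Finite-horizon sketch: minimise E[exp(-sum_t lam * beta**t * x_t)].

    P[a][s][s2]: transition probability, R[a][s][s2]: reward (hypothetical layout).
    Returns (V, policy), where policy[t][s] is the action index chosen at time t.
    """
    n_actions, n_states = len(P), len(P[0])
    V = [1.0] * n_states                  # terminal condition: exp(0) = 1
    policy = []
    for t in reversed(range(horizon)):
        mu = lam * beta**t                # effective risk aversion at time t
        new_V, f_t = [], []
        for s in range(n_states):
            costs = [
                sum(P[a][s][s2] * math.exp(-mu * R[a][s][s2]) * V[s2]
                    for s2 in range(n_states))
                for a in range(n_actions)
            ]
            best = min(range(n_actions), key=costs.__getitem__)
            f_t.append(best)
            new_V.append(costs[best])
        V = new_V
        policy.append(f_t)
    policy.reverse()                      # policy[0] is the first-period decision
    return V, policy

# Toy problem: two interchangeable states, so only the action matters.
# Action 0 ("safe") pays 1 for sure; action 1 ("risky") pays 0 or 3 with equal odds.
half = [[0.5, 0.5], [0.5, 0.5]]
P = [half, half]
R = [[[1.0, 1.0], [1.0, 1.0]],
     [[0.0, 3.0], [0.0, 3.0]]]

V, policy = backward_induction(P, R, lam=2.0, beta=0.5, horizon=6)
print(policy)
```

In this toy instance the computed decision vector changes partway through the horizon, from the safe action in early periods to the risky one once $\lambda\beta^{t}$ is small, and then stays fixed, which is the non-stationary but ultimately stationary shape the theorem describes.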
