[Paper Note] Linearly parameterized bandits

2 minute read

Published: March 07, 2024

Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2), 395-411.

The key ideas for finding bounds is to

attribute regret to the causes: exploration or estimation error
find the relation between these two parts, if any

Model formulation

$\mathcal{U}_r \subset \mathbb{R}^r$ a compact set if arms ($r\geq2$), The reward in period $t$ is given by

$X_{t}^{u}=u^{T}Z + W_{t}^{u},\tag{1}$

$Z\in\mathbb{R}^r$ is a random variable
$W_{t}^{u}$ are iid mean-zero random variables

A policy $\psi=(\psi_{1},\psi_{2},…)$ is a sequence of functions, each mapping from history to $\mathcal{U}_r$. For any policy $\psi$ and $z\in\mathbb{R}^r$, the cumulative regret under $\psi$ given $Z=z$ is

\[\text{Regret}(z,T,\psi)=\sum\limits_{t=1}^{T}E\bigg[\max_{v\in\mathcal{U}_{r}}v^Tz - U^{T}_{t}z\,\vert\,Z=z\bigg],\]

The cumulative Bayes risk under $\psi$ is the expectation w.r.t. the prior of $Z$.

\[\text{Risk}(T,\psi)=E[\text{Regret}(Z,T,\psi)]\]

regret table

Lower bounds

Linear bandits has $\Omega(r\sqrt{T})$ lower bounds on the Bayes risk and thus on regret, given normal prior on $Z$.

The cumulative risk can be lower-bounded by the estimator error variance and the total amount of exploration.
Lemma (risk decomposition) Let $S_{t}^{1},…, S_{t}^{r-1}$ denote a collection of orthogonal unit vectors that are also orthogonal to $\hat{Z}$. For any $T\geq 1$, $\text{Risk}(T,\psi)\geq \frac{1}{2}\sum\limits_{k=1}^{T}E \bigg[||Z||\sum\limits_{t=1}^{T}(T_{t}^{T}S_{t}^{k})^{2} + \frac{T}{||Z||} \{(Z-\hat{Z}_{T})^{T}S_{T}^{k}\}^2\bigg]$
The two terms are interelated. Little exploration implies large estimation error
Lemma For any $k$ and $T\geq 1$, $E[\{(Z-\hat{Z})^{T}S_{T}^{k}\}^{2}\vert H_{T}]\geq\frac{1}{r+\sum\limits_{t=1}^{T}(T_{t}^{T}S_{t}^{k})^{2} },$ where $r$ is prior precision of $Z$.
There is a lower bound on the probability that $Z$ is bounded away from 0.
Then we can derive a minimum directional risk.

Upper bounds: PEGE algorithm

Algorithm: phased-exploration-and-exploitation

There are two phases in each cycle $c\geq 1$:

Exploration ($r$ periods) play arm $b_k$. The explored arms span the entire space. compute the OLS estimate $\hat{Z}(c)=Z+\frac{1}{c}(\sum\limits_{k=1}^{r}b_{k}b_{k} ^{T})^{-1}\sum\limits_{s=1}^{c}\sum\limits_{k=1}^{r}b_{k}W^{b_{k}}(s)$
Exploitation ($c$ periods) play the greedy arm $G(c)=\arg\max_{v\in\mathcal{U}_{r}} v^{T}\hat{Z}(c)$. In the algorithm, $c$ is of the order $O(\sqrt{T})$.

Assumptions

Subgaussian noise
Arms are bounded, and include $r$ linearlly independent components
$\mathcal{U}_{r}$ satisfy smooth best arm response with parameter J $||u^{*}(z)-u^{*}(y)||\leq J||z-y||$

Upper bound

For PEGE, we can explicitly disentangle risk caused by exploration and misspecification

Thm. There exists a positive constant $a_{1}$ that depends only on the noise bounds, arm bounds and response bounds, such that for any $z$ and $T\geq r$, $\text{Regret}(z,T,\text{PEGE})\leq a_{1}(||z||+\frac{1}{||z||})r\sqrt{T}$

Since the arm bound provides a trivial bound $2\bar{u}\vert\vert z\vert\vert$ on instantaneous regret, the bound does not deteriorate as $\vert\vert z\vert\vert$ approaches 0.

Proof sketch:

There is an upper bound on the squared norm error $E[||\hat{Z}(c)-z||^{2}| Z=z]\leq \frac{h_{1}r}{c}$
Expected instantaneous regret under greedy decision is of order $O(\vert\vert Z-\hat{Z}\vert\vert^{2})$ given smoothness assumption.
Over total $K$ cycles, $K=O(\sqrt{T})$ $\text{Regret}\left(z,rK+\sum\limits_{c=1}^{K}c,\text{PEGE}\right)\leq h_{3}r||z||K+h_{4}\sum\limits_{c=1}^{K} \frac{r}{||z||}$

Share on

Twitter Facebook LinkedIn

Yifan Hong

[Paper Note] Linearly parameterized bandits

Model formulation

Lower bounds

Upper bounds: PEGE algorithm

Algorithm: phased-exploration-and-exploitation

Assumptions

Upper bound

Share on

You May Also Enjoy

Universal law of generalization (1)

[Paper Note] A utility criterion for Markov decision processes

[Paper Note] Risk-sensitive Markov decision processes

[Paper Note] Additive von Neumann-Morgenstern utility functions