Talk

Dynamic Bandits with Temporal Structure

Abstract

Multi-armed bandits (MABs) are mainly studied under two extreme settings known as stochastic and adversarial. These two settings, however, do not capture realistic environments such as search engines and marketing and advertising, in which rewards change stochastically over time. Motivated by this, we introduce and study dynamic MABs with stochastic temporal structure. In this problem, the expected reward of each arm follows an auto-regressive (AR) model that governs the temporal structure of the rewards. Due to the dynamic nature of the rewards, simple "explore-then-commit" policies fail, as all arms must be explored continuously over time. We formalize this by characterizing a per-round regret lower bound, where regret is measured against a strong benchmark. We then present an algorithm whose per-round regret nearly matches this lower bound.
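The reward model can be made concrete with a small simulation. The sketch below is a minimal illustration, not the paper's construction: it assumes a first-order AR(1) process for each arm, and the parameters gamma, b, and sigma as well as the uniform-random placeholder policy are illustrative choices. It shows why expected rewards keep drifting over time, so no arm can be permanently discarded after an initial exploration phase.

    import numpy as np

    # Illustrative AR(1) reward dynamics: for each arm k,
    #   mu_k(t) = gamma_k * mu_k(t-1) + (1 - gamma_k) * b_k + noise,
    # so the expected reward of every arm changes stochastically each round.
    rng = np.random.default_rng(0)
    K, T = 3, 1000                      # number of arms, horizon
    gamma = rng.uniform(0.8, 0.99, K)   # per-arm AR coefficients (assumed)
    b = rng.uniform(0.0, 1.0, K)        # per-arm long-run means (assumed)
    sigma = 0.05                        # innovation noise scale (assumed)

    mu = b.copy()                       # expected rewards, start at long-run means
    for t in range(T):
        # AR(1) update of every arm's expected reward
        mu = gamma * mu + (1 - gamma) * b + sigma * rng.standard_normal(K)
        arm = int(rng.integers(K))      # placeholder policy: uniform random pull
        reward = mu[arm] + sigma * rng.standard_normal()  # noisy observed reward

    print("final expected rewards:", np.round(mu, 3))

Because mu drifts every round, any policy that stops exploring an arm loses track of that arm's current expected reward, which is the intuition behind the per-round regret lower bound stated in the abstract.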

Date

16 Oct 2022

Publication

INFORMS 2022
