Publication
INFORMS 2022
Talk
Dynamic Bandits with Temporal Structure
Abstract
Multi-armed bandits are mainly studied under two extreme settings, known as stochastic and adversarial. Neither of these settings, however, captures realistic environments such as search engines, marketing, and advertising, in which rewards change stochastically over time. Motivated by this, we introduce and study dynamic MABs with stochastic temporal structure. In this problem, the expected reward of each arm follows an auto-regressive (AR) model that governs the temporal structure of the rewards. Due to the dynamic nature of the rewards, simple "explore and commit" policies fail, as all arms must be explored continuously over time. We formalize this by characterizing a per-round regret lower bound, where regret is measured against a strong benchmark. We then present an algorithm whose per-round regret nearly matches this lower bound.
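To make the setup concrete, here is a minimal simulation sketch. It assumes an AR(1) special case of the reward model (each arm's mean evolves as mu_t = gamma * mu_{t-1} + noise) and pairs it with a naive epsilon-greedy policy using exponential forgetting. The coefficient gamma, the exploration rate, and the policy itself are illustrative choices, not the algorithm from the talk; the sketch only demonstrates why continuous exploration is needed when rewards drift.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5            # number of arms (hypothetical)
T = 10_000       # horizon
gamma = 0.95     # AR(1) coefficient (assumed; the talk covers general AR models)
sigma = 0.1      # innovation and observation noise scale (assumed)
eps = 0.05       # forced-exploration rate (heuristic choice)

mu = rng.normal(size=K)   # latent mean rewards
est = np.zeros(K)         # recency-weighted reward estimates
alpha = 1 - gamma         # estimate step size (heuristic choice)

regret = 0.0
for t in range(T):
    # AR(1) dynamics: each arm's mean drifts stochastically every round
    mu = gamma * mu + rng.normal(scale=sigma, size=K)

    # Explore-and-commit fails here because estimates go stale,
    # so keep sampling arms uniformly with probability eps.
    if rng.random() < eps:
        arm = int(rng.integers(K))
    else:
        arm = int(np.argmax(est))

    reward = mu[arm] + rng.normal(scale=sigma)
    est[arm] += alpha * (reward - est[arm])   # exponential forgetting

    # Regret against the strong benchmark: the best arm at this round
    regret += mu.max() - mu[arm]

print(f"per-round regret: {regret / T:.4f}")
```

In this simulation the per-round regret stays bounded away from zero, consistent with the abstract's message that a per-round regret lower bound is unavoidable under such temporal structure.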