Publication
ICML 2022
Conference paper

Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits

Abstract

We study the problem of dynamic regret minimization in $K$-armed Dueling Bandits under non-stationary or time-varying preferences. This is an online learning setup where, at each round, the agent chooses a pair of items and observes only relative binary `win-loss' feedback for this pair, sampled from an underlying preference matrix at that round. We first study the problem of static-regret minimization for adversarial preference sequences and design an efficient algorithm with an $\tilde{O}(\sqrt{KT})$ regret bound. We then use similar algorithmic ideas to propose an efficient and provably optimal algorithm for dynamic-regret minimization under two notions of non-stationarity. In particular, we show $\tilde{O}(\sqrt{SKT})$ and $\tilde{O}(V_T^{1/3}K^{1/3}T^{2/3})$ dynamic-regret guarantees, respectively, where $S$ is the total number of `effective switches' in the underlying preference relations and $V_T$ is a measure of `continuous-variation' non-stationarity. These rates are provably optimal, as justified by matching lower bounds. Moreover, our proposed algorithms are flexible, as they can easily be `blackboxed' to yield dynamic-regret guarantees for other notions of dueling-bandit regret, including Condorcet regret, best-response bounds, and Borda regret. The complexity of these problems has not been studied prior to this work, despite the practicality of non-stationary environments. Extensive simulations corroborate our results.
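To make the setup concrete, here is a minimal Python sketch of the dueling feedback model and the dynamic-regret objective under a piecewise-stationary (switching) preference sequence. It is illustrative only: the Bradley-Terry utilities (an assumption that guarantees a per-round Condorcet winner), the uniformly spaced switch points, and the random-pair learner are all hypothetical stand-ins for the demo, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, S = 10, 5000, 4  # arms, horizon, number of preference switches

def bt_preference_matrix(util):
    """Bradley-Terry preference matrix: P[i, j] = u_i / (u_i + u_j).

    P[i, j] is the probability that arm i beats arm j in a duel;
    it satisfies P[i, j] + P[j, i] = 1 with 0.5 on the diagonal.
    """
    u = util[:, None]
    return u / (u + u.T)

# Piecewise-stationary preference sequence: utilities are redrawn at
# S uniformly spaced switch points, giving S 'effective switches'.
switch_points = np.linspace(0, T, S + 1, dtype=int)[:-1]
utils = rng.uniform(0.1, 1.0, size=K)
regret = 0.0

for t in range(T):
    if t in switch_points:
        utils = rng.uniform(0.1, 1.0, size=K)
    P = bt_preference_matrix(utils)
    winner = int(np.argmax(utils))  # Condorcet winner under Bradley-Terry

    # A real algorithm would pick (i, j) based on past feedback; a
    # uniformly random pair stands in for the learner here.
    i, j = rng.choice(K, size=2, replace=False)
    win = rng.random() < P[i, j]  # binary feedback: 1 iff arm i beats arm j

    # Per-round dynamic (Condorcet) regret of the played pair, measured
    # against that round's winner; summing over t gives dynamic regret.
    regret += 0.5 * (P[winner, i] + P[winner, j] - 1.0)

print(f"dynamic regret of random play over T={T}: {regret:.1f}")
```

Random play accumulates regret linear in $T$; the point of the paper's guarantees is that a learner using only the binary `win' bits can keep this sum at $\tilde{O}(\sqrt{SKT})$ despite the switches.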

Date

17 Jul 2022

