Saurabh Paul, Christos Boutsidis, et al.
JMLR
We study contextual bandit algorithms for high-dimensional combinatorial action spaces involving a class of structured discrete constrained optimization. We propose a novel approach based on general-purpose regression oracles, leveraging Inverse Gap Weighting (IGW) for seamless integration within an optimization framework. IGW sampling is efficiently managed by: (a) Column generation reformulation of the underlying Mixed Integer Programming (MIP) model, enabling flexible lower-level predictors, causal coherence, and efficient representation of large action spaces. (b) Diverse solution pool generation to balance the exploration-exploitation trade-off in large action spaces. To address discontinuities in the reward function’s gradient due to optimization constraints, we incorporate a risk-averse phased learning strategy. Through ablation studies, we demonstrate that each component is critical for maximizing cumulative rewards. Finally, we validate the practical viability of our approach on a real-world Auto-Scaling problem in IT automation.
Saurabh Paul, Christos Boutsidis, et al.
JMLR
C.A. Micchelli, W.L. Miranker
Journal of the ACM
Joxan Jaffar
Journal of the ACM
Cristina Cornelio, Judy Goldsmith, et al.
JAIR