Almost Hyperparameter-Free Hyperparameter Selection Framework for Offline Policy Evaluation
We are concerned with the problem of hyperparameter selection for offline policy evaluation (OPE). OPE is a key component of offline reinforcement learning, which is a core technology for data-driven decision optimization without environment simulators. However, the current state-of-the-art OPE methods are not hyperparameter-free, which undermines their utility in real-life applications. We address this issue by introducing a new approximate hyperparameter selection (AHS) framework for OPE, which defines a notion of optimality (called the selection criterion) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods, each of which has different characteristics such as convergence rate and time complexity. Finally, we verify the effectiveness and limitations of these methods with a preliminary experiment.