Markov Decision Process Framework for Control-Based Reinforcement Learning
Abstract
For many years, reinforcement learning (RL) has proven to be very successful in solving a wide variety of learning and decision-making under uncertainty (DMuU) problems, including those related to game playing and robotic control. Many different RL approaches, with varying levels of success, have been developed to address these problems. Among these approaches, model-free RL has been successful in solving various DMuU problems without any prior knowledge. Such model-free approaches, however, often suffer from high sample complexity, requiring an inordinate number of samples for some problems, which can be prohibitive in practice, especially for problems limited by time or other constraints. Model-based RL has demonstrated significantly reduced sample complexity and has outperformed model-free approaches for various DMuU problems. Such model-based approaches, however, often suffer from the difficulty of learning an appropriate system model and from worse asymptotic performance than model-free approaches due to model bias, which arises from the inherent assumption that the learned system dynamics model accurately represents the true system environment; in addition, the optimal control policy is often only approximated based on the learned system dynamics model. We propose herein a novel form of RL that seeks to directly learn an optimal control policy of a general underlying (unknown) dynamical system and to directly apply the corresponding learned optimal control policy within the dynamical system. This general approach stands in strong contrast to many traditional model-based RL methods that first learn a system dynamics model, often of high complexity and dimensionality, and then use this model to compute an optimal solution of a corresponding dynamic programming problem, often applying model predictive control. Our control-based RL approach instead learns the optimal parameters that determine an optimal policy function within a family of control policy functions, often of much lower complexity and dimensionality, from which the optimal control policy is directly obtained. Furthermore, we establish that our general approach converges to an optimal solution, analogous to model-free RL approaches, while eliminating the problems of model bias in traditional model-based RL approaches. The theoretical foundation and analysis of our control-based RL approach are introduced within the context of a general Markov decision process (MDP) framework that extends the policy associated with the classical Bellman operator to a family of control policy functions derived from a corresponding parameter set, expands the domain of these policies from a single state to span across states, and extends the associated optimality criteria through these generalizations of the definition and scope of a control policy, all providing theoretical support for our general control-based RL approach. Within this MDP framework, we establish results on convergence with respect to both a contraction operator and a corresponding form of Q-learning, establish results on various aspects of optimality and optimal control policies, and introduce a new form of policy-parameter gradient ascent. To the best of our knowledge, this is the first proposal and analysis of such a general control-based RL approach with theoretical support from an underlying extended MDP framework.
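For concreteness, one simple way to picture this extension of the classical Bellman operator (the notation below is purely illustrative and not necessarily the exact operator of our analysis) is to replace the per-state maximization over actions with a maximization over the parameter set indexing the family of control-policy functions:
\[
(T Q)(s,a) \;=\; r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[ \max_{a' \in \mathcal{A}} Q(s',a') \Big] \quad\text{(classical)},
\]
\[
(T_{\Theta} Q)(s,a) \;=\; r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[ \max_{\theta \in \Theta} Q\big(s', \pi_{\theta}(s')\big) \Big] \quad\text{(parameterized family)},
\]
where each control-policy function \(\pi_{\theta}\), \(\theta \in \Theta\), maps every state to an action, so that optimality is sought over the parameter set \(\Theta\), which spans all states, rather than over independent per-state action choices.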
Generally speaking, the basic idea of learning a parameterized policy within an MDP framework to reduce sample complexity is not new. One popular such approach concerns policy gradient methods, where gradient ascent of the value function over a space of policies is used together with projection to obtain an optimal policy. These ideas have been further refined in neural-network-based policy optimization approaches such as TRPO and PPO. In strong contrast, our proposed approach derives the optimal policy through control-policy functions that map estimates of a few global (and local) parameters to optimal control policies in an iterative manner, based on observations obtained from applying the control policy associated with the current parameter estimates, as illustrated in the sketch below. We next present the MDP framework supporting our general approach that directly learns the parameters of the optimal control policy, together with the corresponding theoretical results on convergence and optimality as well as a new form of policy-parameter gradient ascent. We refer to Ref. [3] for all proofs and additional details and references.
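To illustrate the iterative scheme just described, the following minimal sketch applies a control policy determined by a single global parameter, observes the resulting returns, and updates the parameter estimate by policy-parameter gradient ascent; the scalar linear system, the quadratic reward, the finite-difference gradient estimate, and all function and variable names are simplifying assumptions made purely for exposition and are not the precise formulation or algorithm of Ref. [3].

    import numpy as np

    rng = np.random.default_rng(0)

    def rollout_return(theta, horizon=50):
        """Observed return of the control policy u = -theta * s on a noisy scalar
        linear system; the dynamics are treated as a black box and are used here
        only to simulate trajectories."""
        a, b = 0.9, 0.5                        # unknown to the learner
        s, total = 1.0, 0.0
        for _ in range(horizon):
            u = -theta * s                     # control policy from the current parameter estimate
            total += -(s ** 2 + 0.1 * u ** 2)  # quadratic cost expressed as a reward
            s = a * s + b * u + 0.01 * rng.standard_normal()
        return total

    def policy_parameter_gradient_ascent(theta=0.0, step=0.05, eps=0.1, iters=100):
        """Iteratively update the policy parameter from observed returns, here via a
        simple finite-difference estimate of the return gradient."""
        for _ in range(iters):
            grad = (rollout_return(theta + eps) - rollout_return(theta - eps)) / (2 * eps)
            theta += step * grad               # ascend the estimated return surface
        return theta

    theta_star = policy_parameter_gradient_ascent()
    print(theta_star)

The key point is that only a few global parameters (here a single scalar gain) are estimated from observed trajectories, and the control policy applied at each iteration is obtained directly from the current parameter estimate, with no intermediate learning of a system dynamics model.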