Sequential Decision Making (SDM) With Long-Term Reward Estimates
Abstract
We study optimal control problems in which the actions taken at each time step depend on the behavior of a dynamical system, such as realized demand or process behavior in a plant. Traditional SDM methods may not account for the long-term effects of decisions or for changing conditions. In our framework, the objective function consists of two parts: the first is the cumulative reward obtained within the planning horizon under the recommended policy, and the second is a modifier term equal to the expected remaining reward after the recommended policy ends. This captures the holistic reward of the scenario, ensuring that the recommended policies not only perform well over the recommendation period but also leave the system in a favorable state for better future rewards. We benchmark the performance of our model on a use case from a processing plant.
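A minimal sketch of this two-part objective, in illustrative notation (the horizon $T$, per-step reward $r_t$, policy $\pi$, post-horizon state $s_{T+1}$, and tail-value estimate $V$ are assumptions for exposition, not definitions taken from the abstract):

\[
J(\pi) \;=\; \underbrace{\mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{T} r_t\Big]}_{\text{reward within the horizon}} \;+\; \underbrace{\mathbb{E}_{\pi}\big[\,V(s_{T+1})\,\big]}_{\text{expected remaining reward after the horizon}}
\]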