Publication
AAAI 2020
Conference paper
Uncorrected least-squares temporal difference with lambda-return
Abstract
Temporal difference learning, TD(λ), is a foundation of reinforcement learning and is also of interest in its own right for prediction tasks. Recently, true online TD(λ) has been shown to closely approximate the “forward view” at every step, while conventional TD(λ) does this only at the end of an episode. We re-examine least-squares temporal difference, LSTD(λ), which has been derived from conventional TD(λ). We design Uncorrected LSTD(λ) in such a way that, when λ = 1, Uncorrected LSTD(1) is equivalent to the least-squares method for the linear regression of Monte Carlo (MC) return at every step, while conventional LSTD(1) has this equivalence only at the end of an episode, since the MC return is corrected to be unbiased. We prove that Uncorrected LSTD(λ) can have smaller variance than conventional LSTD(λ), and this allows Uncorrected LSTD(λ) to sometimes outperform conventional LSTD(λ) in practice. When λ = 0, however, Uncorrected LSTD(0) is not equivalent to LSTD. We thus also propose Mixed LSTD(λ), which matches conventional LSTD(λ) at λ = 0 and Uncorrected LSTD(λ) at λ = 1. In numerical experiments, we study how the three LSTD(λ) variants behave under limited training data.
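The end-of-episode equivalence mentioned in the abstract can be illustrated with a minimal sketch of conventional LSTD(λ) using accumulating eligibility traces, compared against least-squares regression of the MC return. This is not the paper's Uncorrected or Mixed LSTD(λ), whose update rules the abstract does not give. The function names (`solve`, `lstd_lambda`, `mc_regression`), the three-step toy episode, the feature vectors, and the discount factor 0.9 are all assumptions made for illustration, and the terminal state is assumed to have an all-zero feature vector.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [[float(x) for x in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def lstd_lambda(episode, gamma, lam, dim):
    """Conventional LSTD(lambda) on one episode.

    episode: list of (phi, reward, phi_next) tuples; phi_next is assumed
    to be all zeros on the terminating transition.
    """
    A = [[0.0] * dim for _ in range(dim)]
    b = [0.0] * dim
    z = [0.0] * dim  # accumulating eligibility trace
    for phi, r, phi_next in episode:
        z = [gamma * lam * zi + pi for zi, pi in zip(z, phi)]
        d = [p - gamma * pn for p, pn in zip(phi, phi_next)]
        for i in range(dim):
            b[i] += z[i] * r
            for j in range(dim):
                A[i][j] += z[i] * d[j]
    return solve(A, b)


def mc_regression(episode, gamma, dim):
    """Least-squares linear regression of the discounted MC return."""
    A = [[0.0] * dim for _ in range(dim)]
    b = [0.0] * dim
    G = 0.0
    for phi, r, _ in reversed(episode):
        G = r + gamma * G  # return from this step to the end of the episode
        for i in range(dim):
            b[i] += phi[i] * G
            for j in range(dim):
                A[i][j] += phi[i] * phi[j]
    return solve(A, b)


# Hypothetical three-step episode with two-dimensional features.
episode = [
    ([1.0, 0.0], 1.0, [0.0, 1.0]),
    ([0.0, 1.0], 0.0, [1.0, 1.0]),
    ([1.0, 1.0], 2.0, [0.0, 0.0]),  # terminal transition
]
theta_lstd1 = lstd_lambda(episode, gamma=0.9, lam=1.0, dim=2)
theta_mc = mc_regression(episode, gamma=0.9, dim=2)
```

Under these assumptions, the two weight vectors agree up to floating-point error once the full episode has been processed, which matches the abstract's statement about conventional LSTD(1) at the end of an episode. Stopping the loop in `lstd_lambda` before the terminal transition would in general break the agreement, and that mid-episode gap is what Uncorrected LSTD(1) is designed to close.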