AAAI 2020
Conference paper

Uncorrected least-squares temporal difference with lambda-return

Download paper


Temporal difference, TD(λ), learning is a foundation of reinforcement learning and also of interest in its own right for the tasks of prediction. Recently, true online TD(λ) has been shown to closely approximate the “forward view” at every step, while conventional TD(λ) does this only at the end of an episode. We re-examine least-squares temporal difference, LSTD(λ), which has been derived from conventional TD(λ). We design Uncorrected LSTD(λ) in such a way that, when λ = 1, Uncorrected LSTD(1) is equivalent to the least-squares method for the linear regression of Monte Carlo (MC) return at every step, while conventional LSTD(1) has this equivalence only at the end of an episode, since the MC return is corrected to be unbiased. We prove that Uncorrected LSTD(λ) can have smaller variance than conventional LSTD(λ), and this allows Uncorrected LSTD(λ) to sometimes outperform conventional LSTD(λ) in practice. When λ = 0, however, Uncorrected LSTD(0) is not equivalent to LSTD. We thus also propose Mixed LSTD(λ), which matches conventional LSTD(λ) at λ = 0 and Uncorrected LSTD(λ) at λ = 1. In numerical experiments, we study how the three LSTD(λ)s behave under limited training data.


07 Feb 2020


AAAI 2020