Retrosynthesis is an important problem in chemistry and represents an interesting challenge for AI since it involves predictions over sets of complex molecular graph structures. Recently, a wealth of models, ranging from language models to graph neural networks, has been proposed. However, most studies evaluate on a single dataset and split only, focus on top-1 accuracy, and provide little insight into the actual capabilities of individual models. This prevents research from moving forward since the issues to be addressed by future work are not identified. In this paper, we focus on evaluation: we show that the currently used data is not suited to testing generalization, one of the main goals stated in the literature; propose new splits of the USPTO reactions that model various scenarios; study representatives of the main types of models on this data; and finally present what is, to the best of our knowledge, the first evaluation and comparison of these models in the multi-step scenario. Altogether, we show that the picture is more diverse than the results on the commonly used USPTO-50k data suggest.