Self-supervised pre-trained models have consistently delivered state-of-the-art results in natural language and speech processing. However, we argue that their merits for modeling Turn-Taking in spoken dialogue systems still need further investigation. To this end, we introduce a modular End-to-End system based on an Upstream + Downstream architecture paradigm, which allows easy integration of a large variety of self-supervised features to model the specific Turn-Taking task of End-of-Turn Detection (EOTD). Several architectures for modeling the EOTD task using audio-only, text-only, and audio+text modalities are presented, and their performance and robustness are carefully evaluated on three different human-to-human spoken dialogue datasets. The proposed model not only achieves state-of-the-art results for EOTD, but also sheds light on the potential of powerful, well fine-tuned self-supervised models to be used successfully for a wide variety of Turn-Taking tasks.