Recent advances in wearable sensor technologies have led to a variety of approaches for detecting physiological stress. Even with over a decade of research in the domain, there still exist many significant challenges, including a near-Total lack of reproducibility across studies. Researchers often use some physiological sensors (custom-made or off-The-shelf), conduct a study to collect data, and build machine-learning models to detect stress there is little effort to test the applicability of the model with similar physiological data collected from different devices, or the efficacy of the model on data collected from different studies, populations, or demographics. This paper takes the first step towards testing reproducibility and validity of methods and machine-learning models for stress detection. To this end, we analyzed data from 90 participants, from four independent controlled studies, using two different types of sensors, with different study protocols and research goals. We started by evaluating the performance of models built using data from one study and tested on data from other studies. Next, we evaluated new methods to improve the performance of stress-detection models and found that our methods led to a consistent increase in performance across all studies, irrespective of the device type, sensor type, or the type of stressor. Finally, we developed and evaluated a clustering approach to determine the stressed/not-stressed classification when applying models on data from different studies, and found that our approach performed better than selecting a threshold based on training data. This paper's thorough exploration of reproducibility in a controlled environment provides a critical foundation for deeper study of such methods, and is a prerequisite for tackling reproducibility in free-living conditions.