The hypothesis that computational models can be reliable enough to be adopted in prognosis and patient care is revolutionizing healthcare. Deep learning, in particular, has been a game changer in building predictive models, which in turn has led to community-wide data curation efforts. However, due to inherent variability in population characteristics and biological systems, these models are often biased toward their training datasets. This can be limiting when models are deployed in new environments where systematic domain shifts are not known a priori. In this paper, we propose to emulate, using a given dataset, a large class of domain shifts that can occur in clinical settings, and argue that evaluating the behavior of predictive models under those shifts is an effective way to quantify their reliability. More specifically, we develop an approach for building realistic scenarios based on an analysis of disease landscapes in multi-label classification. Using the openly available MIMIC-III EHR dataset for phenotyping, our work sheds light, for the first time, on data regimes where deep clinical models can fail to generalize. This work emphasizes the need for novel validation mechanisms driven by real-world domain shifts in AI for healthcare.