With an increasing number of open-source deep learning (DL) software tools becoming available, benchmarking DL software frameworks and systems is in high demand. This paper presents design considerations, metrics, and challenges in developing an effective benchmark for DL software frameworks and illustrates our observations through a comparative study of three popular DL frameworks: TensorFlow, Caffe, and Torch. First, we show that these DL frameworks are optimized with their default configuration settings. However, a default configuration optimized on one specific dataset may not work effectively for other datasets with respect to runtime performance and learning accuracy. Second, the default configuration optimized on a dataset by one DL framework does not work well for another DL framework on the same dataset. Third, experiments show that different DL frameworks exhibit different levels of robustness against adversarial examples. Through this study, we envision that, unlike traditional performance-driven benchmarks, benchmarking DL software frameworks should take into account both runtime and accuracy, as well as their latent interaction with hyper-parameters and data-dependent configurations of DL frameworks.