Humans outperform object recognizers despite the fact that models perform well on current datasets. Numerous efforts exist to make more challenging datasets by scaling up on the web, exploring distribution shift, or adding controls for biases. The difficulty of each image in each dataset is not independently evaluated, nor is the concept of dataset difficulty as a whole currently well defined. We develop a new dataset difficulty metric based on how long humans must view an image in order to classify a target object. Images whose objects can be recognized in 17ms are considered to be easier than those which require seconds of viewing time. Using 133,588 judgments on two major datasets, ImageNet and ObjectNet, we determine the distribution of image difficulties in those datasets, which we find varies wildly, but significantly undersamples hard images. Rather than hoping that distribution shift will lead to hard datasets, we should explicitly measure their difficulty. Analyzing model performance guided by image difficulty reveals that models tend to have lower performance and a larger generalization gap on harder images. We release a dataset of difficulty judgments as a complementary metric to raw performance and other behavioral/neural metrics. Such experiments with humans allow us to create a metric for progress in object recognition datasets. This metric can be used to both test the biological validity of models in a novel way, and develop tools to fill out the missing class of hard examples as datasets are being gathered.