Image datasets play a pivotal role in advancing computer vision and multimedia research. However, most datasets are created through extensive human effort and are extremely expensive to scale up. To address this issue, several automatic and semi-automatic approaches have been proposed that build datasets by refining web images. These approaches, however, either include significant redundancy in the dataset or fail to provide a set diverse enough to train a robust classifier. Ideally, a representative subset should be both semantically and visually diverse, so that it conveys the maximum amount of information under a given labeling budget. Most current approaches rely entirely on visual features, which may not correlate well with image semantics; consequently, the collected images may not suffice for a detailed understanding of a category. In this paper, we propose a system for creating diverse image dataset collections from the web with limited manual labeling effort. It is based on a semi-supervised sparse coding framework that operates in a joint visual-semantic space, simultaneously exploiting both the images and the associated textual information from the web. In addition, the proposed system is online: it can continuously collect more discriminative images as new data becomes available, which also makes it suitable for enriching existing datasets. Experiments demonstrate that our system creates and enriches datasets with limited manual labeling, yielding better cross-dataset generalization and greater diversity than state-of-the-art datasets.
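To make the joint visual-semantic sparse coding idea concrete, here is a minimal sketch, not the paper's actual implementation: a visual descriptor and a textual (tag) embedding are concatenated into one joint vector, which is then sparsely coded against a dictionary via ISTA. The feature dimensions, the random dictionary, and the regularization weight are all illustrative assumptions.

```python
import numpy as np

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by iterative
    shrinkage-thresholding (ISTA), starting from a = 0."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of the quadratic term
        z = a - grad / L                   # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
visual = rng.normal(size=64)      # stand-in for a visual descriptor
textual = rng.normal(size=32)     # stand-in for an embedding of web tags
x = np.concatenate([visual, textual])   # joint visual-semantic vector
D = rng.normal(size=(96, 48))           # hypothetical learned dictionary
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
a = ista_sparse_code(x, D, lam=0.5)
print(np.count_nonzero(a), "of", a.size, "atoms active")
```

Because ISTA monotonically decreases the objective from the all-zero code, the sparse code never reconstructs worse than the zero code, while the l1 penalty keeps only a subset of dictionary atoms active.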