Human annotation in large-scale image databases is time-consuming and error-prone. Since it is very hard to mine image databases using visual features or textual descriptors alone, it is common to transform the image features into a semantically meaningful space. In this paper, we propose to perform image annotation in a semantic space inferred from sparse representations. By constructing a semantic embedding of the visual features that is constrained to be close to the tag embedding, we show that a robust inverse map can be used to predict the tags. Experiments on standard datasets show the effectiveness of the proposed approach for automatic image annotation when compared to existing methods.
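The pipeline sketched in the abstract can be illustrated as follows. This is a minimal toy sketch, not the paper's actual method: it assumes synthetic visual features and tags, a random (rather than learned) dictionary, scikit-learn's `SparseCoder` for the sparse representation, and a ridge-regression map standing in for the learned inverse map from the embedding to the tag space.

```python
import numpy as np
from sklearn.decomposition import SparseCoder
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy data: visual features X (n_images x d) and a binary tag matrix T
# (n_images x n_tags). Real experiments would use extracted image features
# and ground-truth tag annotations.
n, d, n_tags, n_atoms = 40, 16, 5, 12
X = rng.normal(size=(n, d))
T = (rng.random(size=(n, n_tags)) < 0.3).astype(float)

# Dictionary of unit-norm atoms for the sparse representation
# (random here; in practice it would be learned from data).
D = rng.normal(size=(n_atoms, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Sparse codes of the visual features -- a stand-in for the inferred
# semantic embedding.
coder = SparseCoder(dictionary=D, transform_algorithm="omp",
                    transform_n_nonzero_coefs=4)
A = coder.transform(X)  # shape (n, n_atoms)

# A linear map from the embedding to the tag space, fit by ridge
# regression; applying it to new codes plays the role of the inverse map.
reg = Ridge(alpha=1.0).fit(A, T)

# Annotate a new image: sparse-code it, map to tag space, rank tag scores.
x_new = rng.normal(size=(1, d))
scores = reg.predict(coder.transform(x_new))[0]
predicted = np.argsort(scores)[::-1][:3]  # top-3 candidate tags
print(predicted)
```

The thresholded (or top-k) scores give the predicted annotations; the closeness constraint between the visual and tag embeddings in the paper would replace the simple regularized regression used here.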