In many important application domains, such as text categorization, scene classification, biomolecular analysis, and medical diagnosis, examples are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to substantial research in multi-label classification. To evaluate and compare multi-label classifiers, researchers have adapted evaluation measures from the single-label paradigm, such as Precision and Recall, and have also developed many measures specifically for the multi-label paradigm, such as Hamming Loss and Subset Accuracy. However, these evaluation measures have been applied arbitrarily in multi-label classification experiments, without an objective analysis of their correlation or bias. This can lead to misleading conclusions, as experimental results may appear to favor a particular classifier behavior depending on the subset of measures chosen. Moreover, because different papers in the area employ distinct subsets of measures, it is difficult to compare results across papers. In this work, we provide a thorough analysis of multi-label evaluation measures and give concrete suggestions to help researchers make an informed decision when choosing evaluation measures for multi-label classification.
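As a concrete illustration of two of the multi-label measures named above, the sketch below computes Hamming Loss and Subset Accuracy from 0/1 label indicator matrices. The function names and toy data are our own; the code follows the standard definitions of these measures and is not an implementation from this work.

```python
# Illustrative implementations of two multi-label evaluation measures.
# Each example's labels are given as a 0/1 indicator list.

def hamming_loss(y_true, y_pred):
    # Fraction of individual label positions where prediction and
    # truth disagree, averaged over all examples and all labels.
    n_examples = len(y_true)
    n_labels = len(y_true[0])
    errors = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return errors / (n_examples * n_labels)

def subset_accuracy(y_true, y_pred):
    # Fraction of examples whose predicted label set matches the true
    # label set exactly: a strict, all-or-nothing measure.
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Toy data: 2 examples, 3 labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
print(hamming_loss(y_true, y_pred))    # 1 wrong position out of 6
print(subset_accuracy(y_true, y_pred)) # 1 of 2 examples exact
```

The contrast between the two outputs on the same predictions (one wrong label out of six versus one fully correct example out of two) hints at why the choice of measures can change which classifier behavior an experiment appears to favor.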