Large-scale multi-label text datasets with many classes are expensive to annotate, all the more so when they belong to specialized language domains. In this work, we build classifiers for such datasets using Active Learning in order to reduce the labeling effort. We outline the challenges of extreme multi-label settings and show the limitations of existing pool-based Active Learning strategies, considering both their effectiveness and their computational cost. In addition, we present five multi-label datasets, compiled from hierarchical classification tasks, to serve as benchmarks for future experiments in extreme multi-label classification. Finally, we provide insights into multi-class, multi-label evaluation and present an improved classifier architecture on top of pre-trained transformer language models.