Information-theoretic clustering is one of the most promising and principled approaches to finding clusters with minimal apriori assumptions. The key criterion therein is to maximize the mutual information between the data points and their cluster labels. Such an approach, however, does not explicitly promote any type of inter-cluster behavior. We instead propose to maximize the Kullback-Leibler divergence between the underlying data distributions associated to clusters (referred to as cluster distributions). We show it to entail the mutual information criterion along with maximizing cross entropy between the cluster distributions. For practical efficiency, we propose to empirically estimate the objective of KL-D between clusters in its dual form leveraging deep neural nets as a dual function approximator. Remarkably, our theoretical analysis establishes that estimating the divergence measure in its dual form simplifies the problem of clustering to one of optimally finding k − 1 cut points for k clusters in the 1-D dual functional space. Overall, our approach enables linear-time clustering algorithms with theoretical guarantees of near-optimality, owing to the submodularity of the objective. We show the empirical superiority of our approach w.r.t. current state-of-the-art methods on the challenging task of clustering noisy timeseries as observed in domains such as neuroscience, healthcare, financial markets, spatio-temporal environmental dynamics, etc.