Graph analytics are arguably one of the most demanding workloads for high-performance systems and interconnection networks. Graph applications often display all-to-all, fine-grained, high-rate communication patterns that expose the limits of the network protocol stacks. Load and communication imbalance generate hard-to-predict network hot-spots, and may require computational steering due to unpredictable data distributions. In this paper we present a lightweight communication library, implemented "on the metal" of BlueGene/Q and POWER7 IH that we have used to support large-scale graph algorithms up to 96K processing nodes and 6 million threads. With this library we have explored several optimization techniques, including overlapped communication, non-blocking collectives, message aggregation, and computation in the network for special collective communication patterns, such as parallel prefix. The experimental results show significant performance improvements, ranging from 5X to 10X, when compared to equally optimized MPI implementations.