As computation-intensive workloads such as deep learning and big-data analytics increasingly rely on GPU-based accelerators, the interconnect links may become a performance bottleneck. In this paper, we investigate the emerging performance bottleneck of multi-accelerator systems as the number of accelerators attached to a single host grows. We instrumented the host PCIe fabric to measure data transfers and compared the results with measurements from software profiling tools. The analysis shows that peer-to-peer (P2P) data transfers help avoid the bottleneck on the interconnect links, yet multi-GPU performance still does not scale as expected because of host control messages. We quantify the impact of these control messages and suggest remedies for the scalability bottleneck. We also implement the proposed strategy in LULESH to validate the concept. The results show that our strategy reduces kernel execution time by 59.86% and PCIe host-to-device (H2D) payload by 13.32%.