Today, designers of network processors strive to keep the packet reception and transmission orders identical, and therefore avoid any possible out-of-order transmission. However, the development of new features in advanced network processors has resulted in increasingly parallel architectures and increasingly heterogeneous packet processing times, leading to large reordering delays. In this paper, we introduce novel scalable scheduling algorithms for preserving flow order in parallel multi-core network processors. We show how these algorithms can reduce reordering delay while adapting to any load-balancing algorithm and keeping a low implementation complexity overhead. To do so, we use the observation that all packets in a given flow have similar processing requirements and can be described with a constant number of logical processing phases. We further define three possible knowledge frameworks of the time when a network processor learns about these logical phases, and deduce appropriate algorithms for each of these frameworks. Finally, we model our proposed algorithms and simulate them under both synthetic traffic and real-life traces, and show that they significantly outperform past approaches.