Scalable instruction-level parallelism through tree-instructions
Abstract
We describe a representation of instruction-level parallelism which does not require checking dependencies at run-time, and which is suitable for processor implementations with varying issue-width. In this approach, a program is represented as a sequence of tree-instructions, each containing multiple primitive operations. These tree-instructions are executable either in one or multiple cycles, do not require a specific processor organization, and are generated assuming an implementation capable of large instruction-level parallelism. During instruction cache reloading/accessing, tree-instructions are decomposed into subtrees which fit the actual resources available in an implementation. The resulting subtrees require a simple instruction-dispatch mechanism, as in the case of statically scheduled processors. The representation makes practical the use of the same parallelized code in implementations with different issue-width; simulation results indicate that the instruction-level parallelism achievable with this approach degrades less than 10% with respect to code compiled for each specific implementation.