Abstracts
Chip-level Integration Tradeoffs: the New Frontier for Microprocessor Design
Jaime H. Moreno, IBM T.J. Watson Research Center
Microelectronic technology has been a fundamental factor in microprocessor performance growth throughout the years. Smaller, faster transistors and larger chips have made possible the development of ever more complex and powerful microprocessors that exploit techniques such as deeper pipelines, instruction- and data-level parallelism, thread-level parallelism, speculative execution, and wide instruction issue. However, such growth has recently faced serious challenges due to limits in power consumption and power density. While the search for solutions to those microelectronic technology challenges is ongoing, a new trend has emerged that targets performance improvements through the integration of functionality at the chip level, instead of focusing on a single processor core. The range of opportunities made possible by chip-level integration is extremely broad and complex, and represents a new frontier for microprocessor design, with increased interaction between microelectronics, circuits, microarchitecture, architecture, compiler, and system techniques. These interactions lead to challenging tradeoffs among the possible alternatives. Examples include the characteristics of the individual processors within a multicore chip (symmetric or asymmetric in functionality; with the same instruction-set architecture, with instruction extensions, or with completely different instruction-set architectures); the interconnection fabric on-chip and to the external world; the power/performance features of these components; the on-chip memory hierarchy; and the use of on-chip accelerators. As a result, the limitations encountered by the traditional techniques used for microprocessors until recently have created an immense space of opportunities. However, there are also important restrictions on exploiting these opportunities in general-purpose processors, in particular for features visible at the architecture level, due to the legacy implications of such innovations.
In this talk, we review our reasoning behind the need for evolution in chip-level integration for microprocessors, and the challenges found in pursuing such objectives. We summarize the need for identifying emerging tradeoffs, beyond those found in straightforward multicore chips, as well as the need to determine how these tradeoffs can be investigated and how proposed innovations can be incorporated through chip-level integration. This talk describes challenges rather than proposing specific solutions, and illustrates those challenges with some examples; in doing so, it describes the opportunities in this emerging area, which appears ready for new research and development.
Everything You Always Wanted to Know about SCALABILITY and Were Afraid to Ask
Ronny Ronen, Intel Israel
In the past, microprocessors followed Moore's law: process technology gave us smaller, faster, and lower-power transistors every two years or so, and we used them to create bigger processors. In this talk, I will discuss what scalability means in that context, which metrics we used (performance, power, die size, complexity, effort), what "creates" scalability (process, architecture, micro-architecture, etc.), and what the expected trends are.
Dynamically Controlled Resource Allocation in SMT Processors
Francisco J. Cazorla, UPC, Enrique Fernandez, University of Las Palmas de Gran Canaria, Alex Ramirez, UPC, Mateo Valero, UPC
SMT processors increase performance by executing instructions from several threads simultaneously. These threads use the resources of the processor better by sharing them but, at the same time, threads are competing for these resources. The way critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy that decides which threads enter the processor to compete for resources. However, current fetch policies only use indirect indicators of resource usage in their decision, which can lead to resource monopolization by a single thread or to resource waste when no thread can use them. Both situations can harm performance and happen, for example, after an L2 cache miss.
In this paper, we introduce the concept of dynamic resource control in SMT processors. Using this concept, we propose a novel resource allocation policy for SMT processors. This policy directly monitors the usage of resources by each thread and guarantees that all threads get their fair share of the critical shared resources, avoiding monopolization. We also define a mechanism to allow a thread to borrow resources from another thread if that thread does not require them, thereby reducing resource under-use. Simulation results show that our dynamic resource allocation policy outperforms a static resource allocation policy by 8% on average. It also improves on the best dynamic resource-conscious fetch policies, such as FLUSH++, by 4% on average, using the harmonic mean as a metric. This indicates that our policy does not obtain the ILP boost by unfairly favoring high-ILP threads over slow memory-bound threads. Instead, it achieves a better throughput-fairness balance.
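The fair-share-plus-borrowing idea and the harmonic-mean metric can be illustrated with a minimal sketch. This is not the authors' policy: the function names, the equal-share rule, the round-robin lending order, and the use of per-thread relative IPC as the input to the harmonic mean are all assumptions made for illustration.

```python
def allocate(total, demand):
    """Split `total` entries of one shared resource among threads.

    Hypothetical policy: every thread is first granted up to an equal share;
    entries left unused by low-demand threads are then lent, one at a time
    and round-robin, to threads that still want more.
    """
    n = len(demand)
    share = total // n
    grant = [min(d, share) for d in demand]
    spare = total - sum(grant)
    while spare > 0:
        wanting = [i for i in range(n) if grant[i] < demand[i]]
        if not wanting:
            break  # nobody can use the remaining entries
        for i in wanting:
            if spare == 0:
                break
            grant[i] += 1
            spare -= 1
    return grant


def harmonic_mean(relative_ipcs):
    """Harmonic mean of per-thread relative IPCs (assumed metric input).

    A single very slow thread pulls this mean down sharply, so it rewards
    balanced progress rather than raw throughput alone.
    """
    return len(relative_ipcs) / sum(1.0 / r for r in relative_ipcs)
```

For example, `allocate(32, [40, 4])` lets the first thread borrow the twelve entries the second thread does not need, while `allocate(32, [40, 40])` holds both threads to sixteen entries each.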
Inthreads: Code Generation and Implementation Notes
Alex Gontmakher, Assaf Schuster, Gregory Shklover, Computer Science department, Technion
The Inthreads programming model allows for extremely low-granularity code parallelization. Inthreads are lightweight threads running simultaneously inside the processor, which share all of the processor state, including the architectural register file.
The context sharing poses new challenges in the implementation of Inthreads. In our talk, we will describe our approach to the code generation and show some of the microarchitectural mechanisms for Inthreads support.
Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL
Tevi Devor, Intel Israel
IA-32EL is the Intel dynamic binary translator from IA-32 to IPF (Itanium Processor Family), which enables applications compiled for IA-32 to run on the IPF, by dynamically translating all IA-32 instructions in the application to IPF instructions, so that only IPF instructions are executed.
This presentation gives an overview of IA-32EL, and discusses in detail the insertion of IPF data prefetch instructions into the translated code in order to reduce cache misses and increase performance.
An Efficient Parallel Heap Compaction Algorithm
Diab Abuaiadh, Yoav Ossia, Erez Petrank, Uri Silbershtein, IBM Haifa Labs
We propose a heap compaction algorithm appropriate for modern computing environments. Our algorithm is targeted at SMP platforms. It demonstrates high scalability when running in parallel but is also extremely efficient when running single-threaded on a uniprocessor. Instead of using the standard forwarding pointer mechanism for updating pointers to moved objects, the algorithm saves information for a pack of objects. It then does a small computation to process this information and determine each object's new location. In addition, using a smart parallel moving strategy, the algorithm achieves (almost) perfect compaction in the lower addresses of the heap, whereas previous algorithms achieved parallelism by compacting within several predetermined segments. Next, we investigate a method that trades compaction quality for a further reduction in time and space overhead. Finally, we propose a modern version of the two-finger compaction algorithm. This algorithm fails, thereby re-validating the traditional wisdom that retaining the order of live objects significantly improves the quality of the compaction.
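The abstract does not say what information is saved per pack of objects, so the following is only one plausible reading, sketched under stated assumptions: a mark bit per heap word plus one offset per fixed-size block, from which an object's new address is computed on demand instead of being read from a forwarding pointer. The block size and all names are hypothetical.

```python
BLOCK = 8  # words per block; a hypothetical value chosen for the example


def build_offsets(mark, block=BLOCK):
    """Return, for each block, the number of live words that precede it.

    `mark[i]` is True when heap word i belongs to a live object. After
    sliding compaction, the first live word of block b lands at offsets[b].
    """
    offsets = []
    live = 0
    for start in range(0, len(mark), block):
        offsets.append(live)
        live += sum(mark[start:start + block])
    return offsets


def new_address(addr, mark, offsets, block=BLOCK):
    """Compute where the live word at `addr` moves to.

    Only the block's saved offset and the mark bits inside that block are
    consulted, so no per-object forwarding pointer is stored.
    """
    b = addr // block
    return offsets[b] + sum(mark[b * block:addr])
```

With live words at addresses 2, 3, 9, 10, 11, and 15 of a 16-word heap, the computed new addresses are 0 through 5, which keeps the live objects in their original order and packs them into the low end of the heap.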
The parallel compaction algorithm was implemented on the IBM production Java Virtual Machine. We provide measurements demonstrating high efficiency and scalability. Subsequently, this algorithm has been incorporated into the IBM production JVM.
Keynote: Trends in Compilers and Programming Languages at Apple Computer
Ted Goldstein, Apple Computer
Ted Goldstein is currently Vice President of Development Technologies at Apple Computer where he oversees programming languages, developer tools and web application servers. Mr. Goldstein is an innovator in electronic commerce, programming environments and smart card technologies. He holds eight patents related to programming language design and implementation and E-commerce technologies. Previously, he held the titles of CTO and Vice President of Engineering at The Brodia Group and later he was Vice President of Business and Technology at ActivCard. As Chief Java Commerce Officer at Sun Microsystems, Mr. Goldstein spearheaded development of Java Card, the most popular smart card platform. While at Xerox's Palo Alto Research Center, Mr. Goldstein was a cofounder and member of the managing team that spun off ParcPlace Systems, the inventors of the Smalltalk system. Mr. Goldstein holds a Bachelor of Arts degree in Computer and Information Science from the University of California at Santa Cruz.
Autovectorization in GCC
Dorit Naishlos, IBM Haifa Labs
Vectorization is an optimization technique that has traditionally targeted vector processors. The importance of this optimization has increased in recent years with the introduction of SIMD (Single Instruction Multiple Data) extensions to general purpose processors, and with the growing significance of applications that can benefit from this functionality. With the adoption of the new Tree SSA optimization framework, GCC is ready to take on the challenge of automatic vectorization. In this talk we describe the design and implementation of a loop-based vectorizer in GCC. We discuss the new issues that arise when vectorizing for SIMD extensions as opposed to traditional vectorization, and the challenges of doing that in a multi-platform compiler such as GCC.
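As a rough illustration of what a loop-based vectorizer does, the sketch below strip-mines an element-wise addition into fixed-width chunks followed by a scalar epilogue for the leftover iterations. It is not GCC's implementation: Python list slices stand in for SIMD registers, and the vectorization factor `VF` and the function names are assumptions.

```python
VF = 4  # hypothetical vectorization factor (elements per SIMD register)


def add_scalar(a, b):
    """Original loop: one element per iteration."""
    return [a[i] + b[i] for i in range(len(a))]


def add_strip_mined(a, b, vf=VF):
    """Same computation reshaped the way a loop vectorizer would reshape it."""
    n = len(a)
    out = [0] * n
    main = n - n % vf  # iterations covered by whole vectors
    for i in range(0, main, vf):
        # Each slice models one vector load, one vector add, one vector store.
        out[i:i + vf] = [x + y for x, y in zip(a[i:i + vf], b[i:i + vf])]
    for i in range(main, n):
        # Scalar epilogue handles the remaining n % vf elements.
        out[i] = a[i] + b[i]
    return out
```

The epilogue is one of the issues specific to fixed-width SIMD extensions: the trip count is rarely a multiple of the vector width, so the transformed loop must still handle the remainder correctly.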
Validating the Translation of an Industrial Optimizing Compiler
Ira Gordin, Raya Leviathan, Dept. of Computer Science, Weizmann Institute of Science and Amir Pnueli, New York University
The paper presents an approach to the translation validation of an optimizing compiler which translates synchronous C programs into machine code programs. Being synchronous means that both source and target programs are loop free. This enables representation of each of these programs by a single state transformer, and verification of the translation correctness is based on comparison of the source and target state transformers.
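The idea of comparing state transformers can be sketched as follows. Because both programs are loop free, each can be modelled as a single function from an input state to an output state. The two programs below are invented examples, and exhaustive enumeration over a small domain is a stand-in chosen for brevity; it is not claimed to be how the described tool establishes equivalence.

```python
from itertools import product


def source(state):
    """Hypothetical loop-free source program as a state transformer."""
    x, y = state["x"], state["y"]
    t = x * 2
    if y > 0:
        z = t + y
    else:
        z = t
    return {"x": x, "y": y, "z": z}


def target(state):
    """Hypothetical optimized translation: shift instead of multiply, no branch on z."""
    x, y = state["x"], state["y"]
    z = (x << 1) + (y if y > 0 else 0)
    return {"x": x, "y": y, "z": z}


def validate(src, tgt, domain):
    """Return an input state on which the transformers disagree, or None."""
    for x, y in product(domain, repeat=2):
        s = {"x": x, "y": y}
        if src(dict(s)) != tgt(dict(s)):
            return s
    return None
```

Calling `validate(source, target, range(-4, 5))` finds no disagreement on that domain, whereas a translation that dropped the dependence on `y` would be rejected with a concrete counterexample state.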
The approach has been implemented in a tool called MCVT, which is also described.
Stabilization Preserving and Enforcing Compiler
Olga Brukman, Shlomi Dolev, Yinnon Haviv, Computer Science department, Ben-Gurion University and Mooly Sagiv, Computer Science department, Tel Aviv University
Self-stabilization is an elegant approach for designing fault-tolerant systems. A system is considered self-stabilizing if, started in any state, it eventually converges to the desired behavior. Since the concept was originally defined by Dijkstra in 1974, self-stabilizing algorithms have been extensively studied, and many algorithms have been designed for solving fundamental distributed tasks, such as leader election, token circulation, consensus, etc. These algorithms were expressed using guarded commands or pseudo-code. The realization of these algorithms requires the existence of infrastructure for their execution, namely, self-stabilizing hardware, a self-stabilizing operating system, etc. Moreover, these algorithms need to be converted into the machine language of the self-stabilizing infrastructure, e.g., the machine code of a self-stabilizing microprocessor. In this work, we present a self-stabilization preserving compiler for supporting the implementation of self-stabilizing systems. We start by defining the task of such a compiler and how it differs from the known compiler design. We explain our choice of ASM as a language for writing self-stabilizing algorithms, and continue with a bird's-eye view of the compiler implementation. In addition, we will present compilation techniques for designing recovery-oriented code. The compiler is designed to compile program (safety and liveness) invariants such that the invariants are not violated during run-time and are enforced following transient faults.
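To make the definition of self-stabilization concrete, here is a small simulation of Dijkstra's classic K-state token ring, the kind of guarded-command algorithm mentioned above. It illustrates the property only; it is unrelated to the compiler described in the talk, and the random scheduler, the step limit, and the function names are assumptions.

```python
import random


def privileged(s):
    """Indices of processes whose guard holds (those that hold a 'token')."""
    return [i for i in range(len(s))
            if (s[i] == s[-1] if i == 0 else s[i] != s[i - 1])]


def step(s, i, k):
    """Execute the guarded command of process i on ring state s."""
    if i == 0:
        s[0] = (s[0] + 1) % k   # the distinguished process advances its counter
    else:
        s[i] = s[i - 1]         # every other process copies its left neighbour


def stabilize(s, k, rng, max_steps=10_000):
    """Run from an arbitrary state until exactly one process is privileged.

    Requires k >= len(s). Returns the number of moves taken.
    """
    for moves in range(max_steps):
        p = privileged(s)
        if len(p) == 1:
            return moves
        step(s, rng.choice(p), k)  # scheduler picks any privileged process
    raise RuntimeError("did not converge within max_steps")
```

Starting from any assignment of counters, for instance one left behind by a transient fault, the ring reaches a state with a single token, and every later move passes that token on without creating a second one.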