|
|
Abstracts
| |
EXPRESS: A Rapid Prototyping/Development Environment for Embedded Computer Systems
Alex Nicolau, University of California, Irvine
In this talk we will describe the EXPRESS environment developed at the University of California, Irvine over the last five years, drawing on our research experience in the preceeding decade.
EXPRESS combines a sophisticated, highly optimizing, highly retargetable compiler with a similarly retargetable simulation environment. Using a new architectural description language (EXPRESSION) the environment can be rapidly and accurately retargeted to a variety of computer systems, that include complex processing units (ILP superscalar/vliw highly-pipelined architectures), with complex memory organizations (multiple register files, on-chip and off chip directly addressable sram, various memory accessing modes, caches, etc) and idiosynchratic system organizations (co-processors, complex bus structures, special-purpose operations, add-on units, etc). EXPRESS therefore allows the quick generation of cycle-accurate simulators together with highly optimized compilers that "track" changes (introduced by the architect/designer) in the complex, realistic, architecture, and provides visual and quantitative tools to quickly evaluate cost/benefit ratios of such changes with respect to performance, power consumption and hardware/area cost. Thus, our environment holds, perhaps for the first time, the promise of truly meaningful architectural exploration and thereby potentially better system designs, while simultaneously (significantly) shortening the design cycle.
Data Cache Design and Evaluation for SMT Processors
Ron Pinter and Haggai Yedidya, Technion
Simultaneous Multi Threading (SMT) is a natural way to expand the capabilities of super-scalar processors, in which several threads simultaneously issue instructions that are executed by the processor's functional units. This structure helps to surpass the instruction level parallelism (ILP) limitation of a serial program, and to better utilize processor resources. It is reasonable to assume that a processor running several threads concurrently will have a different memory access pattern than a serial one. This assumption calls for cache optimization, targeted at this type of processors.
This work aims to find the optimal first level data cache configuration for SMT processors, using the building blocks of current caches. Different cache constructs are tested using two workload varieties: multi-process and multi-threaded. Experiments were conducted by running benchmarks on a performance simulator (SMTSIM). The results show how performance is affected by varying certain parameters, especially hash function, replacement policy, and level of associativity.
Subsetting SPEC When Measuring Results: Research vs. Industry
Daniel Citron, IBM Haifa Labs
A majority of the papers published in leading computer architecture conferences use SPEC CPU2000, which has become the de facto standard for measuring processor and/or memory-hierarchy performance. However, in most cases a subset of the suite's benchmarks are simulated.
This talk quantifies the extent of this phenomenon in the ISCA, Micro, and HPCA conferences: 209 papers were surveyed, 138 used benchmarks from SPEC CINT, but only 30 used the whole suite. If this current trend continues, by the year 2005 80\% of the papers will use the full CINT2000 suite, a year after CPU2004 shall be announced.
In addition, we will show that the disregard for the floating-point portion (CFP2000) of the suite is unwarranted, particularly in papers that explore the memory hierarchy. This is in direct contrast to the use of the CFP suite by industry.
Finally, we will attempt to extrapolate the use of CPU2000 by researchers to industry: How does subsetting influence SPECratios?
Are particular benchmarks more representative than others? And do the systems simulated by researchers reflect the trend in industry?
Schemes for the Efficient Concurrent Use of Multiple Prefetchers
Alex Gendler, Avi Mendelson, and Yitzhak Birk, EE Department, Technion, and Intel Haifa
As we approach the memory wall, much research effort is being aimed at improving the hardware and software prefetching mechanisms. Moreover, system designers are beginning to devote greater die-area and more power consumption for supporting multiple prefetching mechanisms on die. The use of a combination of sophisticated prefetching mechanisms yields significant performance gains for some important applications, but causes a major increase in bus traffic and in power consumption. Our research focuses on improving the cost-performance of systems that use multiple prefetchers. We introduce and explore schemes that employ a �feedback� mechanism: the success of each of the prfetching mechanism is monitored continuously, either globally or with finer resolution, and any given prefetcher is activated only when it appears successful during some time and/or address window.
Our presentation will focus on several techniques that aim to achieve the same performance as activating all the prefetchers, while significantly reducing the memory accesses, bus traffic and tag probes. We show that when applying the new proposed technique for prefetching from the main memory to the L2 cache, we can reduce the number of loads from the main memory by up to 25% without losing performance. When applying more sophisticated techniques to prefetchers between the L2- and the L1-cache, we can improve the performance (in terms of IPC) by 4% and reduce the loads bandwidth between the caches by a factor of 8.
Power Awareness through Selective Dynamically-Optimized Traces
Roni Rosner, Yoav Almog, Naftali Schwartz, Avi Mendelson, Micha Moffie, Ari Schmorak and Ronny Ronen, Microprocessor Research, Intel Labs, Haifa
This talk presents a new concept for designing high-performance, power-aware microarchitectures, called PARROT, and demonstrates the potential of dynamic optimizations within the PARROT microarchitectural framework. The PARROT microarchitecture identifies the frequently used (�hot�) traces and applies dynamic optimizations, implemented in hardware and executed at run time, to improve performance and energy consumption.
By decoupling trace-processing from the execution pipeline and selectively applying dynamic optimizations to the hottest traces we maximize the performance gain at any given power constraint, thus attaining finer control of tradeoffs between performance and power awareness. We demonstrate the advantage of PARROT methods as alternatives or in conjunction with conventional microarchitectural techniques.
We investigate the benefit of providing optimizations which although tightly coupled with the microarchitecture in substance are decoupled in time. The tight coupling in substance provides the potential for tailoring optimizations in a manner impossible or impractical not only for traditional static compilers but even for a JIT. We show that the contribution of common, generic optimizations to processor performance and energy efficiency may be more than doubled by creating a more intimate correlation between hardware specifics and the optimizer. In particular, dynamic optimizations can profit greatly from hardware supporting fused and SIMDified operations.
At the same time, the decoupling in time allows optimizations to be arbitrarily aggressive without significant performance loss. We demonstrate that requiring hundreds of repetitions before a trace is optimized sacrifices almost no performance or energy efficiency as compared with lower thresholds. These results confirm the feasibility of energy efficient hardware implementation of an aggressive optimizer.
Static Detection of Thread Local Storage in C
Yair Sade and Mooly Sagiv, Tel Aviv University
Memory management in multithreaded environments suffers from synchronization overhead. The memory allocation of one thread needs to be synchronized with all other threads. Advanced libraries for memory managers mitigate the overhead.
In this work we develop a framework for improving memory managers in multithreaded environments. A static analysis (compile-time) tool detects thread-local allocations, i.e., allocations of memory which are only used in a single thread. We develop dynamic an almost synchronization-free memory manager for the thread-local storage allocations.
Our initial experimental results show that thread-local storage improves runtime performance by 10-20% when two processors are used on malloc intensive benchmarks. We expect higher speedup when more processors will be used. Our current static analysis algorithms are not precise enough on large programs, i.e., they do not identify enough opportunities for thread-local storage. We are currently developing more precise (and potentially expensive) static estimation of thread-local storage.
Keynote: Kilo-instructions Processors
Mateo Valero, Universidad Politecnica de Catalunya (UPC), Spain
A promising approach for dealing with very long latency memory accesses (cache misses to main memory) is to dramatically increase the number of in-flight instructions in an out-of-order processor. Current processors commit instructions in program order. Consequently, a huge quantity of resources are needed to maintain thousands of instructions in flight. We need to do research in new techniques oriented to better use of resources. We observe that many inefficiencies can be eliminated if we change the model of in order commit of instructions. We need to design processors that support some form of out-of-order commit of instructions. But, of course, we also need to maintain precise exceptions.
To implement out-of-order instruction commit, we propose checkpointing a few very specific instructions with the objective of reducing and managing all the critical resources in the architecture such as ROB, Register File and Instruction Queues. We apply checkpointing, for example, to long-latency load instructions or hard-to-predict branch instructions. This mechanism of checkpointing:
- makes the existence of the classical ROB unnecessary.
- allows release the resources in an aggressive way. For example, strategic checkpointing allows an efficient implementation of early release and late allocation of registers.
- allows more intelligent management of the instruction queues.
In this talk, we will comment some of our papers describing the previous mechanisms and we will open new topics for research.
JavaSplit: A Runtime for Execution of Monolithic Java Programs on Heterogeneous Collections of Commodity Workstations
Assaf Schuster, Konstantin Shagin, and Michael Factor, Technion and IBM Haifa Labs
We present JavaSplit, a portable runtime for distributed execution of multithreaded Java programs. JavaSplit transparently distributes threads and objects of an application among the participating machines. Thus, it gains augmented computational power and increased memory capacity without modifying the Java multithreaded programming conventions.
JavaSplit works by rewriting the bytecodes of a given parallel application, transforming it into a distributed application that incorporates all the runtime logic. Each runtime node carries out its part of the resulting distributed computation using nothing but its local standard (unmodified) Java Virtual Machine (JVM). This is unlike previous Java-based distributed runtime systems, which use a specialized JVM or utilize unconventional programming constructs. Since JavaSplit is orthogonal to the implementation of a local JVM, it achieves portability across any existing platform and allows each node to locally optimize the performance of its JVM, e.g., via a just-in-time compiler (JIT).
Choosing Among Alternative Pasts
Marina Biberstein, Shmuel Ur, and Eitan Farchi, IBM Haifa Labs
The main problem with testing concurrent programs is their non-determinism, i.e., two executions with the same input that yield different results due to changed thread schedule (aka interleaving). This problem is aggravated by the fact that most thread schedulers are almost deterministic, generating the same interleavings over and over for a given testing environment. The traditional approach to testing concurrent programs is to identify and examine the race conditions. A different solution is noise-making, which generates different interleavings at runtime, for example using embedded sleep statements. The advantages of the latter approach include the ability to identify more problems and the absence of the false alarms.
This paper proposes a totally different technique for the generation of a rich set of interleavings. In this approach, operations on shared variables are tracked. Every time a shared variable is read, the value to be read is selected from the set of values held by this variable during the program execution. Only those values that the variable could hold in some interleaving consistent with the past observed events, are considered. Within this subset, the value choice can be random, biased-random, based on coverage, etc.
The problem of identifying read values that are consistent with the past observations is far from simple, since past decisions on value selection affect future ones. Our solution is computationally intensive and, therefore, impractical as is. However, insights gained from this solution lead to new heuristics for noise-making.
Overlapping Memory Operations with Circuit Evaluation in Hardware Compilation
Yosi Ben-Asher, Haifa University, and Gad Haber, IBM Haifa Labs
We consider the problem of compiling programs in a general high level programming language to hardware circuits to be executed by an FPGA. In particular we consider the problem of synthesizing nested loops which frequently access array elements stored in an external memory (out of the FPGA). Previous works in hardware compilation and reconfigurable architectures suspended circuit evaluation in order to fetch array values from the external memory. We propose an aggressive compilation scheme where array references from/to the external memory are overlapped with uninterrupted hardware evaluation of the synthesized loop's circuit. Such a pipeline mode of execution can take place only if:
- In every "circuit evaluation" the number of memory references needed as input to the next circuit evaluation is smaller than the number of operations in the current circuit.
- It is possible to program and compile code such that the set memory addresses needed for the "next circuit evaluation" can be computed in parallel during the "current circuit evaluation".
We have implemented a restricted programming language called DOL to verify these basic assumptions. DOL programs are easily synthesized to hardware and our experimental results give a preliminary evidence that such a compilation scheme can be used to compile large code segments into circuits including overlapping of hardware operations and memory references.
|
|
|
|
|