The Case for Chip Multiprocessors based on the Data-Driven Multithreading Model


Prof. Paraskevas Evripidou
University of Cyprus, Nicosia, Cyprus

Tuesday, May 16, 2006
1:30PM-2:30PM California Time
4:30PM-5:30PM New York Time
9:30PM-10:30PM UK Time
10:30PM-11:30PM Central Europe Time
11:30PM-12:30AM Eastern Europe Time
5:30AM-6:30AM Japan Time, May 17
6:00AM-7:00AM Adelaide/Australia Time, May 17

The current challenge in computer architecture is how to use the upcoming billions of transistors per chip effectively while overcoming the Memory Wall problem and keeping power consumption in check. We make the case for a Chip Multiprocessor (CMP) based on the Data-Driven Multithreading (DDM) model of execution. DDM is a multithreading model that effectively hides communication and synchronization delays, thus overcoming the memory wall problem. DDM employs deterministic prefetching of data that yields miss rates even lower than those of control-flow execution, more than compensating for any locality loss due to data-driven scheduling. A CMP based on DDM avoids the complexity of other designs by combining several simple commodity microprocessors with a small extra hardware structure, the Thread Scheduling Unit (TSU), thus lowering the power consumption per instruction.

To deliver performance as predicted by Moore's Law, computer architects have relied on advances in process technology and on improvements in computer architecture and organization. While this approach has worked well in the past, it now yields only diminishing returns. Uniprocessor performance increased by 52% per year from 1986 to 2002; since then, the increase has been only about 20% per year. This is due to the inability of traditional architectures to surmount two major obstacles: the memory wall and the power wall. Both walls can be traced back to the von Neumann model of execution, which has dominated the computer architecture field since the advent of digital computers. The memory wall is due to the imbalance between the speed of microprocessors and that of main memory, while the power wall is due to the high frequencies and complexity of modern microprocessors.

Current microprocessors have a high transistor density, execute at very high frequencies, include large cache memories, and rely heavily on out-of-order and speculative execution. Implementing these techniques exponentially increases the concurrency needs at the gate level, causing power consumption to get out of hand. Recently we have witnessed a significant decrease in the performance improvement per year. Consequently, we propose to use Data-Driven Multithreading (DDM), as it does not suffer from the above-mentioned limitations. This is because DDM is based not on the von Neumann model of execution but on the data-flow model of execution. Thus, memory latencies can be tolerated without the huge performance penalty of the von Neumann model. Furthermore, data-driven scheduling does not require the complexity of multiple-issue and out-of-order execution. Overall, the use of DDM results in a much lower Power-per-Instruction (PPI).

Data-Driven Multithreading is a non-blocking multithreading model based on the Decoupled Data-Driven model of execution. This model decouples the synchronization from the computation portions of a program allowing them to execute asynchronously. In this model a thread is scheduled for execution in a data-driven manner, i.e., whenever all of its required data have been produced. As a consequence, no synchronization or communication latencies are experienced in the processor critical path. The performance improvement is achieved at the expense of extra hardware. This hardware is responsible for the dynamic scheduling of threads in the multiprocessor system. We propose a DDM implementation that may be used with regular off-the-shelf microprocessors. Therefore it has the obvious benefit that a system may combine both DDM and the latest microprocessor technology. The core of the DDM implementation presented is a hardware module that is attached directly to the processor's bus. This module is responsible for the thread scheduling and is known as the Thread Synchronization Unit (TSU).
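
The scheduling rule described above (a thread becomes runnable only once all of its required data have been produced) can be illustrated with a minimal software sketch. This is an assumption-laden toy, not the TSU hardware design: the class name ToyScheduler, its methods, and the per-thread ready counter are hypothetical names chosen for illustration, and threads are run one at a time on a single processor.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical software model of data-driven thread scheduling.

    Each thread carries a ready count: the number of producer threads
    that have not yet completed. A thread is queued for execution only
    when its count reaches zero, so it never blocks once started.
    """

    def __init__(self):
        self.ready_count = {}   # thread name -> producers still pending
        self.consumers = {}     # thread name -> threads waiting on it
        self.body = {}          # thread name -> callable to execute
        self.ready = deque()    # threads whose inputs are all available

    def add_thread(self, name, body, depends_on=()):
        self.body[name] = body
        self.ready_count[name] = len(depends_on)
        self.consumers.setdefault(name, [])
        for producer in depends_on:
            self.consumers.setdefault(producer, []).append(name)
        if not depends_on:
            self.ready.append(name)

    def run(self):
        """Execute threads in data-driven order; return that order."""
        order = []
        while self.ready:
            name = self.ready.popleft()
            self.body[name]()           # runs to completion, non-blocking
            order.append(name)
            for consumer in self.consumers[name]:
                self.ready_count[consumer] -= 1
                if self.ready_count[consumer] == 0:
                    self.ready.append(consumer)
        return order


# Usage: "sum" fires only after both loads have produced their data.
results = {}
scheduler = ToyScheduler()
scheduler.add_thread("load_a", lambda: results.update(a=2))
scheduler.add_thread("load_b", lambda: results.update(b=3))
scheduler.add_thread(
    "sum",
    lambda: results.update(s=results["a"] + results["b"]),
    depends_on=("load_a", "load_b"),
)
scheduler.add_thread(
    "square",
    lambda: results.update(sq=results["s"] ** 2),
    depends_on=("sum",),
)
order = scheduler.run()
```

In this sketch the synchronization bookkeeping (decrementing ready counts) is kept separate from the thread bodies, which loosely mirrors the decoupling of synchronization from computation described above.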

We have added CacheFlow to DDM, a cache management policy that uses the information provided by the DDM scheduling together with data forwarding and prefetching. Experiments over our entire test suite showed an average miss rate of 6.9% for the sequential code; the miss rate for DDM code rose to 9.8%, and CacheFlow brought it down to 1.5%. Overall, DDM achieved very respectable speedups: the 32-processor configuration reached a speedup of 26 compared to a single processor running native binaries produced by the C compiler without multithreading. Our estimates for the DDM chip multiprocessor are very encouraging. Extrapolating from the results of our research on DDM, our analysis shows that, for the same hardware budget, we can achieve speedups of up to an order of magnitude over the high-end commercial microprocessors available today.
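
As a quick reading aid, the figures quoted above can be turned into derived ratios. The sketch below uses only the numbers stated in the abstract; the variable names and the choice of derived metrics (parallel efficiency and relative miss-rate reduction) are our own assumptions, not quantities reported by the speaker.

```python
# Figures quoted in the abstract (percent miss rates, speedup, processors).
seq_miss = 6.9         # sequential code
ddm_miss = 9.8         # DDM code without CacheFlow
cacheflow_miss = 1.5   # DDM code with CacheFlow
speedup = 26
processors = 32

# Parallel efficiency: achieved speedup divided by processor count.
efficiency = speedup / processors

# Relative miss-rate reduction achieved by CacheFlow.
reduction_vs_ddm = (ddm_miss - cacheflow_miss) / ddm_miss
reduction_vs_seq = (seq_miss - cacheflow_miss) / seq_miss

print(f"parallel efficiency: {efficiency:.1%}")
print(f"miss-rate reduction vs. plain DDM: {reduction_vs_ddm:.1%}")
print(f"miss-rate reduction vs. sequential: {reduction_vs_seq:.1%}")
```

Under these definitions, the 32-processor run corresponds to roughly 81% parallel efficiency, and CacheFlow removes roughly 85% of the misses seen by plain DDM code.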

Slides (PDF, 0.5MB)