The recent Green500 list was dominated by heterogeneous systems, and sitting at the top of the list were two Italian systems, built by Eurotech. Outfitted with an innovative hot-water cooling system and NVIDIA Kepler GPUs, the machines were about 30 percent more energy-efficient than the previous Green500 champ, Beacon.
Green Computing Report spoke with Sumit Gupta, NVIDIA's General Manager of Tesla GPU Accelerated Computing, to find out where the GPU vendor is headed in terms of green computing.
The current Green500 chart-topper, named Eurora, relies on an Intel CPU paired with a slightly modified Tesla K20 GPU. Gupta believes that the Eurotech system's cooling apparatus is what best exposed the energy efficiency of the GPU. Because it is a liquid-cooled system, NVIDIA supplied a specially designed board that uses a very thin plate. The Eurotech folks connected their plate, which has liquid running through it, to the NVIDIA plate.
"It's not a special SKU as much as it is a SKU designed for liquid cooling," remarked Gupta of this modified K20.
The top-of-the-line K20x GPU, however, offers even more FLOPS/watt than the K20 (5.6 gigaflops/watt versus 4.98 gigaflops/watt). Gupta agrees that this would be a good option for a system vendor looking to achieve greater energy efficiency.
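To put the two quoted figures side by side, the short Python sketch below computes the relative gain. The only inputs are the 5.6 and 4.98 gigaflops/watt numbers cited above; the function and variable names are illustrative and not from NVIDIA.

```python
# Efficiency figures as quoted in the article (gigaflops per watt).
K20_GFLOPS_PER_WATT = 4.98
K20X_GFLOPS_PER_WATT = 5.6


def relative_gain(new: float, old: float) -> float:
    """Return the fractional improvement of `new` over `old`."""
    return new / old - 1.0


gain = relative_gain(K20X_GFLOPS_PER_WATT, K20_GFLOPS_PER_WATT)
print(f"K20x over K20: {gain:.1%} more gigaflops per watt")
```

On these numbers the K20x comes out roughly 12 percent ahead of the K20 in FLOPS/watt.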
As to whether NVIDIA's roadmap will produce further FLOPS/watt increases, Gupta replied that thinking of power and performance together is in the company's DNA. That sensibility comes from the mobile world, where power limitations are a given. The standard in high-end computing might be to get the most out of a 225-watt power envelope, whereas the Tegra guys (NVIDIA's mobile group) start with the premise that they have zero watts. Every watt that they add is a big deal.
"We always think about performance-per-watt in every design we do," notes Gupta. To illustrate this principle, Gupta mentions that the current Kepler GPUs that are powering HPC systems are going into NVIDIA's next cell phone chip, codenamed Logan. "This gives you the power or energy sensibility that we have when designing GPUs," he adds.
The Eurora system is built for real applications, so it's not a stunt machine, but at number 467, it is not that powerful compared to other TOP500 machines. The real challenge for the HPC community is increasing the energy efficiency of petascale-plus machines with an eye toward creating an exascale machine. With this in mind, it's worrisome that out of the top 10 systems on the TOP500 list, none made it into the top 10 of the Green500.
Still, Gupta is optimistic about the design of the Eurotech systems as far as scalability goes. "I don't foresee any problems with scaling it. I would say the next opportunity for optimization is to have a much better interconnect. Using richer InfiniBand and more complex network hierarchies will improve the overall efficiency and throughput of the system," he said.
Mirroring a growing community sentiment, Gupta is skeptical of using the LINPACK benchmark as the primary measurement. The Green500 and the TOP500 are both based on this metric, but Gupta argues that real applications are where energy efficiency really counts. People aren't running LINPACK on a day-to-day basis, and LINPACK is the most power-hungry application because it basically maximizes GPU utilization, he says.
"With real applications, we are able to get much higher performance capability," notes Gupta. "In reality, LINPACK takes 225 watts per GPU, but a real application is running well below that. It's running at 200 watts or sometimes 150 watts only, so there's much more performance per watt because the energy footprint has gone down."
It's similar to driving a car really fast. At higher speeds, energy consumption goes up exponentially. Gupta offers a comparison between GPUs and CPUs. He says that CPUs need to run very fast to get good performance, while GPUs have thousands of cores running at slow speeds. CPU clocks are running at 2-3 gigahertz or higher, whereas GPU cores are running at 700 megahertz (0.7 gigahertz). The clock speed of a GPU core is about one-fourth that of a CPU core.
General-purpose GPU (GPGPU) computing was instrumental in driving the first petascale systems, and the fastest and greenest systems are increasingly relying on accelerators, like NVIDIA GPUs, and coprocessors, like Intel Xeon Phi.
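Both of Gupta's points, the lower power draw of real applications and the slower GPU clocks, can be put into rough numbers. The Python sketch below uses only the wattages and clock speeds quoted above. It rests on one simplifying assumption that is not in the interview: that a GPU delivers the same useful throughput at each power draw. The function names are illustrative.

```python
# Per-GPU power draws quoted by Gupta (watts).
LINPACK_WATTS = 225.0
REAL_APP_WATTS = (200.0, 150.0)

# Clock speeds quoted in the article (gigahertz).
CPU_CLOCKS_GHZ = (2.0, 3.0)
GPU_CLOCK_GHZ = 0.7


def perf_per_watt_ratio(baseline_watts: float, app_watts: float) -> float:
    """Performance-per-watt relative to the baseline.

    ASSUMPTION (not from the interview): throughput is identical at both
    power draws, so the ratio reduces to baseline_watts / app_watts.
    """
    return baseline_watts / app_watts


def clock_ratio(gpu_ghz: float, cpu_ghz: float) -> float:
    """GPU core clock as a fraction of the CPU core clock."""
    return gpu_ghz / cpu_ghz


for watts in REAL_APP_WATTS:
    ratio = perf_per_watt_ratio(LINPACK_WATTS, watts)
    print(f"{watts:.0f} W: {ratio:.3f}x the performance per watt of LINPACK")

for cpu_ghz in CPU_CLOCKS_GHZ:
    fraction = clock_ratio(GPU_CLOCK_GHZ, cpu_ghz)
    print(f"GPU clock is {fraction:.2f} of a {cpu_ghz:.0f} GHz CPU clock")
```

Under the equal-throughput assumption, dropping from 225 watts to 150 watts is worth 1.5 times the performance per watt. The clock ratios land between roughly one-quarter and one-third, which is consistent with the one-fourth figure for CPUs clocked toward the upper end of the quoted range.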
Horst Simon recently said that he foresees that by 2015 all of the top 10 systems will use accelerators.
"You can't build these systems at this level without accelerators because of the energy efficiency alone," asserts Gupta. "Building Titan, a 27-petaflop machine, with only CPUs would have required about four times more power, making the power bill per year higher than the cost of the system. It's not feasible." The other really interesting data point, notes Gupta, is that the best x86-only system on the Green500 is number 39.
When the conversation turned to converged chip architectures, Gupta was critical of the "Knights Landing" Xeon Phi x86 coprocessor. He said that there are always pros and cons. In the case of Intel's Knights Landing chip, he said the CPU cores are very weak; if one of those cores runs the primary application, it becomes the weakest link, per Amdahl's Law, creating the risk that the application could run slower.
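The Amdahl's Law argument can be sketched in a few lines of Python. This is a minimal illustration of the general formula, not a model of Knights Landing: the parallel fraction, the speed of the weak core relative to a regular CPU core, and the parallel speedup are all hypothetical values chosen to show the effect.

```python
def amdahl_speedup(parallel_fraction: float,
                   serial_core_speed: float,
                   parallel_speedup: float) -> float:
    """Overall speedup versus a baseline run on one regular CPU core.

    parallel_fraction: share of baseline runtime that can be parallelized.
    serial_core_speed: speed of the core running the serial part, relative
        to a regular CPU core (1.0 = same speed, 0.25 = four times slower).
    parallel_speedup: how much faster the parallel part runs.
    """
    serial_time = (1.0 - parallel_fraction) / serial_core_speed
    parallel_time = parallel_fraction / parallel_speedup
    return 1.0 / (serial_time + parallel_time)


# Hypothetical scenarios: a weak core at one-quarter speed, 20x parallel gain.
mostly_parallel = amdahl_speedup(0.9, 0.25, 20.0)
half_parallel = amdahl_speedup(0.5, 0.25, 20.0)

print(f"90% parallel code: {mostly_parallel:.2f}x")
print(f"50% parallel code: {half_parallel:.2f}x")
```

With these made-up inputs, the code that is 90 percent parallel still comes out ahead, while the code that is only 50 percent parallel ends up slower than the baseline, because the serial half is stuck on the weak core.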
"You'd have to find applications that take advantage of these integrated CPU cores, because if you have a weak CPU core, like Knights Landing, you might actually run slower than regular CPU cores, so it's not as much of a slam-dunk as people think it will be," notes Gupta.
When it comes to the ARM architecture, however, the NVIDIA rep is bullish. He predicts that ARM is going to give Intel a serious run for their money.
NVIDIA is CPU-agnostic, and can work with x86 or ARM. The GPU vendor recently announced CUDA support for ARM, and ARM is a key part of its mobile investment strategy. Its next mobile chip, Logan, is an integrated solution with ARM plus a CUDA GPU.
Researchers are already pursuing CUDA on ARM as a potential exascale strategy. John Stone, associate director of the CUDA Center of Excellence at the University of Illinois at Urbana-Champaign, shared the following insight in a recent interview.
"There are many challenges to overcome before exascale computing becomes a reality, but the power consumption issue is fundamental, and the processors found in future supercomputers may have a great deal in common with the energy-efficient mobile platforms of today. Having CUDA available on ARM allows researchers like us to begin preparing our codes for future energy-efficient HPC platforms, working past stumbling blocks, and developing new algorithms," said Stone.