When initially conceived, Japan’s Post-K supercomputer was supposed to be the nation’s first exascale system. Developed by Fujitsu and the RIKEN Center for Computational Science, the system, now known as Fugaku, is designed to be two orders of magnitude faster than its predecessor, the 11.3-petaflops (peak) K computer. But a funny thing happened on the way to exascale. By the time the silicon dust had settled on the A64FX, the chip that will power Fugaku, it had morphed into a pre-exascale system.
The current estimate is that the RIKEN-bound supercomputer will top out at about 400 peak petaflops at double precision. Given that the system has to fit in a 30 MW to 40 MW power envelope, that’s about all you can squeeze out of the 150,000 single-socket nodes that will make up the machine. Which is actually rather impressive. The A64FX prototype machine, aka “micro-Fugaku,” is currently the most energy-efficient supercomputer in the world, delivering 16.9 gigaflops per watt. However, extrapolating that out to an exaflop machine with those same (or very similar) processors would require something approaching 60 MW to 80 MW.
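A quick back-of-envelope calculation, using only the figures quoted above, shows where that estimate comes from: at micro-Fugaku’s 16.9 gigaflops per watt, a peak exaflop lands at roughly 59 MW before any system-level overhead.

```python
# Back-of-envelope check using only the figures quoted in the text.
peak_exaflop = 1.0e18          # one exaflop, in flops per second
efficiency = 16.9e9            # micro-Fugaku: 16.9 gigaflops per watt

print(f"Power for one exaflop at 16.9 GF/W: {peak_exaflop / efficiency / 1e6:.0f} MW")  # ~59 MW

# And the 400-petaflop machine inside its 30 MW to 40 MW envelope:
for mw in (30, 40):
    print(f"{400e15 / (mw * 1e6) / 1e9:.1f} GF/W needed at {mw} MW")
```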
But according to Satoshi Matsuoka, director of the RIKEN lab, the performance goal of a two-orders-of-magnitude improvement over the K computer will still be met from an application performance perspective. “That was the plan from the beginning,” Matsuoka tells The Next Platform.
To suggest that a 100-fold application boost amounts to exascale capability is a bit of a stretch, but if Fugaku effectively performs at that level relative to the performance of applications on the K machine, that is probably more important to RIKEN users. It should be pointed out that not all applications are going to enjoy that magnitude of speedup. The table below illustrates the expected performance boost for nine target applications relative to the K computer.

Even though Fugaku has only 20 times the raw performance and energy efficiency of its predecessor, the 100X application performance improvement is the defining metric, says Matsuoka. That kind of overachievement (again, on some codes) is the result of certain capabilities baked into the A64FX silicon, in particular the use of Arm’s Scalable Vector Extension (SVE), which provides something akin to an integrated 512-bit-wide vector processor on-chip, delivering about three teraflops of peak oomph.
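For a rough sense of where that three-teraflop figure comes from, here is a minimal sketch assuming the commonly cited A64FX configuration of 48 compute cores, two 512-bit SVE FMA pipelines per core, and a clock around 2 GHz; those specifics are assumptions on our part, not figures from the article.

```python
# Assumed A64FX specs (not stated in the text): 48 compute cores,
# two 512-bit SVE FMA pipes per core, roughly 2.0 GHz clock.
cores = 48
pipes_per_core = 2
fp64_lanes = 512 // 64        # 8 double-precision lanes per 512-bit pipe
flops_per_fma = 2             # a fused multiply-add counts as two flops
clock_hz = 2.0e9

peak_fp64 = cores * pipes_per_core * fp64_lanes * flops_per_fma * clock_hz
print(f"Peak FP64 per chip: {peak_fp64 / 1e12:.2f} TF")   # about 3.1 TF
```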
Perhaps even more significant is the 32 GB of HBM2 stacked memory glued onto the A64FX package, which delivers 29X the bandwidth of the memory system on the K computer. The choice to dispense with conventional memory and go exclusively with HBM2 was the result of the recognition that many HPC applications these days are memory-bound rather than compute-bound. In fact, achieving a better balance between flops and memory bandwidth was a key design point for Fugaku. The compromise here is that 32 GB is not a lot of capacity, especially for applications that need to work with really large datasets.
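A simple roofline-style estimate shows why that trade makes sense. The sketch below assumes roughly 1 TB/s of HBM2 bandwidth per node alongside the roughly 3 TF peak; the bandwidth figure is an assumption, not something stated above.

```python
# Roofline-style estimate: attainable flops = min(peak, intensity * bandwidth).
peak_flops = 3.0e12    # assumed ~3 TF FP64 peak per node
bandwidth = 1.0e12     # assumed ~1 TB/s HBM2 bandwidth per node

def attainable(intensity):
    """Attainable flop rate for a kernel with the given arithmetic intensity (flops/byte)."""
    return min(peak_flops, intensity * bandwidth)

# A stream-like kernel such as daxpy does ~2 flops per 24 bytes moved.
for name, intensity in [("daxpy-like", 2 / 24), ("stencil-like", 1.0), ("dgemm-like", 30.0)]:
    print(f"{name:12s}: {attainable(intensity) / 1e12:.2f} TF attainable")
```

Kernels with low arithmetic intensity sit on the bandwidth-limited side of that curve, which is why trading memory capacity for HBM2 bandwidth pays off for so many HPC codes.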
The other aspect of Fugaku that could earn it exascale street cred is in the realm of lower-precision floating point. Although the system will deliver 400 peak petaflops at double precision (FP64), it will provide 800 petaflops at single precision (FP32) and 1.6 exaflops at half precision (FP16). The half-precision support alludes to AI applications, which can make extensive use of 16-bit floating point arithmetic to build artificial neural networks. Fugaku might even manage to hit an exaflop or better on the HPL-AI benchmark, which makes extensive use of FP16 to run High Performance Linpack (HPL).
When run on the 200-petaflops “Summit” machine at Oak Ridge National Laboratory, HPL-AI delivered 445 petaflops on Linpack, three times faster than the result obtained with FP64 alone. More to the point, if the same iterative refinement techniques using FP16 can be applied to real applications, it is possible that actual HPC codes could be accelerated to exascale levels.
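The technique is easy to sketch in a few lines of NumPy. This is a toy illustration of mixed-precision iterative refinement in general, not HPL-AI itself: the low-precision work here is done in FP32 (NumPy’s solver does not handle FP16), whereas HPL-AI and the Summit runs push it down to FP16 on Tensor Cores.

```python
import numpy as np

def mixed_precision_solve(A, b, iterations=5):
    # Toy iterative refinement: solve cheaply in low precision, accumulate
    # residuals in FP64. A real implementation would factorize A once in low
    # precision and reuse the LU factors for every solve.
    A_low = A.astype(np.float32)
    x = np.linalg.solve(A_low, b.astype(np.float32)).astype(np.float64)
    for _ in range(iterations):
        r = b - A @ x                                        # residual in full FP64
        d = np.linalg.solve(A_low, r.astype(np.float32))     # low-precision correction
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print("residual norm:", np.linalg.norm(b - A @ x))
```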
The more straightforward use of reduced-precision math, employing both FP16 and FP32, is for training AI models. Again, work on Summit proved that lower-precision math could attain exascale-level computing on these machines. In this particular case, developers employed the Tensor Cores on the system’s V100 GPUs to use a neural network to classify extreme weather patterns, achieving peak performance of 1.13 exaops and sustained performance of 0.999 exaops.
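The FP16/FP32 pattern those Tensor Cores enable looks roughly like the PyTorch sketch below. It is a generic example of mixed-precision training under assumed toy model and data shapes, not the Summit climate-classification code itself.

```python
import torch
from torch import nn

# Generic mixed-precision training loop (illustrative only): matmuls run in
# FP16 on Tensor Cores where available, while master weights and loss scaling
# stay in FP32 to preserve accuracy.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    x = torch.randn(64, 256, device="cuda")          # stand-in for real input data
    y = torch.randint(0, 2, (64,), device="cuda")    # stand-in labels
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                  # forward pass in an FP16/FP32 mix
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                    # scale loss so FP16 gradients don't underflow
    scaler.step(optimizer)                           # unscale gradients, update FP32 weights
    scaler.update()
```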
Whether reduced-precision exaflops or exaops qualify as exascale computing is a semantic exercise more than anything else. Of course, that is not going to be very satisfying for computer historians, or even for analysts and journalists trying to track HPC capability in real time.
But perhaps that is as it should be. The attainment of a particular peak or Linpack performance number does little to inform the state of supercomputing. And given the growing importance of AI workloads, which are not based on 64-bit computing, it is not surprising that HPC is moving away from these simplistic measures. The anticipated emergence of neuromorphic and quantum computing in the coming decade will further muddy the waters.
That said, users will continue to rely primarily on 64-bit flops to run the HPC simulations that scientists and engineers will depend on heavily for the foreseeable future.
With that in mind, RIKEN is already planning for its post-Fugaku system, which Matsuoka says is tentatively scheduled to make its appearance in 2028. According to him, RIKEN will be doing an assessment of how it can build something 20X more powerful than Fugaku. He says the challenge is that current technologies won’t extrapolate to such a system in any practical way, which once again means they will have to innovate at the architectural level, but this time without the benefit of Moore’s Law.