International Circuit

Can Fujitsu beat Nvidia in the HPC race?

Arm processors on servers have gone from failed starts (Calxeda) to modest successes (ThunderX2) to real contenders (ThunderX3, Ampere). Now, details have emerged about Japanese IT giant Fujitsu’s Arm processor, which the company claims will offer better HPC performance than Nvidia GPUs while consuming less power.

Fujitsu is developing the A64FX, a 48-core Armv8 derivative engineered specifically for high-performance computing (HPC). Rather than design general-purpose compute cores, Fujitsu has added compute engines tailored to artificial intelligence, machine learning, and other workloads common in HPC.

The chip will go in a new supercomputer called Fugaku, also known as Post-K. The name Post-K refers to the K supercomputer, at one time the fastest supercomputer in the world, which ran on custom Sparc chips before RIKEN Lab, where it was installed, pulled the plug.

Fujitsu has revealed some new details, and they are impressive. The A64FX is a major departure from conventional server processor design. Instead of the chiplet approach of the AMD Epyc and some Xeons, it is a single monolithic die. More important, four stacks of High Bandwidth Memory 2 (HBM2), an expensive but very fast memory used only in high-end systems, are connected to the CPU. Two 8GB stacks are placed on each side of the CPU.

Prototypes of the A64FX motherboard reveal it has no DIMM sockets. An Intel or AMD motherboard will show up to a dozen DIMM sockets for each CPU, but the A64FX motherboard has none. That’s because the A64FX carries its HBM2 memory in the processor package: four 8GB stacks, for 32GB per CPU.

In HPC, memory bandwidth has long been the bottleneck, and it slows data-intensive workloads like analytics, simulations, and machine learning. Moving data around also uses much more power, up to 100 times as much, than actually processing it. So to achieve energy efficiency, data needs to move as little as possible.
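To see why that ratio dominates the energy budget, here is a rough back-of-the-envelope sketch in Python. It is not Fujitsu's model or a measurement of the A64FX: the energy units are normalized, the 100:1 ratio is simply the upper bound quoted above applied per operand (an assumption), and the function names `total_energy` and `movement_share` are made up for illustration.

```python
# Toy energy model in normalized units. These are NOT A64FX measurements.
# Assumption: moving one operand costs 100x the energy of processing it,
# which is the "up to 100 times" upper bound quoted in the article.
COMPUTE_ENERGY_PER_OP = 1.0
MOVE_ENERGY_PER_OPERAND = 100.0


def total_energy(operations: int, operands_moved: int) -> float:
    """Energy spent on computing plus energy spent on moving data."""
    compute = operations * COMPUTE_ENERGY_PER_OP
    movement = operands_moved * MOVE_ENERGY_PER_OPERAND
    return compute + movement


def movement_share(operations: int, operands_moved: int) -> float:
    """Fraction of the total energy that goes to data movement."""
    movement = operands_moved * MOVE_ENERGY_PER_OPERAND
    return movement / total_energy(operations, operands_moved)


if __name__ == "__main__":
    ops = 1_000_000
    # Case 1: every operation fetches a fresh operand from memory.
    print(f"no reuse:  {movement_share(ops, ops):.1%} of energy moves data")
    # Case 2: each fetched operand is reused for ten operations.
    print(f"10x reuse: {movement_share(ops, ops // 10):.1%} of energy moves data")
```

Under these assumptions, data movement accounts for nearly all of the energy even when each fetched operand is reused ten times, which is why shortening the path between processor and memory matters so much.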

So the A64FX has a totally different design from your standard Arm or x86 chip. It has no socketed system memory, just 32GB per processor of extremely fast memory connected directly to the chip via a high-speed interconnect instead of through a much slower memory bus. This will greatly reduce latency between CPU and memory and also reduce power, because data doesn’t have to travel to and from memory sockets.

―Network World


Copyright © 2023 Communications Today
