AMD’s MI300 APUs to Power Exascale El Capitan Supercomputer
Additional details of the architecture of the exascale El Capitan supercomputer were disclosed today by Lawrence Livermore National Laboratory’s (LLNL) Terri Quinn in a presentation delivered to the 79th HPC User Forum at Oak Ridge National Laboratory (ORNL). Quinn revealed that AMD’s forthcoming MI300 APU would be the computational bedrock of El Capitan, which is slated for installation at LLNL in late 2023.
El Capitan, an HPE-built, AMD-powered exascale system, is on track to deliver roughly 2 exaflops of peak performance courtesy of tightly packaged AMD GPUs and CPUs bundled into HPE Cray EX racks and tied together with Slingshot-11 networking.
Now, we know a bit more about that AMD hardware: El Capitan will be powered by AMD’s forthcoming MI300 APUs.
“It’s the first time we’ve publicly stated this,” said Quinn, associate director for HPC at LLNL. “I cut these words out of [AMD’s] investors document and that’s what it says: it’s a 3D chiplet design with AMD CDNA3 GPUs, Zen 4 CPUs, cache memory and HBM chiplets.”
“I can’t give you all the specs, but [El Capitan] is at least 10× greater than Sierra performance on average figures of merit,” said Quinn. “Theoretical peak is two double-precision exaflops, [and we’ll] keep it under 40 megawatts—same reason as Oak Ridge, the operating cost.”
The MI300 APU is the follow-on to the MI200 GPU, the powerhouse of the exascale Frontier supercomputer that is in early operations at Oak Ridge. The culmination of AMD’s expertise in processors, packaging and fabric technology, the MI300 will leverage AMD’s fourth generation Infinity architecture, which enables 2.5D and 3D chiplet integration with unified system level coherency. It will be implemented with AMD’s next-generation Zen 4 “Genoa” CPU, the successor to Milan.
At AMD’s 2022 Financial Analyst Day earlier this month, AMD’s Forrest Norrod said that the MI300 APU would deliver 8× more training performance using its new FP8 numerical format. Further, AMD reported that Genoa (the CPU engine inside El Capitan) will deliver 75 percent higher Java Enterprise performance.
Quinn also said that while Livermore has traditionally used the vendor’s system software and management software, they are switching over to use the custom NNSA Tri-Lab Operating System Stack (TOSS) that they’ve used for traditional Linux clusters.
It makes sense, said Quinn, if you think of El Capitan as, essentially, a big Linux cluster. “It’s not thousands of nodes, it’s [on the] order of nodes that we get on our other clusters. So we put it on here, and we can have every one of our systems in our center running exactly the same operating system, so it’s a big attraction for us.”
Livermore is already operating a number of test and development systems in preparation for receiving El Capitan. Three such early access systems — rzVernal, Tioga and Tenaya — ranked in the top 200 of the recent Top500 list. All are built by HPE using “Frontier-style” blades that consist of AMD MI250X GPUs, Milan CPUs and Slingshot-11 networking inside Cray EX cabinets.
LLNL operates under the auspices of the National Nuclear Security Administration (NNSA), which strives to bolster national security through the military applications of nuclear science. To that end, Quinn explained that the US’ “global competitors” have been modernizing their nuclear weapons stockpiles in recent years. “The US has relied on extending the life of our nuclear weapons,” she continued. “But we’ve kind of reached the point where the government wants to look at modernizing our stockpile[.] El Capitan was planned to deliver on that mission.”
Last month, NNSA completed the Exascale Computing Facility Modernization (ECFM) Project at Livermore after initiating the project in March 2020. Quinn explained that the existing facility had sufficient size (48,000 square feet) and structural integrity (625 pounds per square foot), but needed an overhaul to support the site’s mission of enabling the operation of two exascale-class systems simultaneously. The electrical supply was upgraded from 45MW to 85MW and cooling was scaled to accommodate 28,000 tons of water, including a new 18,000-ton cooling tower.
Installation of El Capitan is slated for 2023, with the aim of putting NNSA workloads into production in the second quarter of 2024. The system is expected to operate from 2024 to 2030.
The exascale field
NNSA’s El Capitan system is currently third in the US exascale queue. Front-runner Frontier, also an HPE-AMD-DOE collab, just became the first supercomputer to cross the Linpack exaflops threshold, clocking 1.102 exaflops of performance on the spring 2022 Top500 list. On the Green500, Frontier rides shotgun to its companion test and development system (Frontier TDS, aka Borg), which took the number one spot on the Green500 list. (Borg is #29 on the Top500 with 19.2 petaflops.)
Argonne National Laboratory is awaiting completion of the Aurora supercomputer, a 2-exaflops HPE-Intel machine that has undergone several reconceptualizations. Aurora’s execution is also potentially beset by additional delays pertaining to its Sapphire Rapids CPU, so the exact timeline is fuzzy, but installation is reportedly underway.
Oliver Peckham contributed to this report.