#### ISCA 23

#### A Research Retrospective on the AMD Exascale Computing Journey

Gabriel H. Loh Shaizeen Aga Johnathan Alsop Sergey Blagodurov Noel Chalmers David Cownie Alexandru Duțu Joseph L. Greathouse Sachin Hossamani John Kalamatianos Daniel Lowell Srilatha Manne Michael Mishkin Brandon Potter Karthik Rao John Slice Abhinav Vishnu Mark Wyse

Michael J. Schulte Derrick Aguren Paul T. Bauman Travis Boraten Shaoming Chen Nicholas Curtis Yasuko Eckert Sudhanva Gurumurthi Wei Huang Onur Kayiran Niti Madan Susumu Mashimo Mark Nutter Kishore Punniyamurthy Gregory Rodgers Vilas Sridharan Samuel Wasmundt Adithya Yalavarti Advanced Micro Devices, Inc.

Mike Ignatowski Varun Agrawal Bradford M. Beckmann Michael Boyer Kevin Cheng Joris Del Pino Christopher Erb Anthony Gutierrez Mahzabeen Islam Jagadish Kotra Abhinandan Majumdar Damon McDougall Indrani Paul Sooraj Puthoor Marko Scrbak Rene van Oostrum Mark Wilkening Dmitri Yudanov

Vignesh Adhinarayanan Ashwin M. Aji Majed Valad Beigi William C. Brantley Michael L. Chu Nam Duong Chip Freitag Khaled Hamidouche Nuwan Jayasena Alan Lee Nicholas Malaya Elliot Mednick Matthew Poremba Steven E. Raasch Mohammad Seyedzadeh Eric van Tassell Noah Wolfe



### **Frontier: Exploring Exascale**

#### The System Architecture of the First Exascale Supercomputer

Scott Atchley Oak Ridge National Laboratory Oak Ridge, TN, USA atchleyes@ornl.gov

David E. Bernholdt Oak Ridge National Laboratory Oak Ridge, TN, USA bernholdtde@ornl.gov

Michael J. Brim Oak Ridge National Laboratory Oak Ridge, TN, USA brimmj@ornl.gov

Markus Eisenbach Oak Ridge National Laboratory Oak Ridge, TN, USA eisenbachm@ornl.gov

Nicholas Frontiere Argonne National Laboratory Lemont, IL, USA nfrontiere@anl.gov

Chris Zimmer Oak Ridge National Laboratory Oak Ridge, TN, USA zimmercj@ornl.gov

Verónica G. Melesse Vergara Oak Ridge National Laboratory Oak Ridge, TN, USA vergaravg@ornl.gov

Reuben Budiardja Oak Ridge National Laboratory Oak Ridge, TN, USA reubendb@ornl.gov

Thomas Evans Oak Ridge National Laboratory Oak Ridge, TN, USA evanstm@ornl.gov

Antigoni Georgiadou Oak Ridge National Laboratory Oak Ridge, TN, USA georgiadoua@ornl.gov

John R. Lange Oak Ridge National Laboratory Oak Ridge, TN, USA University of Pittsburgh Pittsburgh, PA, USA langejr@ornl.gov

Thomas Beck Oak Ridge National Laboratory Oak Ridge, TN, USA becktl@ornl.gov

Sunita Chandrasekaran University of Delaware Newark, DE, USA schandra@udel.edu

Matthew Ezell Oak Ridge National Laboratory Oak Ridge, TN, USA ezellma@ornl.gov

Joe Glenski Hewlett Packard Enterprise Bloomington, MN, USA glenski@hpe.com

Philipp Grete Universität Hamburg Hamburg, Germany pgrete@hs.uni-hamburg.de

Axel Huebl Lawrence Berkeley National Laboratory Berkeley, CA, USA axelhuebl@lbl.gov

Kim McMahon Hewlett Packard Enterprise Bloomington, MN, USA kim.mcmahon@hpe.com

Andrew Myers Lawrence Berkeley National Laboratory Berkeley, CA, USA atmyers@lbl.gov

Thomas Papatheodore Oak Ridge National Laboratory Oak Ridge, TN, USA papatheodore@ornl.gov

Evan Schneider University of Pittsburgh Pittsburgh, PA, USA eschneider@pitt.edu

Steven Hamilton Oak Ridge National Laboratory Oak Ridge, TN, USA hamiltonsp@ornl.gov

Daniel Jacobson Oak Ridge National Laboratory Oak Ridge, TN, USA jacobsonda@ornl.gov

Elia Merzari Pennsylvania State University University Park, PA, USA ebm5351@psu.edu

Stephen Nichols Oak Ridge National Laboratory Oak Ridge, TN, USA nicholsss@ornl.gov

Danny Perez Los Alamos National Laboratory Los Alamos, NM, USA danny\_perez@lanl.gov

Jean-Luc Vay Lawrence Berkeley National Laboratory Berkeley, CA, USA jlvay@lbl.gov

John Holmen Oak Ridge National Laboratory Oak Ridge, TN, USA holmenjk@ornl.gov

Wayne Joubert Oak Ridge National Laboratory Oak Ridge, TN, USA joubert@ornl.gov

Stan G Moore Sandia National Laboratory Albuquerque, NM, USA stamoor@sandia.gov

Sarp Oral Oak Ridge National Laboratory Oak Ridge, TN, USA oralhs@ornl.gov

David M. Rogers Oak Ridge National Laboratory Oak Ridge, TN, USA rogersdm@ornl.gov

P.K. Yeung Georgia Institute of Technology Atlanta, GA, USA pk.yeung@ae.gatech.edu



# Oak Ridge National Laboratory

The world's premier research institution

- Energy
- Biology
- Neutron science
- Materials
- Security
- High-performance computing



[1]https://www.ornl.gov/ [2]https://en.wikipedia.org/wiki/Oak\_Ridge\_National\_Laboratory

# 2011: The United States Department of Energy (DOE)

- Request for Information (RFI)
- Request for Proposal (RFP)
- Codesign: technology providers collaborate closely with scientists and technologists from the DOE
- Application readiness: ensure that applications are compatible and functional on the new system
- Power-performance efficiency needed to be a top priority



Figure 1. (a) Exascale timeline and (b) system objectives from the 2011 U.S. DOE exascale research and development Request for Information.

# **2023 Retrospective view**

To develop and accelerate the necessary exascale technologies, the DOE funded a series of programs with technology companies covering processors, memory, storage, networking, and software.



Figure 2. Timeline illustrating U.S. DOE exascale R&D programs and milestones (bottom) and key AMD technology introductions (top).

# 2012 Exascale Heterogeneous Processor (EHP) V1

The research community was highly concerned about the (feared to be) imminent end of DRAM scaling.



To maximize compute and minimize the cost of data movement, the design adopts 3D stacking.



Figure 3. (a) Block diagram of the Exascale Heterogeneous Processor (EHP) concept from the original FastForward program circa 2012, (b) illustrative packaging view of the EHP.

### **2014 EHP V2**

Normalized cost per yielded mm²

To reduce silicon cost, AMD adopts chiplet technology to reuse silicon components in multiple product configurations.
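The slide gives reuse across product configurations as the reason for chiplets. A separate, commonly cited argument is die yield: smaller dies are less likely to contain a defect. The sketch below illustrates that argument with a simple Poisson yield model. The total area, the defect density, and the assumption that only known-good dies are packaged are all hypothetical values chosen for illustration, not figures from the talk, and the model ignores packaging and die-to-die interconnect overheads.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_unit(total_area_cm2: float, n_dies: int,
                          defects_per_cm2: float) -> float:
    """Silicon area consumed (cm^2) per working product.

    The product is split into n_dies equal dies. Each die is assumed to be
    tested before packaging, so a defect discards one die, not the product.
    """
    die_area = total_area_cm2 / n_dies
    return n_dies * die_area / poisson_yield(die_area, defects_per_cm2)


# Hypothetical inputs, chosen only to show the shape of the trade-off.
TOTAL_AREA_CM2 = 6.0     # assumed total silicon area of the product
DEFECTS_PER_CM2 = 0.1    # assumed defect density

monolithic = silicon_per_good_unit(TOTAL_AREA_CM2, 1, DEFECTS_PER_CM2)
chiplets = silicon_per_good_unit(TOTAL_AREA_CM2, 4, DEFECTS_PER_CM2)
cost_ratio = monolithic / chiplets

print(f"monolithic: {monolithic:.2f} cm^2 of silicon per good unit")
print(f"4 chiplets: {chiplets:.2f} cm^2 of silicon per good unit")
print(f"monolithic / chiplet silicon cost: {cost_ratio:.2f}x")
```

Under this model the ratio depends only on the product of total area and defect density, so the advantage of chiplets grows with die size and with less mature process nodes.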

AMD packaging engineers also raised concerns about the asymmetry of the overall package (all CPU on one side, all GPU on the other).



Figure 4. (a) Silicon cost trends over time and (b) an AMD EPYC<sup>TM</sup> processor utilizing chiplets.



Figure 5. Refinement of the EHP (v2), circa 2014.

# 2016 EHP V3

- The power density of the GPU regions still presents thermal challenges.
- While technically feasible, the "triple stack" of DRAM on GPU on active interposer also significantly increases manufacturing complexity.



### 2018 EHP V4

The higher bandwidth required to support data movement and work distribution among the GPU compute units would be far less efficient to route across a larger number of chiplets.



Figure 6. Refinement of the EHP (v3), circa 2016.

Figure 7. Refinement of the EHP (v4), circa 2018.

# **APU vs. Discrete Node Architecture**

APU: combines a general-purpose CPU and a GPU on a single die

- Enables faster data transfer and communication between the CPU and GPU
- Reduces power consumption, since both share the same die and memory



Discrete node architecture: more scalable and customizable

- Platforms can be customized to provide different CPU-to-GPU ratios and to interoperate with other components



Figure 8. Discrete Node Architecture consisting of interconnected CPUs (left) and accelerators (right).

# **Overview of one Frontier Computer Node**

- 9408 compute nodes housed in 74 cabinets
- CPU: 64-core AMD EPYC<sup>TM</sup> 7A53 (3rd Gen EPYC)
- GPU: MI250X (CDNA 2)
- Infinity Fabric link
- NIC

Figure 10. Block diagram of one Frontier Compute Node with peak theoretical memory and interconnect speeds. The "X+X GB/s" notation indicates X GB/s of bandwidth each for send and receive.
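As a rough sense of scale, the node count above can be combined with the per-GPU memory figures from the comparison table later in this deck. The sketch below is back-of-the-envelope arithmetic, not a published system specification. The value `GPUS_PER_NODE = 4` is an assumption not stated on this slide, and all results are theoretical peaks in decimal units.

```python
# Values taken from this deck.
NODES = 9408                  # compute nodes
CABINETS = 74                 # cabinets
HBM_PER_GPU_GB = 128          # MI250X memory size (comparison table)
HBM_BW_PER_GPU_TBS = 3.2      # MI250X memory bandwidth (comparison table)

# Assumption: number of MI250X accelerators per node is not given on the slide.
GPUS_PER_NODE = 4

total_gpus = NODES * GPUS_PER_NODE
nodes_per_cabinet = NODES / CABINETS
hbm_per_node_gb = GPUS_PER_NODE * HBM_PER_GPU_GB
total_hbm_pb = total_gpus * HBM_PER_GPU_GB / 1e6        # GB -> PB (decimal)
total_hbm_bw_pbs = total_gpus * HBM_BW_PER_GPU_TBS / 1e3  # TB/s -> PB/s

print(f"GPUs in system:        {total_gpus}")
print(f"Nodes per cabinet:     {nodes_per_cabinet:.1f} (average)")
print(f"HBM per node:          {hbm_per_node_gb} GB")
print(f"Aggregate HBM:         {total_hbm_pb:.2f} PB")
print(f"Aggregate HBM peak BW: {total_hbm_bw_pbs:.1f} PB/s")
```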

# EPYC 7003 System on Chip (SoC)

- **CCD:** Core complex die
- **GMI:** Global memory interface



Figure 1-5: EPYC 7003 System on Chip (SoC): 8 CCDs and central IOD

# **Evaluation of AMD EPYC 7A53 "Trento" CPU.**

- Trento achieves up to 180 GB/s using non-temporal loads and stores in NPS-4 mode. In NPS-1 mode, that rate drops to ~125 GB/s.
- Table 3 illustrates how caching can negatively affect bandwidth when data are not expected to fit into cache.

| Function | Temporal (MB/s) | Non-Temporal (MB/s) |
|----------|-----------------|---------------------|
| Copy     | 176780.4        | 179130.5            |
| Scale    | 107262.2        | 172396.2            |
| Add      | 125567.1        | 178356.8            |
| Triad    | 120702.1        | 178277.0            |

Table 3: CPU STREAM bandwidth results using temporal and non-temporal stores.
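One common explanation for the gap in Table 3 is write-allocate traffic: a temporal store to a line that is not in cache first reads that line from memory, and STREAM does not count this extra read. The sketch below compares the measured temporal/non-temporal ratio against that simple model. The model itself is an assumption used for illustration; the slides do not state the cause.

```python
# Table 3 values in MB/s: (temporal, non_temporal).
stream_mb_s = {
    "Copy":  (176780.4, 179130.5),
    "Scale": (107262.2, 172396.2),
    "Add":   (125567.1, 178356.8),
    "Triad": (120702.1, 178277.0),
}

# Arrays touched per loop iteration: (arrays read, arrays written).
kernel_arrays = {
    "Copy":  (1, 1),   # c[i] = a[i]
    "Scale": (1, 1),   # b[i] = s * c[i]
    "Add":   (2, 1),   # c[i] = a[i] + b[i]
    "Triad": (2, 1),   # a[i] = b[i] + s * c[i]
}


def write_allocate_ratio(reads: int, writes: int) -> float:
    """Expected temporal/non-temporal bandwidth ratio under write-allocate.

    STREAM counts reads + writes, but each temporal write also triggers a
    read of the destination line, so real traffic is reads + 2 * writes.
    """
    return (reads + writes) / (reads + 2 * writes)


measured = {}
modeled = {}
for name, (temporal, non_temporal) in stream_mb_s.items():
    measured[name] = temporal / non_temporal
    modeled[name] = write_allocate_ratio(*kernel_arrays[name])
    print(f"{name:6s} measured {measured[name]:.3f}  model {modeled[name]:.3f}")
```

Scale, Add, and Triad land within about 0.1 of the model's 2/3 and 3/4 predictions. Copy does not follow the model, since its temporal and non-temporal results are nearly identical; the slides do not say why.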

### **MI250X**



### Comparison between AMD and NVIDIA GPUs

MI300X is released today!

| Manufacturer     | AMD      | AMD      | AMD       | NVIDIA       | NVIDIA    |
|------------------|----------|----------|-----------|--------------|-----------|
| Product          | MI100    | MI250X   | MI300X    | A100         | H100      |
| Release Time     | 2020.11  | 2021.11  | 2023.12   | 2020.5/11    | 2022.3    |
| FP64             | 11.5 TF  | 95.7 TF  | 163.4 TF  | 19.5 TF      | 66.9 TF   |
| TF32             | N/A      | N/A      | 490.3 TF  | 156 TF       | 494.7 TF  |
| BF16             | 92.3 TF  | 383 TF   | 1307.4 TF | 312 TF       | 989.4 TF  |
| FP16             | 184.6 TF | 383 TF   | 1307.4 TF | 312 TF       | 989.4 TF  |
| FP8              | N/A      | N/A      | 2614.9 TF | N/A          | 1978.9 TF |
| INT8             | 184.6 TF | 383 TF   | 2614.9 TF | 624 TF       | 1978.9 TF |
| Memory Size      | 32 GB    | 128 GB   | 192 GB    | 40/80 GB     | 80 GB     |
| Memory Bandwidth | 1.2 TB/s | 3.2 TB/s | 5.3 TB/s  | 1.5/2.0 TB/s | 3 TB/s    |

**TF:** TFLOPS. Values are dense matrix/tensor core throughput.
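One way to read the FP64 and memory bandwidth rows together is machine balance: peak FLOPs divided by peak bytes per second, which is the arithmetic intensity a kernel needs before it stops being bandwidth-bound in a roofline-style model. The sketch below computes it from the table values. For the A100 it uses the 2.0 TB/s variant, which is a choice made here for illustration.

```python
# (FP64 peak in TFLOPS, memory bandwidth in TB/s), copied from the table above.
gpus = {
    "MI100":  (11.5, 1.2),
    "MI250X": (95.7, 3.2),
    "MI300X": (163.4, 5.3),
    "A100":   (19.5, 2.0),   # choice: the 2.0 TB/s variant from "1.5/2.0 TB/s"
    "H100":   (66.9, 3.0),
}


def machine_balance(tflops: float, tb_per_s: float) -> float:
    """Peak FLOPs available per byte moved (TFLOPS / TB/s = FLOP/byte)."""
    return tflops / tb_per_s


balance = {name: machine_balance(*spec) for name, spec in gpus.items()}

for name, flop_per_byte in balance.items():
    print(f"{name:7s} {flop_per_byte:5.1f} FP64 FLOP per byte")
```

A kernel whose arithmetic intensity is below a device's balance point is limited by memory bandwidth on that device, however high the peak FLOPS figure is.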

# **MI300 Series**

#### **Block diagram of the AMD Instinct MI300A and MI300X**



# Thanks & Q&A