Tutorial 3

Core-Level Performance Engineering with the Open-Source Architecture Code Analyzer (OSACA) and the Compiler Explorer

April 16, 2023

Authors: Jan Laukemann (Friedrich-Alexander-Universität, Germany), Georg Hager (Friedrich-Alexander-Universität, Germany)

Jan Laukemann is a PhD student at the Professorship for High Performance Computing at Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). Previously he finished his Master's at FAU and worked as a Research Scientist at Intel Parallel Computing Labs (Intel PCL). He works on application optimization and performance engineering for HPC systems and novel algorithms for scalable linear algebra, tensor decomposition and graph computations. His research interests primarily include x86 and non-x86 computer architectures, their performance behavior on the node level, and vectorization techniques. He is the main developer of the Open Source Architecture Code Analyzer (OSACA), a static in-core kernel analysis tool, and is part of the organizing committee of the annual HPC-AI Advisory Council Student Cluster Competition at ISC High Performance.

Georg Hager holds a PhD in Computational Physics from the University of Greifswald. Since 2021 he heads the Training and Support Division of the newly founded "Erlangen National High Performance Computing Center" (NHR@FAU). Previously he was a senior researcher in the HPC Services group at Erlangen Regional Computing Center (RRZE), which is part of the University of Erlangen-Nürnberg. Recent research includes architecture-specific optimization strategies for current microprocessors, performance engineering of scientific codes, and analytic modeling of massively parallel programs. His textbook "Introduction to High Performance Computing for Scientists and Engineers" is recommended or required reading in many HPC-related lectures and courses worldwide. He has more than two decades of experience in teaching high performance computing and performance engineering to students and scientists. Together with colleagues from NHR@FAU and other centers, he conducts a long-standing series of tutorials on Node-Level Performance Engineering and Hybrid Programming.

Abstract

While many developers put a lot of effort into optimizing large-scale parallelism, they often neglect the importance of efficient serial code. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted because no definite hardware performance limit ("bottleneck") is exhausted. This tutorial conveys the knowledge required to develop a thorough understanding of the interactions between software and hardware on the level of a single CPU core and the lowest memory hierarchy level (the L1 cache). We introduce general out-of-order core architectures and their typical performance bottlenecks using modern x86-64 (Intel Ice Lake) and ARM (Fujitsu A64FX) processors as examples. We then go into detail about x86 and AArch64 assembly code, specifically including vectorization (SIMD), pipeline utilization, critical paths, and loop-carried dependencies. We also demonstrate performance analysis and performance engineering using the Open-Source Architecture Code Analyzer (OSACA) in combination with a dedicated instance of the well-known Compiler Explorer. Various hands-on exercises will allow attendees to run their own experiments and measurements and identify in-core performance bottlenecks. Furthermore, we show real-life use cases to emphasize how profitable in-core performance engineering can be.
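As an illustration of the loop-carried dependencies mentioned above, the following is a minimal C sketch and is not taken from the tutorial material. The function names `dot_single` and `dot_unrolled4` are hypothetical. The first variant accumulates into a single variable, so each addition must wait for the previous one. The second variant keeps four independent partial sums, which gives an out-of-order core several dependency chains to overlap.

```c
#include <stddef.h>

/* Single accumulator: each addition depends on the result of the
 * previous iteration, forming one loop-carried dependency chain. */
double dot_single(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Four independent accumulators (illustrative choice): the four chains
 * do not depend on each other, so their additions can overlap. */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums, then handle leftover elements. */
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The two variants add the products in a different order, so their floating-point results can differ in the last bits for general input. This is also why a compiler will typically not perform this transformation on its own unless reassociation is explicitly allowed (for example with `-ffast-math` in GCC). Whether and by how much the second variant is faster depends on the target core and the compiler, and has to be checked by analysis or measurement.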

Outline

  • 9:00 Introduction
  • 9:10 Basic processor and core architecture
    • Intel Ice Lake (Server) architecture
    • Scheduling in an out-of-order backend
  • 9:30 Terminology and code execution on out-of-order CPUs
    • Throughput, Latency, Critical Path and Loop-carried Dependencies
    • Hands-on: Out-of-order code execution
  • 10:30 Break
  • 11:00 x86 ISA introduction
    • Understanding scalar and vectorized assembly code
  • 11:30 Performance analysis of simple kernels
    • Example: STREAM Triad
    • Hands-on: Dot product
    • Hands-on: PI by integration
  • 12:30 OSACA introduction
    • How to use OSACA
    • How to use the Compiler Explorer
    • Analyze kernels using OSACA to find potential bottlenecks
  • 1:00 Lunch
  • 2:00 In-core analysis for Arm
    • Fujitsu A64FX core architecture
    • AArch64 ISA introduction
    • Understanding scalar and vectorized Arm assembly
  • 2:30 Case study: Sparse Matrix-Vector (SpMV) Multiplication on A64FX
  • 3:00 Case study: Lattice Quantum Chromodynamics (QCD) on A64FX
  • 4:00 Hands-on: 2D Gauss-Seidel on ICX
  • 4:45 Summary and take-home messages
  • 5:00 End of tutorial