Performance

MFC has been benchmarked on several CPU and GPU devices. This page summarizes those results.

Figure of merit: Grind time performance

The following table outlines observed performance as nanoseconds per grid point (ns/gp) per equation (eq) per right-hand side (rhs) evaluation (lower is better), also known as the grind time. We solve an example 3D, inviscid, 5-equation model problem with two advected species (8 PDEs) and 8M grid points (158-cubed uniform grid). The numerics are WENO5 finite-volume reconstruction and an HLLC approximate Riemann solver. This case is located in examples/3D_performance_test. Right after building MFC, you can run it for CPU cases via ./mfc.sh run -n <num_processors> -j $(nproc) ./examples/3D_performance_test/case.py -t pre_process simulation --case-optimization, which builds a version of the code optimized for this case and then executes it. For benchmarking GPU devices, use -n <num_gpus>, where <num_gpus> should likely be 1. If the above does not work on your machine, see the rest of this documentation for other ways to use the ./mfc.sh run command.
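For concreteness, typical invocations might look like the sketch below. The process counts (-n) are only examples and should be adjusted to your machine; the flags are those described above.

```shell
# Illustrative invocations; adjust the process counts (-n) for your machine.

# CPU benchmark, e.g., on a 32-core socket:
./mfc.sh run -n 32 -j $(nproc) ./examples/3D_performance_test/case.py \
    -t pre_process simulation --case-optimization

# GPU benchmark on a single device:
./mfc.sh run -n 1 -j $(nproc) ./examples/3D_performance_test/case.py \
    -t pre_process simulation --case-optimization
```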

Results are for MFC v4.9.3 (July 2024 release), though numbers have not changed meaningfully since then. Similar performance is also seen for other problem configurations, such as the Euler equations (4 PDEs). All results are for the compiler that gave the best performance. Note:

  • CPU benchmarks may be run on CPUs with more cores than listed in the table; we sweep the core count on each device and report the best performance achieved on a single socket (or die).
  • GPU results are for a single GPU device. For single-precision (SP) GPUs, computation is still performed in double precision via compiler/software conversion; these numbers do not reflect single-precision computation. AMD MI250X and MI300A GPUs have multiple compute dies per socket; we report results for one GCD* (graphics compute die) of the MI250X and for the entire MI300A APU (6 XCDs). One can quickly estimate full-device runtime by dividing the grind time by the number of GCDs on the device (the MI250X has 2 GCDs). We gratefully acknowledge LLNL, HPE/Cray, and AMD for permission to release the MI300A performance numbers.
| Hardware | Details | Type | Usage | Grind Time [ns] | Compiler | Computer |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA GH200 | GPU only | APU | 1 GPU | 0.32 | NVHPC 24.1 | GT Rogues Gallery |
| NVIDIA H100 | SXM5 | GPU | 1 GPU | 0.38 | NVHPC 24.5 | GT ICE |
| NVIDIA H100 | PCIe | GPU | 1 GPU | 0.45 | NVHPC 24.5 | GT Rogues Gallery |
| AMD MI300A | | APU | 1 APU | 0.60 | CCE 18.0.0 | LLNL Tioga |
| NVIDIA A100 | | GPU | 1 GPU | 0.62 | NVHPC 22.11 | GT Phoenix |
| NVIDIA V100 | | GPU | 1 GPU | 0.99 | NVHPC 22.11 | GT Phoenix |
| NVIDIA A30 | | GPU | 1 GPU | 1.1 | NVHPC 24.1 | GT Rogues Gallery |
| AMD MI250X | | GPU | 1 GCD* | 1.1 | CCE 16.0.1 | OLCF Frontier |
| AMD EPYC 9965 | Turin, Zen5c | CPU | 192 cores | 1.2 | AOCC 5.0.0 | AMD Volcano |
| AMD MI100 | | GPU | 1 GPU | 1.4 | CCE 16.0.1 | Cray internal system |
| AMD EPYC 9755 | Turin, Zen5 | CPU | 128 cores | 1.4 | AOCC 5.0.0 | AMD Volcano |
| Intel Xeon 6980P | Granite Rapids | CPU | 128 cores | 1.4 | Intel 2024.2 | Intel Endeavour |
| NVIDIA L40S | FP32-only GPU | GPU | 1 GPU | 1.7 | NVHPC 24.5 | GT ICE |
| AMD EPYC 9654 | Genoa, Zen4 | CPU | 96 cores | 1.7 | Intel 2021.9 | DOD Carpenter |
| Intel Xeon 6960P | Granite Rapids | CPU | 72 cores | 1.7 | Intel 2024.2 | Intel AI Cloud |
| NVIDIA P100 | | GPU | 1 GPU | 2.4 | NVHPC 23.5 | GT CSE Internal |
| Intel Xeon 8592+ | Emerald Rapids | CPU | 64 cores | 2.6 | Intel 2024.2 | Intel AI Cloud |
| Intel Xeon 6900E | Sierra Forest Adv., 2.8GHz Boost, 384 MiB L3 | CPU | 192 cores | 2.6 | Intel 2024.2 | Intel AI Cloud |
| AMD EPYC 9534 | Genoa, Zen4 | CPU | 64 cores | 2.7 | GNU 12.3.0 | GT Phoenix |
| NVIDIA A40 | FP32-only GPU | GPU | 1 GPU | 3.3 | NVHPC 22.11 | NCSA Delta |
| Intel Xeon Max 9468 | Sapphire Rapids HBM | CPU | 48 cores | 3.5 | NVHPC 24.5 | GT Rogues Gallery |
| NVIDIA Grace CPU | Arm, Neoverse V2 | CPU | 72 cores | 3.7 | NVHPC 24.1 | GT Rogues Gallery |
| NVIDIA RTX6000 | FP32-only GPU | GPU | 1 GPU | 3.9 | NVHPC 22.11 | GT Phoenix |
| AMD EPYC 7763 | Milan, Zen3 | CPU | 64 cores | 4.1 | GNU 11.4.0 | NCSA Delta |
| Intel Xeon 6740E | Sierra Forest | CPU | 92 cores | 4.2 | Intel 2024.2 | Intel AI Cloud |
| NVIDIA A10 | FP32-only GPU | GPU | 1 GPU | 4.3 | NVHPC 24.1 | TAMU Faster |
| AMD EPYC 7713 | Milan, Zen3 | CPU | 64 cores | 5.0 | GNU 12.3.0 | GT Phoenix |
| Intel Xeon 8480CL | Sapphire Rapids | CPU | 56 cores | 5.0 | NVHPC 24.5 | GT Phoenix |
| Intel Xeon 6454S | Sapphire Rapids | CPU | 32 cores | 5.6 | NVHPC 24.5 | GT Rogues Gallery |
| Intel Xeon 8462Y+ | Sapphire Rapids | CPU | 32 cores | 6.2 | GNU 12.3.0 | GT ICE |
| Intel Xeon 6548Y+ | Emerald Rapids | CPU | 32 cores | 6.6 | Intel 2021.9 | GT ICE |
| Intel Xeon 8352Y | Ice Lake | CPU | 32 cores | 6.6 | NVHPC 24.5 | GT Rogues Gallery |
| Ampere Altra Q80-28 | Arm, Neoverse-N1 | CPU | 80 cores | 6.8 | GNU 12.2.0 | OLCF Wombat |
| AMD EPYC 7513 | Milan, Zen3 | CPU | 32 cores | 7.4 | GNU 12.3.0 | GT ICE |
| Intel Xeon 8268 | Cascade Lake | CPU | 24 cores | 7.5 | Intel 2024.2 | TAMU ACES |
| AMD EPYC 7452 | Rome, Zen2 | CPU | 32 cores | 8.4 | GNU 12.3.0 | GT ICE |
| NVIDIA T4 | FP32-only GPU | GPU | 1 GPU | 8.8 | NVHPC 24.1 | TAMU Faster |
| Intel Xeon 8160 | Skylake | CPU | 24 cores | 8.9 | Intel 2024.0 | TACC Stampede3 |
| IBM Power10 | | CPU | 24 cores | 10 | GNU 13.3.1 | GT Rogues Gallery |
| AMD EPYC 7401 | Naples, Zen(1) | CPU | 24 cores | 10 | GNU 10.3.1 | LLNL Corona |
| Intel Xeon 6226 | Cascade Lake | CPU | 12 cores | 17 | GNU 12.3.0 | GT ICE |
| Apple M1 Max | | CPU | 10 cores | 20 | GNU 14.1.0 | N/A |
| IBM Power9 | | CPU | 20 cores | 21 | GNU 9.1.0 | OLCF Summit |
| Cavium ThunderX2 | Arm | CPU | 32 cores | 21 | GNU 13.2.0 | SBU Ookami |
| Arm Cortex-A78AE | Arm, BlueField3 | CPU | 16 cores | 25 | NVHPC 24.5 | GT Rogues Gallery |
| Intel Xeon E5-2650V4 | Broadwell | CPU | 12 cores | 27 | NVHPC 23.5 | GT CSE Internal |
| Apple M2 | | CPU | 8 cores | 32 | GNU 14.1.0 | N/A |
| Intel Xeon E7-4850V3 | Haswell | CPU | 14 cores | 34 | GNU 9.4.0 | GT CSE Internal |
| Fujitsu A64FX | Arm | CPU | 48 cores | 63 | GNU 13.2.0 | SBU Ookami |

All grind times are in nanoseconds (ns) per grid point (gp) per equation (eq) per right-hand side (rhs) evaluation, so X ns/gp/eq/rhs. Lower is better.
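As a sketch of how a grind time can be computed from a measured run: divide the simulation wall time by the product of time steps, rhs evaluations per step, grid points, and equations. All values below are illustrative placeholders (for example, 3 rhs evaluations per step assumes a 3-stage Runge-Kutta scheme), not measured results.

```shell
# Sketch: grind time = wall_time / (time_steps * rhs_per_step * grid_points * equations).
# All values below are illustrative placeholders, not measurements.
wall_time_s=100.0   # wall time spent in the time-stepping loop [s]
time_steps=1000     # number of time steps taken
rhs_per_step=3      # rhs evaluations per step (3 assumes a 3-stage Runge-Kutta scheme)
grid_points=8000000 # total grid points in the case
equations=8         # number of PDEs solved
awk -v t="$wall_time_s" -v n="$time_steps" -v r="$rhs_per_step" \
    -v g="$grid_points" -v e="$equations" \
    'BEGIN { printf "grind time: %.2f ns/gp/eq/rhs\n", 1e9 * t / (n * r * g * e) }'
```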

Weak scaling

Weak scaling results are obtained by increasing the problem size with the number of processes so that work per process remains constant.
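The efficiencies quoted below follow the usual weak-scaling definition: with the work per process held fixed, efficiency is the base-case runtime divided by the runtime at the larger process count. A minimal sketch with illustrative numbers:

```shell
# Weak-scaling efficiency = T_base / T_N (work per process held fixed).
# Illustrative numbers, not measurements.
t_base=10.0   # runtime on the base process count [s]
t_n=10.4      # runtime on N processes [s]
awk -v tb="$t_base" -v tn="$t_n" \
    'BEGIN { printf "weak-scaling efficiency: %.0f%%\n", 100 * tb / tn }'
```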

AMD MI250X GPU

MFC weak scales to (at least) 65,536 AMD MI250X GPUs on OLCF Frontier with 96% efficiency. This corresponds to 87% of the entire machine.

NVIDIA V100 GPU

MFC weak scales to (at least) 13,824 NVIDIA V100 GPUs on OLCF Summit with 97% efficiency. This corresponds to 50% of the entire machine.

IBM Power9 CPU

MFC weak scales to 13,824 IBM Power9 CPU cores on OLCF Summit to within 1% of ideal scaling.

Strong scaling

Strong scaling results are obtained by keeping the problem size constant and increasing the number of processes so that work per process decreases.
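Strong-scaling results are typically reported as speedup and efficiency relative to a base case. A minimal sketch of that arithmetic follows; the base process count of 8 matches the GPU tests described below, but all runtimes and the scaled process count are illustrative placeholders.

```shell
# Strong-scaling speedup = T_base / T_N; efficiency = speedup / (N / N_base).
# Illustrative numbers, not measurements.
n_base=8      # processes in the base case
n=64          # processes in the scaled run
t_base=20.0   # base-case runtime [s]
t_n=3.1       # runtime on n processes [s]
awk -v nb="$n_base" -v n="$n" -v tb="$t_base" -v tn="$t_n" \
    'BEGIN { s = tb / tn; printf "speedup: %.1fx, efficiency: %.0f%%\n", s, 100 * s / (n / nb) }'
```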

NVIDIA V100 GPU

These tests use a base case of 8 GPUs with one MPI process per GPU. Performance is analyzed at two problem sizes, 16M and 64M grid points, corresponding to 2M and 8M grid points per process in the base case, respectively.

16M Grid Points

64M Grid Points

IBM Power9 CPU

CPU strong-scaling tests use problem sizes of 16M, 32M, and 64M grid points, with the base case using 2M, 4M, and 8M grid points per process, respectively.