GPU Parallelization

MFC compiles GPU code via OpenACC and, in the future, will support OpenMP as well.

To allow swapping between OpenACC and OpenMP, custom GPU macros are used that translate to equivalent OpenACC and OpenMP directives. FYPP is used to process the GPU macros.

OpenACC Quick start Guide

OpenACC API Documentation


Macro API Documentation

Note: Argument ordering is not guaranteed to be stable, so use key-value pairing when invoking macros.

Data Type Meanings

  • Integer is a whole number
  • Boolean is a pythonic boolean - Valid options: True or False
  • String list is given as a comma-separated list surrounded by brackets and enclosed in quotes
    • Ex: '[hello, world, Fortran]'
  • 2-level string list is given as a comma-separated list of string lists, surrounded by brackets and enclosed in quotes; both list forms appear in the sketch below
    • Ex: '[[hello, world], [Fortran, MFC]]' or '[[hello]]'
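
As an illustrative sketch (the variable names tmp, r, vol, and avg are placeholders, and GPU_LOOP is documented below), an integer, a string list, and a 2-level string list can all be passed by keyword in a single macro call:

$:GPU_LOOP(collapse=2, private='[tmp, r]', reduction='[[vol, avg]]', reductionOp='[+]')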

Data Flow

  • Data on the GPU has a reference counter
  • When data is said to be allocated, GPU memory is allocated for it only if it is not already present in GPU memory. If the variable is already present, the reference counter is simply incremented.
  • When data is said to be deallocated, the reference counter is decremented. If the reference counter reaches zero, the data is actually deallocated from GPU memory.
  • When a pointer is said to be attached, the device pointer is attached to its target only if it is not already attached. If the pointer is already attached, the attachment counter is simply incremented.
  • When a pointer is said to be detached, the attachment counter is decremented. If the attachment counter reaches zero, the pointer is actually detached. This life cycle is illustrated in the sketch below.
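
The following is a minimal, hypothetical sketch of the reference counting (q is a placeholder array; the data macros used here are documented later on this page):

$:GPU_ENTER_DATA(copyin='[q]')     ! counter 0 -> 1: q is allocated on the GPU and copied in
#:call GPU_DATA(copy='[q]')
    ! counter 1 -> 2: q is already present, so no new allocation or transfer occurs
    ...
#:endcall GPU_DATA
! counter 2 -> 1 at region exit: q stays resident on the GPU
$:GPU_EXIT_DATA(delete='[q]')      ! counter 1 -> 0: q is actually deallocated from GPU memory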

Computation Macros

GPU_PARALLEL_LOOP(Execute the following loop on the GPU in parallel)

Macro Invocation

In order to parallelize a loop, simply place two macro calls on either end of the loop:

$:GPU_PARALLEL_LOOP(...)
{code}
$:END_GPU_PARALLEL_LOOP()

This wraps the enclosed code with OpenACC or OpenMP parallelization directives, depending on the environment and compiler settings.

Parameters

name data type Default Value description
code code Required Region of code where the GPU parallelizes loops
collapse integer None Number of loops to combine into 1 loop
parallelism string list '[gang,vector]' Parallelism granularity to use for this loop
default string 'present' Implicit assumptions compiler should make
private string list None Variables that are private to each iteration/thread
firstprivate string list None Initialized variables that are private to each iteration/thread
reduction 2-level string list None Variables unique to each iteration and reduced at the end
reductionOp string list None Operator that each list of reduction will reduce with
copy string list None Allocates and copies data to GPU on entrance, then deallocates and copies it back to CPU on exit
copyin string list None Allocates and copies data to GPU on entrance, then deallocates it on exit
copyinReadOnly string list None Allocates and copies read-only data to GPU on entrance, then deallocates it on exit
copyout string list None Allocates data on GPU on entrance and then deallocates and copies to CPU on exit
create string list None Allocates data on GPU on entrance and then deallocates on exit
no_create string list None Use data in CPU memory unless data is already in GPU memory
present string list None Data that must be present in GPU memory. Increment counter on entrance, decrement on exit
deviceptr string list None Pointer variables that are already allocated on GPU memory
attach string list None Attaches device pointers to their targets on entrance, then detaches them on exit
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Parameter Restrictions

name Restricted range
collapse Must be greater than 1
parallelism Valid elements: 'gang', 'worker', 'vector', 'seq'
default 'present' or 'none'

Additional information

  • default present means that any non-scalar data is assumed to be present on the GPU
  • default none means that the compiler should not implicitly determine the data attributes for any variable
  • reduction and reductionOp must match in length
  • With reduction='[[sum1, sum2], [largest]]' and reductionOp='[+, max]', sum1 and sum2 will each hold the sum of their values over all loop iterations, and largest will hold the maximum value of largest over all loop iterations
  • A reduction implies a copy, so it does not need to be added for both

Example

#:call GPU_PARALLEL_LOOP(collapse=3, private='[tmp, r]', reduction='[[vol, avg], [max_val]]', reductionOp='[+, MAX]')
{code}
...
#:endcall GPU_PARALLEL_LOOP
#:call GPU_PARALLEL_LOOP(collapse=2, private='[sum_holder]', copyin='[starting_sum]', copyout='[eigenval,C]')
{code}
...
#:endcall GPU_PARALLEL_LOOP
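
As a more concrete sketch (u(m, n) and total are hypothetical and not taken from the MFC source), a collapsed loop nest with a sum reduction might look like:

total = 0d0
#:call GPU_PARALLEL_LOOP(collapse=2, reduction='[[total]]', reductionOp='[+]')
    do j = 1, n
        do i = 1, m
            total = total + u(i, j)
        end do
    end do
#:endcall GPU_PARALLEL_LOOP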

GPU_LOOP(Execute loop on GPU)

Macro Invocation

Uses FYPP eval directive using $:

$:GPU_LOOP(...)

Parameters

name data type Default Value description
collapse integer None Number of loops to combine into 1 loop
parallelism string list None Parallelism granularity to use for this loop
data_dependency string None 'independent' -> assert that loop iterations are independent; 'auto' -> let the compiler analyze dependencies
private string list None Variables that are private to each iteration/thread
reduction 2-level string list None Variables unique to each iteration and reduced at the end
reductionOp string list None Operator that each list of reduction will reduce with
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Parameter Restrictions

name Restricted range
collapse Must be greater than 1
parallelism Valid elements: 'gang', 'worker', 'vector', 'seq'
data_dependency 'auto' or 'independent'

Additional information

  • Loop parallelism is most commonly '[seq]'
  • reduction and reductionOp must match in length
  • With reduction='[[sum1, sum2], [largest]]' and reductionOp='[+, max]', sum1 and sum2 will each hold the sum of their values over all loop iterations, and largest will hold the maximum value of largest over all loop iterations

Example

$:GPU_LOOP(parallelism='[seq]')
$:GPU_LOOP(collapse=3, parallelism='[seq]',private='[tmp, r]')
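
A common pattern (sketched here with hypothetical names rho, alpha_rho, mixture_rho, and num_fluids) is a sequential inner loop nested inside a parallel outer loop:

#:call GPU_PARALLEL_LOOP(collapse=2, private='[rho]')
    do j = 1, n
        do i = 1, m
            rho = 0d0
            $:GPU_LOOP(parallelism='[seq]')
            do q = 1, num_fluids
                rho = rho + alpha_rho(i, j, q)
            end do
            mixture_rho(i, j) = rho
        end do
    end do
#:endcall GPU_PARALLEL_LOOP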

GPU_PARALLEL(Execute the following on the GPU in parallel)

Macro Invocation

Uses FYPP call directive using #:call

#:call GPU_PARALLEL(...)
{code}
#:endcall GPU_PARALLEL

Parameters

name data type Default Value description
code code Required Region of code where a kernel is launched on the GPU
default string 'present' Implicit assumptions compiler should make
private string list None Variables that are private to each iteration/thread
firstprivate string list None Initialized variables that are private to each iteration/thread
reduction 2-level string list None Variables unique to each iteration and reduced at the end
reductionOp string list None Operator that each list of reduction will reduce with
copy string list None Allocates and copies data to GPU on entrance, then deallocates and copies it back to CPU on exit
copyin string list None Allocates and copies data to GPU on entrance, then deallocates it on exit
copyinReadOnly string list None Allocates and copies read-only data to GPU on entrance, then deallocates it on exit
copyout string list None Allocates data on GPU on entrance and then deallocates and copies to CPU on exit
create string list None Allocates data on GPU on entrance and then deallocates on exit
no_create string list None Use data in CPU memory unless data is already in GPU memory
present string list None Data that must be present in GPU memory. Increment counter on entrance, decrement on exit
deviceptr string list None Pointer variables that are already allocated on GPU memory
attach string list None Attaches device pointers to their targets on entrance, then detaches them on exit
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Parameter Restrictions

name Restricted range
default 'present' or 'none'

Additional information

  • default present means that any non-scalar data is assumed to be present on the GPU
  • default none means that the compiler should not implicitly determine the data attributes for any variable
  • reduction and reductionOp must match in length
  • With reduction='[[sum1, sum2], [largest]]' and reductionOp='[+, max]', sum1 and sum2 will each hold the sum of their values over all loop iterations, and largest will hold the maximum value of largest over all loop iterations
  • A reduction implies a copy, so it does not need to be added for both

Example

#:call GPU_PARALLEL()
{code}
...
#:endcall GPU_PARALLEL
#:call GPU_PARALLEL(create='[pixel_arr]', copyin='[initial_index]')
{code}
...
#:endcall
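
As a more concrete sketch (v and m are hypothetical names), GPU_PARALLEL opens a parallel region in which individual loops can then be mapped with GPU_LOOP:

#:call GPU_PARALLEL(copyout='[v]')
    $:GPU_LOOP(parallelism='[gang, vector]')
    do i = 1, m
        v(i) = 0d0
    end do
#:endcall GPU_PARALLEL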


Data Control Macros

GPU_DATA(Make data accessible on GPU in specified region)

Macro Invocation

Uses FYPP call directive using #:call

#:call GPU_DATA(...)
{code}
#:endcall GPU_DATA

Parameters

name data type Default Value description
code code Required Region of code where defined data is accessible
copy string list None Allocates and copies a variable to GPU on entrance, then deallocates and copies it back to CPU on exit
copyin string list None Allocates and copies data to GPU on entrance, then deallocates it on exit
copyinReadOnly string list None Allocates and copies a read-only variable to GPU on entrance, then deallocates it on exit
copyout string list None Allocates data on GPU on entrance and then deallocates and copies to CPU on exit
create string list None Allocates data on GPU on entrance and then deallocates on exit
no_create string list None Use data in CPU memory unless data is already in GPU memory
present string list None Data that must be present in GPU memory. Increment counter on entrance, decrement on exit
deviceptr string list None Pointer variables that are already allocated on GPU memory
attach string list None Attaches device pointers to their targets on entrance, then detaches them on exit
default string None Implicit assumptions compiler should make
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Parameter Restrictions

name Restricted range
code Do not assign it manually with key-value pairing

Example

#:call GPU_DATA(copy='[pixel_arr]', copyin='[starting_pixels, initial_index]',attach='[p_real, p_cmplx, p_fltr_cmplx]')
{code}
...
#:endcall GPU_DATA
#:call GPU_DATA(create='[pixel_arr]', copyin='[initial_index]')
{code}
...
#:endcall
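
As an illustrative sketch (u and v are hypothetical arrays of length m), a data region keeps arrays resident on the GPU across several parallel loops so they are transferred only once:

#:call GPU_DATA(copyin='[u]', copyout='[v]')
    #:call GPU_PARALLEL_LOOP()
        do i = 1, m
            v(i) = 2d0*u(i)
        end do
    #:endcall GPU_PARALLEL_LOOP
    #:call GPU_PARALLEL_LOOP()
        do i = 1, m
            v(i) = v(i) + u(i)
        end do
    #:endcall GPU_PARALLEL_LOOP
#:endcall GPU_DATA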

GPU_ENTER_DATA(Allocate/move data to GPU until matching GPU_EXIT_DATA or program termination)

Macro Invocation

Uses FYPP eval directive using $:

$:GPU_ENTER_DATA(...)

Parameter

name data type Default Value description
copyin string list None Allocates and copies data to GPU on entrance
copyinReadOnly string list None Allocates and copies a readonly variable to GPU on entrance
create string list None Allocates data on GPU on entrance
attach string list None Attaches device pointer to device targets on entrance
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Example

$:GPU_ENTER_DATA(copyin='[pixels_arr]', copyinReadOnly='[starting_pixels, initial_index]')
$:GPU_ENTER_DATA(create='[bc_buffers(1:num_dims, -1:1)]', copyin='[initial_index]')

GPU_EXIT_DATA(Deallocate/move data from GPU created by GPU_ENTER_DATA)

Macro Invocation

Uses FYPP eval directive using $:

$:GPU_EXIT_DATA(...)

Parameters

name data type Default Value description
copyout string list None Deallocates and copies data from GPU to CPU on exit
delete string list None Deallocates data on GPU on exit
detach string list None Detach device pointer from device targets on exit
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Example

$:GPU_EXIT_DATA(copyout='[pixels_arr]', delete='[starting_pixels, initial_index]')
$:GPU_EXIT_DATA(delete='[bc_buffers(1:num_dims, -1:1)]', copyout='[initial_index]')
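
A typical pairing (sketched with hypothetical names q_cons and sys_size; not taken from the MFC source) allocates GPU memory when a module is set up and frees it when the module is finalized:

subroutine s_initialize_example_module()
    allocate (q_cons(1:sys_size))
    $:GPU_ENTER_DATA(create='[q_cons]')
end subroutine s_initialize_example_module

subroutine s_finalize_example_module()
    $:GPU_EXIT_DATA(delete='[q_cons]')
    deallocate (q_cons)
end subroutine s_finalize_example_module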

GPU_DECLARE(Allocate module variables on GPU or for implicit data region)

Macro Invocation

Uses FYPP eval directive using $:

$:GPU_DECLARE(...)

Parameters

name data type Default Value description
copy string list None Allocates and copies data to GPU on entrance, then deallocates and copies it back to CPU on exit
copyin string list None Allocates and copies data to GPU on entrance, then deallocates it on exit
copyinReadOnly string list None Allocates and copies a read-only variable to GPU on entrance, then deallocates it on exit
copyout string list None Allocates data on GPU on entrance and then deallocates and copies to CPU on exit
create string list None Allocates data on GPU on entrance and then deallocates on exit
present string list None Data that must be present in GPU memory. Increment counter on entrance, decrement on exit
deviceptr string list None Pointer variables that are already allocated on GPU memory
link string list None Declare global link, and only allocate when variable used in data clause.
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Additional information

  • An implicit data region is created at the start of each procedure and ends after the last executable statement in that procedure.
  • Use only create, copyin, device_resident or link clauses for module variables
  • GPU_DECLARE exit is the end of the implicit data region
  • Link is useful for large global static data objects

Example

$:GPU_DECLARE(create='[x_cb,y_cb,z_cb,x_cc,y_cc,z_cc,dx,dy,dz,dt,m,n,p]')
$:GPU_DECLARE(create='[x_cb,y_cb,z_cb]', copyin='[x_cc,y_cc,z_cc]', link='[dx,dy,dz,dt,m,n,p]')
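
For instance (a hypothetical module sketch; the names are placeholders), module variables are declared for the GPU next to their host declarations and can later be populated with GPU_UPDATE:

module m_example_parameters
    implicit none
    integer :: m, n, p
    real(kind(0d0)), allocatable, dimension(:) :: x_cc
    $:GPU_DECLARE(create='[m, n, p, x_cc]')
end module m_example_parameters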

GPU_UPDATE(Updates data from CPU to GPU or GPU to CPU)

Macro Invocation

Uses FYPP eval directive using $:

$:GPU_UPDATE(...)

Parameters

name data type Default Value description
host string list None Updates data from GPU to CPU
device string list None Updates data from CPU to GPU
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Example

$:GPU_UPDATE(host='[arr1, arr2]')
$:GPU_UPDATE(host='[updated_gpu_val]', device='[updated_cpu_val]')
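
For example (an illustrative sketch; dt and q_cons are placeholder variables already present on the GPU), a value recomputed on the CPU is pushed to the GPU, and results are later pulled back for output:

dt = 0.5d0*dt                   ! recompute the time step on the CPU
$:GPU_UPDATE(device='[dt]')     ! push the new value to the GPU
...
$:GPU_UPDATE(host='[q_cons]')   ! pull results back to the CPU for I/O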

GPU_HOST_DATA(Make GPU memory address available on CPU)

Macro Invocation

Uses FYPP call directive using #:call

#:call GPU_HOST_DATA(...)
{code}
#:endcall GPU_HOST_DATA

Parameters

name data type Default Value description
code code Required Region of code where GPU memory addresses are accessible
use_device_addr string list None Use GPU memory address of variable instead of CPU memory address
use_device_ptr string list None Use GPU pointer of pointers instead of CPU pointer
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Parameter Restrictions

name Restricted range
code Do not assign it manually with key-value pairing

Example

#:call GPU_HOST_DATA(use_device_addr='[addr1, addr2]')
{code}
...
#:endcall GPU_HOST_DATA
#:call GPU_HOST_DATA(use_device_addr='[display_arr]')
{code}
...
#:endcall
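
As an illustrative sketch (my_device_fft is a hypothetical, device-aware library routine, and p_real and p_cmplx are placeholder arrays already present on the GPU):

#:call GPU_HOST_DATA(use_device_addr='[p_real, p_cmplx]')
    ! Inside this region, p_real and p_cmplx refer to GPU addresses,
    ! so they can be passed directly to a device-aware library call.
    call my_device_fft(p_real, p_cmplx)
#:endcall GPU_HOST_DATA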


Synchronization Macros

GPU_WAIT(Makes CPU wait for async GPU activities)

Macro Invocation

Uses FYPP eval directive using $:

$:GPU_WAIT(...)

Parameters

name data type Default Value description
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Example

$:GPU_WAIT()

GPU_ATOMIC(Do an atomic operation on the GPU)

Macro Invocation

Uses FYPP eval directive using $:

$:GPU_ATOMIC(...)

Parameters

name data type Default Value description
atomic string Required Which atomic operation is performed
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Parameter Restrictions

name Restricted range
atomic 'read', 'write', 'update', or 'capture'

Additional information

  • read atomic is reading in a value
    • Ex: v = x
  • write atomic is writing a value to a variable
    • Ex: x = square(tmp)
  • update atomic is updating a variable in-place
    • Ex: x = x .and. 1
  • capture is a pair of read/write/update operations with one dependent on the other
    • Ex:
      x = x .and. 1
      v = x

Example

$:GPU_ATOMIC(atomic='update')
x = square(x)
$:GPU_ATOMIC(atomic='capture')
x = square(x)
v = x
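
A common use (sketched with hypothetical names counts, bin_id, b, and m) is an atomic update inside a parallel loop, e.g. when building a histogram:

#:call GPU_PARALLEL_LOOP(private='[b]')
    do i = 1, m
        b = bin_id(i)
        $:GPU_ATOMIC(atomic='update')
        counts(b) = counts(b) + 1
    end do
#:endcall GPU_PARALLEL_LOOP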


Miscellaneous Macros

GPU_ROUTINE(Compile a procedure for the GPU)

Macro Invocation

Uses FYPP eval directive using $:

$:GPU_ROUTINE(...)

Parameters

name data type Default Value description
function_name string None Name of subroutine/function
parallelism string list None Parallelism granularity to use for this routine
nohost boolean False Do not compile procedure code for CPU
cray_inline boolean False Inline procedure on cray compiler
extraAccArgs string None String of any extra arguments added to the OpenACC directive
extraOmpArgs string None String of any extra arguments added to the OpenMP directive

Parameter Restrictions

name Restricted range
parallelism Valid elements: 'gang', 'worker', 'vector', 'seq'

Additional information

  • Function name only needs to be given when cray_inline is True
  • Future capability is to parse function header for function name
  • Routine parallelism is most commonly '[seq]'

Example

$:GPU_ROUTINE(parallelism='[seq]')
$:GPU_ROUTINE(function_name='s_matmult', parallelism='[seq]', cray_inline=True)
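
For example (an illustrative sketch; s_scale is a hypothetical procedure), the macro is placed in the specification part of a procedure that will be called from GPU code:

subroutine s_scale(a, s)
    $:GPU_ROUTINE(parallelism='[seq]')
    real(kind(0d0)), intent(inout) :: a
    real(kind(0d0)), intent(in) :: s

    a = a*s
end subroutine s_scale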

GPU_CACHE(Data to be cached in software-managed cache)

Macro Invocation

Uses FYPP eval directive using $:

$:GPU_CACHE(...)

Parameters

name data type Default Value description
cache string list Required Data that should be stored in cache
extraAccArgs string None String of any extra arguments added to the OpenACC directive

NOTE: Does not do anything for OpenMP currently.

Example

$:GPU_CACHE(cache='[pixels_arr]')


Debugging Tools and Tips for GPUs

Compiler agnostic tools

OpenMP tools

OMP_DISPLAY_ENV=true | false | verbose
  • Prints out the internal control values and environment variables at the beginning of the program if true or verbose
  • verbose will also print out vendor-specific internal control values and environment variables
OMP_TARGET_OFFLOAD = MANDATORY | DISABLED | DEFAULT
  • Quick way to turn off offloading (DISABLED) or make the program abort if a GPU isn't found (MANDATORY)
  • Great first test: does the problem disappear when you drop back to the CPU?
OMP_THREAD_LIMIT=<positive_integer>
  • Sets the maximum number of OpenMP threads to use in a contention group
  • Might be useful in checking for issues with contention or race conditions
OMP_DISPLAY_AFFINITY=TRUE
  • Will display affinity bindings for each OpenMP thread, containing hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding.

Cray Compiler Tools

Cray General Options

CRAY_ACC_DEBUG: 0 (off), 1, 2, 3 (very noisy)
  • Dumps a time-stamped log line (ACC: ...) for every allocation, data transfer, kernel launch, wait, etc. Great first stop when "nothing seems to run on the GPU".
  • Outputs on STDERR by default. Can be changed by setting CRAY_ACC_DEBUG_FILE.
    • Recognizes stderr, stdout, and process.
    • process automatically generates a new file based on pid (each MPI process will have a different file)
  • While this environment variable specifies ACC, it can be used for both OpenACC and OpenMP
CRAY_ACC_FORCE_EARLY_INIT=1
  • Force full GPU initialization at program start so you can see start-up hangs immediately
  • Default behavior without an environment variable is to defer initialization on first use
  • Device initialization includes initializing the GPU vendor’s low-level device runtime library (e.g., libcuda for NVIDIA GPUs) and establishing all necessary software contexts for interacting with the device

Cray OpenACC Options

CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1
  • Will cause acc_present_dump() to output variable names and file locations in addition to variable mappings
  • Add acc_present_dump() around hotspots to help find problems with data movements
    • Helps more if adding CRAY_ACC_DEBUG environment variable

NVHPC Compiler Options

NVHPC General Options

STATIC_RANDOM_SEED=1
  • Forces the seed returned by RANDOM_SEED to be constant, so it generates the same sequence of random numbers
  • Useful for testing issues with randomized data
NVCOMPILER_TERM=option[,option]
  • [no]debug: Enables/disables just-in-time debugging (debugging invoked on error)
  • [no]trace: Enables/disables stack traceback on error

NVHPC OpenACC Options

NVCOMPILER_ACC_NOTIFY=<bitmask>
  • Assign the environment variable to a bitmask to print out information to stderr for the following
    • kernel launches: 1
    • data transfers: 2
    • region entry/exit: 4
    • wait operation of synchronizations with the device: 8
    • device memory allocations and deallocations: 16
  • 1 (kernels only) is the usual first step. 3 (kernels + copies) is great for "why is it so slow?"
NVCOMPILER_ACC_TIME=1
  • Lightweight profiler
  • Prints a tidy end-of-run table with per-region and per-kernel times and bytes moved
  • Do not use with CUDA profiler at the same time
NVCOMPILER_ACC_DEBUG=1
  • Spews everything the runtime sees: host/device addresses, mapping events, present-table look-ups, etc.
  • Great for "partially present" or "pointer went missing" errors.
  • Doc for NVCOMPILER_ACC_DEBUG
    • Ctrl+F for NVCOMPILER_ACC_DEBUG

NVHPC OpenMP Options

LIBOMPTARGET_PROFILE=run.json
  • Emits a Chrome-trace (JSON) timeline you can open in chrome://tracing or Speedscope
  • Great lightweight profiler when Nsight is overkill.
  • Granularity in µs via LIBOMPTARGET_PROFILE_GRANULARITY (default 500).
LIBOMPTARGET_INFO=<bitmask>
  • Prints out different types of runtime information
  • Human-readable log of data-mapping inserts/updates, kernel launches, copies, waits.
  • Perfect first stop for "why is nothing copied?"
  • Flags
    • Print all data arguments upon entering an OpenMP device kernel: 0x01
    • Indicate when a mapped address already exists in the device mapping table: 0x02
    • Dump the contents of the device pointer map at kernel exit: 0x04
    • Indicate when an entry is changed in the device mapping table: 0x08
    • Print OpenMP kernel information from device plugins: 0x10
    • Indicate when data is copied to and from the device: 0x20
LIBOMPTARGET_DEBUG=1
  • Developer-level trace (host-side)
  • Much noisier than INFO
  • Only works if the runtime was built with -DOMPTARGET_DEBUG.
LIBOMPTARGET_JIT_OPT_LEVEL=-O{0,1,2,3}
  • This environment variable can be used to change the optimization pipeline used to optimize the embedded device code as part of the device JIT.
  • The value corresponds to the -O{0,1,2,3} command line argument passed to clang.
LIBOMPTARGET_JIT_SKIP_OPT=1
  • This environment variable can be used to skip the optimization pipeline during JIT compilation.
  • If set, the image will only be passed through the backend.
  • The backend is invoked with the LIBOMPTARGET_JIT_OPT_LEVEL flag.

Compiler Documentation