MFC
High-fidelity multiphase flow simulation
MFC compiles GPU code via OpenACC, with OpenMP support planned for the future.
To switch between OpenACC and OpenMP, MFC uses custom GPU macros that translate to the equivalent OpenACC and OpenMP directives. FYPP processes these macros.
Note: Argument ordering is not guaranteed to be stable, so pass macro arguments as key-value pairs.
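Example values for the data types used in the parameter tables:
boolean: True or False
string list: '[hello, world, Fortran]'
2-level string list: '[[hello, world], [Fortran, MFC]]' or '[[hello]]'
GPU_PARALLEL_LOOP – (Execute the following loop on the GPU in parallel)
Macro Invocation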
To parallelize a loop, place one macro call before the loop and one after it.
This wraps the enclosed lines of code in OpenACC or OpenMP parallelization directives, depending on the environment and compiler settings.
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| code | code | Required | Region of code where the GPU parallelizes loops |
| collapse | integer | None | Number of loops to combine into one loop |
| parallelism | string list | '[gang,vector]' | Parallelism granularity to use for this loop |
| default | string | 'present' | Implicit assumptions the compiler should make |
| private | string list | None | Variables that are private to each iteration/thread |
| firstprivate | string list | None | Initialized variables that are private to each iteration/thread |
| reduction | 2-level string list | None | Variables unique to each iteration and reduced at the end |
| reductionOp | string list | None | Operator that each reduction list is reduced with |
| copy | string list | None | Allocates and copies data to GPU on entrance, then deallocates and copies to CPU on exit |
| copyin | string list | None | Allocates and copies data to GPU on entrance, then deallocates on exit |
| copyinReadOnly | string list | None | Allocates and copies read-only data to GPU on entrance, then deallocates on exit |
| copyout | string list | None | Allocates data on GPU on entrance, then deallocates and copies to CPU on exit |
| create | string list | None | Allocates data on GPU on entrance, then deallocates on exit |
| no_create | string list | None | Uses data in CPU memory unless the data is already in GPU memory |
| present | string list | None | Data that must be present in GPU memory. Increments counter on entrance, decrements on exit |
| deviceptr | string list | None | Pointer variables that are already allocated in GPU memory |
| attach | string list | None | Attaches device pointer to device targets on entrance, then detaches on exit |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Parameter Restrictions
| name | Restricted range |
|---|---|
| collapse | Must be greater than 1 |
| parallelism | Valid elements: 'gang', 'worker', 'vector', 'seq' |
| default | 'present' or 'none' |
Additional information
With reduction='[[sum1, sum2], [largest]]' and reductionOp='[+, max]', sum1 and sum2 will each hold the sum of their per-iteration values, and largest will hold the maximum value of largest across all loop iterations.
Example
GPU_LOOP – (Execute loop on GPU)
Macro Invocation
Uses FYPP eval directive using $:
$:GPU_LOOP(...)
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| collapse | integer | None | Number of loops to combine into one loop |
| parallelism | string list | None | Parallelism granularity to use for this loop |
| data_dependency | string | None | 'independent' -> assert that loop iterations are independent; 'auto' -> let the compiler analyze dependencies |
| private | string list | None | Variables that are private to each iteration/thread |
| reduction | 2-level string list | None | Variables unique to each iteration and reduced at the end |
| reductionOp | string list | None | Operator that each reduction list is reduced with |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Parameter Restrictions
| name | Restricted range |
|---|---|
| collapse | Must be greater than 1 |
| parallelism | Valid elements: 'gang', 'worker', 'vector', 'seq' |
| data_dependency | 'auto' or 'independent' |
Additional information
Example parallelism value: '[seq]'
With reduction='[[sum1, sum2], [largest]]' and reductionOp='[+, max]', sum1 and sum2 will each hold the sum of their per-iteration values, and largest will hold the maximum value of largest across all loop iterations.
Example
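As a minimal sketch of how `$:GPU_LOOP(...)` might annotate nested loops inside a `GPU_PARALLEL` region, consider the fragment below. The include file name `macros.fpp`, the routine `s_scale`, and its arguments are assumptions made for illustration; only the macro names and keyword arguments are taken from the parameter tables.

```fortran
#! Assumption: 'macros.fpp' is the include file that defines the GPU_* macros.
#:include 'macros.fpp'

! Hypothetical routine: doubles every entry of a 2D array on the GPU.
subroutine s_scale(a, n, m)
    implicit none
    integer, intent(in) :: n, m
    real(kind(0d0)), intent(inout) :: a(n, m)
    integer :: i, j

    ! copy moves a to the GPU on entrance and back to the CPU on exit.
    #:call GPU_PARALLEL(copy='[a]')
        ! Distribute the outer loop over gangs.
        $:GPU_LOOP(parallelism='[gang]')
        do j = 1, m
            ! Distribute the inner loop over vector lanes.
            $:GPU_LOOP(parallelism='[vector]')
            do i = 1, n
                a(i, j) = 2d0*a(i, j)
            end do
        end do
    #:endcall GPU_PARALLEL
end subroutine s_scale
```

All arguments are passed as key-value pairs, in line with the note on argument ordering.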
GPU_PARALLEL – (Execute the following on the GPU in parallel)
Macro Invocation
Uses FYPP call directive using #:call
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| code | code | Required | Region of code where a kernel is launched on the GPU |
| default | string | 'present' | Implicit assumptions the compiler should make |
| private | string list | None | Variables that are private to each iteration/thread |
| firstprivate | string list | None | Initialized variables that are private to each iteration/thread |
| reduction | 2-level string list | None | Variables unique to each iteration and reduced at the end |
| reductionOp | string list | None | Operator that each reduction list is reduced with |
| copy | string list | None | Allocates and copies data to GPU on entrance, then deallocates and copies to CPU on exit |
| copyin | string list | None | Allocates and copies data to GPU on entrance, then deallocates on exit |
| copyinReadOnly | string list | None | Allocates and copies read-only data to GPU on entrance, then deallocates on exit |
| copyout | string list | None | Allocates data on GPU on entrance, then deallocates and copies to CPU on exit |
| create | string list | None | Allocates data on GPU on entrance, then deallocates on exit |
| no_create | string list | None | Uses data in CPU memory unless the data is already in GPU memory |
| present | string list | None | Data that must be present in GPU memory. Increments counter on entrance, decrements on exit |
| deviceptr | string list | None | Pointer variables that are already allocated in GPU memory |
| attach | string list | None | Attaches device pointer to device targets on entrance, then detaches on exit |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Parameter Restrictions
| name | Restricted range |
|---|---|
| default | 'present' or 'none' |
Additional information
With reduction='[[sum1, sum2], [largest]]' and reductionOp='[+, max]', sum1 and sum2 will each hold the sum of their per-iteration values, and largest will hold the maximum value of largest across all loop iterations.
Example
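The pairing of `reduction` lists with `reductionOp` entries can be sketched as follows. This is an illustrative fragment, not taken from the MFC sources: the include file name `macros.fpp`, the routine `s_sum_and_max`, and its variables are assumptions. The first reduction list, `[total]`, is paired with `+`, and the second, `[largest]`, with `max`.

```fortran
#! Assumption: 'macros.fpp' is the include file that defines the GPU_* macros.
#:include 'macros.fpp'

! Hypothetical routine: sum and maximum of x, computed on the GPU.
subroutine s_sum_and_max(x, n, total, largest)
    implicit none
    integer, intent(in) :: n
    real(kind(0d0)), intent(in) :: x(n)
    real(kind(0d0)), intent(out) :: total, largest
    integer :: i

    ! Initialize the reduction variables on the CPU before the region.
    total = 0d0
    largest = -huge(largest)

    #:call GPU_PARALLEL(copyin='[x]', reduction='[[total], [largest]]', reductionOp='[+, max]')
        $:GPU_LOOP(reduction='[[total], [largest]]', reductionOp='[+, max]')
        do i = 1, n
            total = total + x(i)
            largest = max(largest, x(i))
        end do
    #:endcall GPU_PARALLEL
end subroutine s_sum_and_max
```

The loop body is passed as the `code` argument through the `#:call` block, so `code` is never assigned by keyword.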
GPU_DATA – (Make data accessible on GPU in specified region)
Macro Invocation
Uses FYPP call directive using #:call
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| code | code | Required | Region of code where defined data is accessible |
| copy | string list | None | Allocates and copies variable to GPU on entrance, then deallocates and copies to CPU on exit |
| copyin | string list | None | Allocates and copies data to GPU on entrance, then deallocates on exit |
| copyinReadOnly | string list | None | Allocates and copies a read-only variable to GPU on entrance, then deallocates on exit |
| copyout | string list | None | Allocates data on GPU on entrance, then deallocates and copies to CPU on exit |
| create | string list | None | Allocates data on GPU on entrance, then deallocates on exit |
| no_create | string list | None | Uses data in CPU memory unless the data is already in GPU memory |
| present | string list | None | Data that must be present in GPU memory. Increments counter on entrance, decrements on exit |
| deviceptr | string list | None | Pointer variables that are already allocated in GPU memory |
| attach | string list | None | Attaches device pointer to device targets on entrance, then detaches on exit |
| default | string | None | Implicit assumptions the compiler should make |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Parameter Restrictions
| name | Restricted range |
|---|---|
| code | Do not assign it manually with key-value pairing |
Example
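A data region is most useful when several kernels share the same arrays. The sketch below keeps `x` and `y` on the GPU across two `GPU_PARALLEL` regions. The include file name `macros.fpp` and the routine `s_accumulate` are assumptions made for illustration.

```fortran
#! Assumption: 'macros.fpp' is the include file that defines the GPU_* macros.
#:include 'macros.fpp'

! Hypothetical routine: two kernels that reuse the same GPU-resident arrays.
subroutine s_accumulate(x, y, n)
    implicit none
    integer, intent(in) :: n
    real(kind(0d0)), intent(in) :: x(n)
    real(kind(0d0)), intent(inout) :: y(n)
    integer :: i

    ! x is copied in once; y is copied in on entrance and back out on exit.
    #:call GPU_DATA(copyin='[x]', copy='[y]')
        #:call GPU_PARALLEL(default='present')
            $:GPU_LOOP(parallelism='[gang, vector]')
            do i = 1, n
                y(i) = y(i) + 2d0*x(i)
            end do
        #:endcall GPU_PARALLEL

        #:call GPU_PARALLEL(default='present')
            $:GPU_LOOP(parallelism='[gang, vector]')
            do i = 1, n
                y(i) = y(i) + x(i)
            end do
        #:endcall GPU_PARALLEL
    #:endcall GPU_DATA
end subroutine s_accumulate
```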
GPU_ENTER_DATA – (Allocate/move data to GPU until matching GPU_EXIT_DATA or program termination)
Macro Invocation
Uses FYPP eval directive using $:
$:GPU_ENTER_DATA(...)
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| copyin | string list | None | Allocates and copies data to GPU on entrance |
| copyinReadOnly | string list | None | Allocates and copies a read-only variable to GPU on entrance |
| create | string list | None | Allocates data on GPU on entrance |
| attach | string list | None | Attaches device pointer to device targets on entrance |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Example
GPU_EXIT_DATA – (Deallocate/move data from GPU created by GPU_ENTER_DATA)
Macro Invocation
Uses FYPP eval directive using $:
$:GPU_EXIT_DATA(...)
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| copyout | string list | None | Deallocates and copies data from GPU to CPU on exit |
| delete | string list | None | Deallocates data on GPU on exit |
| detach | string list | None | Detaches device pointer from device targets on exit |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Example
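`GPU_ENTER_DATA` and `GPU_EXIT_DATA` are naturally used as a pair around the lifetime of an allocatable array. The following is a sketch under stated assumptions: the include file name `macros.fpp`, the module `m_workspace`, and its routines are hypothetical.

```fortran
#! Assumption: 'macros.fpp' is the include file that defines the GPU_* macros.
#:include 'macros.fpp'

! Hypothetical module: a work buffer that lives on both CPU and GPU.
module m_workspace
    implicit none
    real(kind(0d0)), allocatable :: buf(:)

contains

    subroutine s_initialize_workspace(n)
        integer, intent(in) :: n
        allocate (buf(n))
        ! Allocate the matching GPU copy; it persists after this routine returns.
        $:GPU_ENTER_DATA(create='[buf]')
    end subroutine s_initialize_workspace

    subroutine s_finalize_workspace()
        ! Release the GPU copy before freeing the CPU allocation.
        $:GPU_EXIT_DATA(delete='[buf]')
        deallocate (buf)
    end subroutine s_finalize_workspace

end module m_workspace
```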
GPU_DECLARE – (Allocate module variables on GPU or for implicit data region)
Macro Invocation
Uses FYPP eval directive using $:
$:GPU_DECLARE(...)
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| copy | string list | None | Allocates and copies data to GPU on entrance, then deallocates and copies to CPU on exit |
| copyin | string list | None | Allocates and copies data to GPU on entrance, then deallocates on exit |
| copyinReadOnly | string list | None | Allocates and copies a read-only variable to GPU on entrance, then deallocates on exit |
| copyout | string list | None | Allocates data on GPU on entrance, then deallocates and copies to CPU on exit |
| create | string list | None | Allocates data on GPU on entrance, then deallocates on exit |
| present | string list | None | Data that must be present in GPU memory. Increments counter on entrance, decrements on exit |
| deviceptr | string list | None | Pointer variables that are already allocated in GPU memory |
| link | string list | None | Declares a global link and allocates only when the variable is used in a data clause |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Additional information
Example
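A module-level variable might be declared on the GPU roughly as in this sketch. The include file name `macros.fpp`, the module `m_constants`, and the variable `gamma_gas` are assumptions for illustration. The `GPU_UPDATE` call is included because the GPU copy is not refreshed when the CPU value changes.

```fortran
#! Assumption: 'macros.fpp' is the include file that defines the GPU_* macros.
#:include 'macros.fpp'

! Hypothetical module: a scalar constant that kernels read on the GPU.
module m_constants
    implicit none
    real(kind(0d0)) :: gamma_gas
    ! Create a GPU copy of the module variable for the lifetime of the program.
    $:GPU_DECLARE(create='[gamma_gas]')

contains

    subroutine s_set_gamma(val)
        real(kind(0d0)), intent(in) :: val
        gamma_gas = val
        ! Push the new CPU value to the GPU copy.
        $:GPU_UPDATE(device='[gamma_gas]')
    end subroutine s_set_gamma

end module m_constants
```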
GPU_UPDATE – (Updates data from CPU to GPU or GPU to CPU)
Macro Invocation
Uses FYPP eval directive using $:
$:GPU_UPDATE(...)
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| host | string list | None | Updates data from GPU to CPU |
| device | string list | None | Updates data from CPU to GPU |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Example
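The two directions of `GPU_UPDATE` can be sketched as below: `device` before a kernel that reads CPU-modified data, and `host` after a kernel whose result the CPU needs. The include file name `macros.fpp` and the routine `s_refresh` are assumptions, and the sketch presumes `q` was already placed on the GPU elsewhere (for example with `GPU_ENTER_DATA`).

```fortran
#! Assumption: 'macros.fpp' is the include file that defines the GPU_* macros.
#:include 'macros.fpp'

! Hypothetical routine: q is assumed to be present on the GPU already.
subroutine s_refresh(q, n)
    implicit none
    integer, intent(in) :: n
    real(kind(0d0)), intent(inout) :: q(n)
    integer :: i

    ! Modify q on the CPU, then copy the new values to the GPU.
    q = 1d0
    $:GPU_UPDATE(device='[q]')

    #:call GPU_PARALLEL(default='present')
        $:GPU_LOOP(parallelism='[gang, vector]')
        do i = 1, n
            q(i) = q(i) + 1d0
        end do
    #:endcall GPU_PARALLEL

    ! Copy the GPU result back so the CPU sees the updated values.
    $:GPU_UPDATE(host='[q]')
end subroutine s_refresh
```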
GPU_HOST_DATA – (Make GPU memory address available on CPU)
Macro Invocation
Uses FYPP call directive using #:call
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| code | code | Required | Region of code where GPU memory addresses are accessible |
| use_device_addr | string list | None | Use GPU memory address of variable instead of CPU memory address |
| use_device_ptr | string list | None | For pointers, use the GPU pointer instead of the CPU pointer |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Parameter Restrictions
| name | Restricted range |
|---|---|
| code | Do not assign it manually with key-value pairing |
Example
GPU_WAIT – (Makes CPU wait for async GPU activities)
Macro Invocation
Uses FYPP eval directive using $:
$:GPU_WAIT(...)
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Example
GPU_ATOMIC – (Do an atomic operation on the GPU)
Macro Invocation
Uses FYPP eval directive using $:
$:GPU_ATOMIC(...)
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| atomic | string | Required | Which atomic operation is performed |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Parameter Restrictions
| name | Restricted range |
|---|---|
| atomic | 'read', 'write', 'update', or 'capture' |
Additional information
read: v=x
write: x=square(tmp)
update: x= x .and. 1
capture: a pair of read/write/update operations with one dependent on the other
Example
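An atomic update is typically needed when several loop iterations may write to the same memory location, as in a histogram. The sketch below illustrates this. The include file name `macros.fpp`, the routine `s_histogram`, and its arguments are assumptions for illustration.

```fortran
#! Assumption: 'macros.fpp' is the include file that defines the GPU_* macros.
#:include 'macros.fpp'

! Hypothetical routine: counts how many entries fall into each bin.
! Every bins(i) is assumed to lie in 1..nbins.
subroutine s_histogram(bins, n, hist, nbins)
    implicit none
    integer, intent(in) :: n, nbins
    integer, intent(in) :: bins(n)
    integer, intent(inout) :: hist(nbins)
    integer :: i

    #:call GPU_PARALLEL(copyin='[bins]', copy='[hist]')
        $:GPU_LOOP(parallelism='[gang, vector]')
        do i = 1, n
            ! Different iterations may hit the same bin, so the
            ! increment on the next line must be atomic.
            $:GPU_ATOMIC(atomic='update')
            hist(bins(i)) = hist(bins(i)) + 1
        end do
    #:endcall GPU_PARALLEL
end subroutine s_histogram
```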
GPU_ROUTINE – (Compile a procedure for the GPU)
Macro Invocation
Uses FYPP eval directive using $:
$:GPU_ROUTINE(...)
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| function_name | string | None | Name of subroutine/function |
| parallelism | string list | None | Parallelism granularity to use for this routine |
| nohost | boolean | False | Do not compile procedure code for CPU |
| cray_inline | boolean | False | Inline procedure on the Cray compiler |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
| extraOmpArgs | string | None | String of any extra arguments added to the OpenMP directive |
Parameter Restrictions
| name | Restricted range |
|---|---|
| parallelism | Valid elements: 'gang', 'worker', 'vector', 'seq' |
Additional information
Example parallelism value: '[seq]'
Example
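A small helper that is called from inside GPU loops might be marked for GPU compilation as in this sketch. The include file name `macros.fpp`, the module `m_helpers`, and the function `f_square` are assumptions for illustration. The macro is placed in the specification part of the procedure it applies to.

```fortran
#! Assumption: 'macros.fpp' is the include file that defines the GPU_* macros.
#:include 'macros.fpp'

! Hypothetical module: a scalar helper callable from GPU kernels.
module m_helpers
    implicit none

contains

    pure function f_square(x) result(y)
        ! Compile this function for the GPU; each call runs sequentially.
        $:GPU_ROUTINE(parallelism='[seq]')
        real(kind(0d0)), intent(in) :: x
        real(kind(0d0)) :: y

        y = x*x
    end function f_square

end module m_helpers
```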
GPU_CACHE – (Data to be cached in the software-managed cache)
Macro Invocation
Uses FYPP eval directive using $:
$:GPU_CACHE(...)
Parameters
| name | data type | Default Value | description |
|---|---|---|---|
| cache | string list | Required | Data that should be stored in the cache |
| extraAccArgs | string | None | String of any extra arguments added to the OpenACC directive |
Note: GPU_CACHE currently has no effect for OpenMP.
Example
… true or verbose
verbose will also print out vendor-specific internal control values and environment variables
… DISABLED) or make it abort if a GPU isn't found (MANDATORY)
… ACC: ...) for every allocation, data transfer, kernel launch, wait, etc. Great first stop when "nothing seems to run on the GPU".
… CRAY_ACC_DEBUG_FILE.
… stderr, stdout, and process.
process automatically generates a new file based on pid (each MPI process will have a different file)
… acc_present_dump() to output variable names and file locations in addition to variable mappings
… acc_present_dump() around hotspots to help find problems with data movements
… CRAY_ACC_DEBUG environment variable
… RANDOM_SEED to be constant, so it generates the same sequence of random numbers
[no]debug: Enables/disables just-in-time debugging (debugging invoked on error)
[no]trace: Enables/disables stack traceback on error
… NVCOMPILER_ACC_DEBUG
… LIBOMPTARGET_PROFILE_GRANULARITY (default 500).
… INFO
… -DOMPTARGET_DEBUG.
… -O{0,1,2,3} command line argument passed to clang.
… LIBOMPTARGET_JIT_OPT_LEVEL flag.