OpenACC Use

OpenACC is a parallel programming model that facilitates the use of an accelerator device attached to a host CPU. The OpenACC API allows the programmer to supplement the information available to the compiler in order to offload code from the host CPU to the attached accelerator device.

This release supports the OpenACC Application Programming Interface, Version 2.0, a standard developed by PGI, Cray Inc., and NVIDIA, with support from CAPS entreprise.

Refer to the OpenACC home page at http://www.openacc-standard.org. Under the Downloads link, select the OpenACC 2.0 Specification.

For the most current information regarding the Cray implementation of OpenACC, see the intro_openacc(7) man page. See the OpenACC.EXAMPLES(7) man page for example OpenACC codes.

OpenACC Execution Model

The CPU host offloads compute intensive regions to the accelerator device. The accelerator executes parallel regions, which contain work sharing loops executed as kernels on the accelerator. The CPU host manages execution on the accelerator by allocating memory on the accelerator, initiating data transfer, sending code, passing arguments to the region, waiting for completion, transferring accelerator results back to the CPU host and releasing memory.

The accelerator on the Cray system supports multiple levels of parallelism. The accelerator executes a kernel composed of parallel threads or vectors. Vectors (threads) are grouped into sets called workers. Threads in a set of workers are scheduled together and execute together. Workers are grouped into larger sets called gangs. One or more gangs may comprise a kernel. To summarize, a kernel is executed as a set of gangs of workers of vectors.

The compiler determines the number of gangs, workers, and vectors based on the problem and maps them onto the accelerator architecture. Specifying the number of gangs, workers, or vectors is optional but may permit tuning for a particular target architecture. How the compiler maps a particular problem onto a configuration of gangs, workers, and vectors, and then onto the accelerator hardware, is implementation defined.

OpenACC terminology is situated in the context of the PGAS programming model. In the PGAS model, there may be one or more Processing Elements (PEs) per XK node. Each PE is multi-threaded and each thread can execute vector instructions. The PGAS thread concept is not the same as the OpenACC thread concept.

OpenACC Memory Model

The memory on the accelerator is separate from host memory. Accelerator device memory is not mapped onto the host's virtual memory space, and all data movement between host and accelerator memory is initiated by the host through library functions that move data. The model does not assume that the accelerator can access host memory, although some devices support it. Data movement between the memories is managed by the compiler according to OpenACC directives. To accelerate a region of code effectively, the programmer needs to be aware of device memory size as well as the memory bandwidth between host and device.

Current accelerators implement a weak memory model; they do not support memory coherence between operations executed by different execution units - an execution unit is a hardware abstraction which can execute one or more gangs. If an operation updates a memory location and another reads from the same location, or two operations store a value to the same location, the hardware may not guarantee repeatable results. Some potential errors of this type are prevented by the compiler, but it is possible to write an accelerator parallel region that produces inconsistent results. Memory coherence is guaranteed when memory operations referencing the same location are separated by an explicit barrier.

Map the OpenACC Programming Model onto Accelerator Components

The compiler maps the OpenACC execution model (kernels, gangs, workers, vectors) onto the accelerator architecture as described in the following sections.

On the Cray XK system, there is one accelerator per node. The accelerator architecture is composed of two main components - global memory and some number of streaming multiprocessors (SM). Each SM contains multiple scalar processor (SP) cores, schedulers, special-function units, and memory that is shared among all the SP cores. An SP core contains floating point, integer, logic, branching, and move and compare units. Each thread/vector is executed by a core. The SM manages thread execution.

The OpenACC execution model maps to the NVIDIA GPU hardware as follows (GPU terms are in parentheses). One or more OpenACC kernels may execute on a GPU. The compiler divides a kernel into one or more gangs (blocks) of vectors (threads). Several concurrent gangs (blocks) may execute on one SM, depending on several factors, including memory requirements, compiler optimizations, and user directives. A single gang (block) does not span SMs and remains on one SM until completion. Each gang (block) is further divided into workers (warps), which are groups of threads that execute in parallel; scheduling occurs at the granularity of the worker (warp). The threads within a warp start together and execute one common instruction at a time. If conditional branching occurs within a worker (warp), the warp serially executes each branch path taken, forcing some threads to wait until the threads converge back to the same instruction; data-dependent conditional code within a warp therefore usually hurts performance. The threads of a worker (warp) also fetch data from memory together: when accessing global memory, the accesses of the threads within a warp are grouped to minimize transactions. Each thread in a worker (warp) executes on a different SP core.

There may be up to 32 threads in a worker (warp) - a limit defined by the hardware.

See the intro_openacc(7) man page for more detail on Partition Mapping.

There is a hierarchy of memory spaces used by OpenACC threads. Each thread has its own private local memory. Each gang of workers of threads has shared memory visible to all threads of the gang. All OpenACC threads running on a GPU have access to the same global memory. Global memory on the accelerator is accessible to the host CPU.

Mixed Model Support

OpenMP directives may appear inside of OpenACC data or host data regions only. OpenMP directives are not allowed inside of any other OpenACC directives.

OpenACC directives may not appear inside OpenMP directives. To nest OpenACC directives inside OpenMP constructs, place them in called routines that are not inlined.

Compile with OpenACC

The CCE compiler recognizes OpenACC directives by default. Use either the ftn or cc command to compile.

The CCE compiler does not produce CUDA code. It generates PTX (Parallel Thread Execution) instructions which are then translated into assembly.

Note the following interactions between directives and command line options.
  • -h [no]acc

    -h noacc disables OpenACC directives.

  • -h [no]pragma

    See -h [no]pragma=name[:name ...].

  • -h acc_model=option [:option ...]

    Explicitly controls the execution and memory model utilized by the accelerator support system. The option arguments identify the type of behavior desired. There are three option sets. Only one member of a set may be used at a time; however, all three sets may be used together.

    Default: auto_async_kernel:fast_addr:no_deep_copy

    Option set 1:
    auto_async_none
    Execute kernels and updates synchronously, unless an async clause is present on the kernels or update directive.
    auto_async_kernel
    (Default) Execute all kernels asynchronously, ensuring program order is maintained.
    auto_async_all
    Execute all kernels and data transfers asynchronously, ensuring program order is maintained.

    Option set 2:
    no_fast_addr
    Use default types for addressing.
    fast_addr
    (Default) Attempt to use 32-bit integers in all addressing to improve performance. Base addresses remain 64-bit. Performance improves because offset calculations can potentially use fewer registers and faster arithmetic. This optimization may result in incorrect behavior for codes that use, within accelerator regions, any of the following: very large arrays (offsets would require more than 32 bits), very large array lower bounds (max offset plus lower bound greater than 32 bits), or bitfields/other bit operations.

    Option set 3:
    no_deep_copy
    (Default) Do not look inside an object type to transfer sub-objects. Allocatable members of derived type objects will not be allocated on the device.
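As an illustration, a hypothetical compile line that overrides only the first option set (the unspecified sets keep their defaults):

```shell
# Request fully asynchronous kernels and data transfers;
# fast_addr and no_deep_copy remain in effect by default.
cc -h acc_model=auto_async_all -o app app.c
```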

Module Support

To compile, ensure that the PrgEnv-cray module is loaded and that it includes CCE 8.5 or later. Then load either the craype-accel-nvidia20 module for Fermi support or the craype-accel-nvidia35 module for Kepler support.
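As a sketch, a typical build might look like the following; the application file names are illustrative and exact module versions are site-specific:

```shell
# Load the Cray programming environment (CCE 8.5 or later).
module load PrgEnv-cray
# Select accelerator support: nvidia20 (Fermi) or nvidia35 (Kepler).
module load craype-accel-nvidia35

# Compile; CCE recognizes OpenACC directives by default.
cc  -o app app.c     # C source
ftn -o app app.f90   # Fortran source
```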

The craype-accel-host module supports compiling and running an OpenACC application on the host X86 processor. This provides source code portability between systems with and without an accelerator. The accelerator directives are automatically converted at compile time to OpenMP equivalent directives.

Use either the ftn or cc command to compile.

Debug

Use either Allinea DDT or Rogue Wave TotalView.

The following applies to all debuggers:
  • To enable debugging, compile with the -g option.
  • When compiling with the debug option (-g), CCE may require more memory from the accelerator heap than the 8MB default, resulting in malloc failures during compilation. The environment variable CRAY_ACC_MALLOC_HEAPSIZE specifies the accelerator heap size in bytes; it may be necessary to increase it to 32MB (33554432), 64MB (67108864), or greater.
  • Debug one rank/image/thread/PE per node.
  • CCE does not generate CUDA code, but generates PTX code. Debuggers will not display CUDA intermediate code.
  • To enter an OpenACC region using a debugger, breakpoints may be set inside the OpenACC region. It is not possible to do a single step into the region from the code immediately prior to the start of an OpenACC directive.

OpenACC Directives

For information on the OpenACC directives, see the OpenACC 2.0 Specification available at http://www.openacc-standard.org.

For the most current information regarding the Cray implementation of OpenACC, see the intro_openacc(7) man page. See the OpenACC.EXAMPLES(7) man page for example OpenACC codes.

Runtime Routines

Runtime routines defined by the standard specification are supported unless otherwise noted in the intro_openacc(7) man page.

Cray Specific Runtime Library Routines

The following routines are currently Cray specific. These interfaces are subject to change and their usage may result in non-portable code:
  • void cray_acc_memcpy_to_host_async( void* host_destination, const void* device_source, size_t size, int async_id );

    Asynchronously copies size bytes from the accelerator source address to the host destination address. See the async clause for an explanation of async_id.

  • void cray_acc_memcpy_to_device_async( void* device_destination, const void* host_source, size_t size, int async_id );

    Asynchronously copies size bytes from the host source address to the accelerator destination address. See the async clause for an explanation of async_id.

  • bool cray_acc_get_async_info( int async_id, void* async_info );

    Returns true if the async_id was found to have any architecture-specific async information available. The user is responsible for ensuring that the async_info pointer points to an async structure from the underlying architecture; for an NVIDIA target this is a CUDA stream (CUstream).

CRAY_ACC_DEBUG Output Routines

When the runtime environment variable CRAY_ACC_DEBUG is set to 1, 2, or 3, CCE writes runtime commentary of accelerator activity to STDERR for debugging purposes; every accelerator action on every PE generates output prefixed with "ACC:". This may produce a large volume of output and it may be difficult to associate messages with certain routines and/or certain PEs.

With this set of API calls, the programmer can enable or disable output at certain points in the code, and modify the string that is used as the debug message prefix.

The cray_acc_set_debug_*_prefix( char * ) routines define the string used as the prefix, with the default being "ACC:". The cray_acc_get_debug_*_prefix( void ) routines are provided so that a previous setting can be restored.

Output from the library is printed with a format string starting with "ACC: %s %s", where the global prefix is printed for the first %s (if not NULL), and the thread prefix is printed for the second %s. The global prefix is shared by all host threads in the application, and the thread prefix is set per-thread. By default, strings used in the %s fields are empty.

The C interface is provided by omp.h:
  • char *cray_acc_get_debug_global_prefix( void )
  • void cray_acc_set_debug_global_prefix( char * )
  • char *cray_acc_get_debug_thread_prefix( void )
  • void cray_acc_set_debug_thread_prefix( char * )
To enable debug output, set level from 1 to 3, with 3 being the most verbose. Setting a level less than or equal to 0 disables the debug output. The get version is provided so the previous setting can be restored. The thread level is an optional override of the global level.
  • int cray_acc_get_debug_global_level( void )
  • void cray_acc_set_debug_global_level( int level )
  • int cray_acc_get_debug_thread_level( void )
  • void cray_acc_set_debug_thread_level( int level )

Environment Variables

The following environment variables are defined by the API specification:
  • ACC_DEVICE_NUM
  • ACC_DEVICE_TYPE
The following environment variable is Cray specific:
  • CRAY_ACC_MALLOC_HEAPSIZE

    Specifies the accelerator heap size in bytes; the default is 8MB. When compiling with the debug option (-g), CCE may require more memory from the accelerator heap than the default, resulting in malloc failures during compilation. It may be necessary to increase the heap size to 32MB (33554432), 64MB (67108864), or greater.

  • CRAY_ACC_DEBUG

    When set to 1, 2, or 3 (most verbose), writes runtime commentary of accelerator activity to STDERR for debugging purposes. There is also an API which allows the programmer to enable/disable debug output and set the output message prefix from within the application.
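For example, before a run one might set (values illustrative):

```shell
# Raise the accelerator heap to 64MB and enable the most verbose tracing.
export CRAY_ACC_MALLOC_HEAPSIZE=67108864
export CRAY_ACC_DEBUG=3
```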

OpenACC Examples

See the OpenACC.EXAMPLES(7) man page for example OpenACC codes.