OpenMP Overview

The OpenMP API provides a parallel programming model that is portable across shared memory architectures from Cray and other vendors. The OpenMP specification is accessible at http://openmp.org/wp/openmp-specifications/.

Supported Version

CCE supports the OpenMP API, Version 4.0, with the following exceptions; the most up-to-date list of exceptions appears on the man pages:
  • Task switching is not implemented. The thread that starts executing a task is the thread that finishes the task.
  • Support for OpenMP Random Access Iterators (RAIs) in the C++ Standard Template Library (STL) is deferred.
  • The task depend clause is supported, but tasks with dependences are serialized.
  • Cancellation does not destruct/deallocate implicitly private local variables. It correctly handles explicitly private variables.
  • simd functions will not vectorize if the function definition is not visible for inlining at the callsite.
  • The linear clause is not supported on combined or compound constructs.
  • The device clause is not supported. The other mechanisms for selecting a default device are supported: OMP_DEFAULT_DEVICE and omp_set_default_device.
  • The only API calls allowed in target regions are: omp_is_initial_device, omp_get_thread_num, omp_get_num_threads, omp_get_team_num, and omp_get_num_teams.
  • Parallel constructs are supported in target regions, but they are limited to one thread.
  • User-defined reductions are not supported in target regions.
CCE also supports a subset of the features and modifications introduced in OpenMP Version 4.5 (see the example after this list), including:
  • The nowait and depend clauses on the target and target update constructs
  • The private and firstprivate clause on the target construct
  • The target enter data and target exit data constructs
  • The is_device_ptr clause on the target construct
  • The use_device_ptr clause on the target data construct
  • The to and link clauses on the declare target directive
  • The always map type modifier on the map clause
  • The lastprivate clause on the distribute construct
  • The directive name modifier on if clauses
  • The device memory routines, including omp_target_alloc, omp_target_free, omp_target_is_present, omp_target_memcpy, omp_target_memcpy_rect, omp_target_associate_ptr, and omp_target_disassociate_ptr
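
The following sketch combines several of these 4.5 features; the function and its arguments are hypothetical, and behavior in the target region is subject to the exceptions listed above.

    /* A sketch using the target enter/exit data constructs and the
     * map clause; 'scale' and its arguments are hypothetical. */
    void scale(double *a, int n, double s)
    {
        #pragma omp target enter data map(to: a[0:n])    /* copy a to the device */
        #pragma omp target
        {
            for (int i = 0; i < n; i++)
                a[i] *= s;
        }
        #pragma omp target exit data map(from: a[0:n])   /* copy a back to the host */
    }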

Compiling

By default, the CCE compiler recognizes OpenMP directives. These CCE options affect OpenMP applications (see the example after this list):
  • -h [no]omp
  • -h threadn
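
For example, a minimal OpenMP program and the corresponding compile lines (a sketch, assuming the CCE C compiler driver cc):

    /* hello_omp.c */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }

    cc -o hello hello_omp.c           # OpenMP directives recognized by default
    cc -hnoomp -o hello hello_omp.c   # OpenMP directives ignored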

Executing

For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and the aprun -d depth option to specify the number of CPUs hosting the threads. The number of threads specified by OMP_NUM_THREADS should not exceed the number of cores in the CPU. If neither the OMP_NUM_THREADS environment variable nor the omp_set_num_threads call is used to set the number of OpenMP threads, the system defaults to 1 thread. For further information, including example OpenMP programs, see the Cray Application Developer's Environment User's Guide.
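
For example, to run the program above with eight threads hosted on eight CPUs (a sketch; -n 1 requests a single processing element):

    export OMP_NUM_THREADS=8
    aprun -n 1 -d 8 ./hello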

Debugging

The -g option provides debugging support for OpenMP directives and is identical to specifying the -G0 option. This debugging level implies -homp, meaning that most optimizations are disabled but OpenMP directives are still recognized, and -h fp0. To debug without OpenMP, use -g -xomp or -g -hnoomp, which disable OpenMP and enable debugging.
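
For example (a sketch; the file names are hypothetical):

    cc -g -o app app.c            # debug; OpenMP directives still recognized
    cc -g -hnoomp -o app app.c    # debug with OpenMP disabled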

OpenMP Implementation Defined Behavior

The OpenMP Application Program Interface specification presents a list of implementation-defined behaviors. The Cray implementation is described in the following sections.

When multiple threads access the same shared memory location and at least one access is a write, the accesses should be ordered by explicit synchronization to avoid data races and the potential for non-deterministic results. Always use explicit synchronization for any access smaller than one byte.
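
For example, a minimal sketch in which an atomic directive supplies the required synchronization:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        long sum = 0;
        #pragma omp parallel for
        for (long i = 1; i <= 1000; i++) {
            #pragma omp atomic      /* orders concurrent updates to sum */
            sum += i;
        }
        printf("sum = %ld\n", sum); /* deterministically 500500 */
        return 0;
    }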

Table 1. Initial Values of OpenMP ICVs

  ICV                     Initial Value   Note
  nthreads-var            1
  dyn-var                 TRUE            Behaves according to Algorithm 2-1 of the specification.
  run-sched-var           static, 0
  stacksize-var           128 MB
  wait-policy-var         ACTIVE
  thread-limit-var        64              Threads may be dynamically created up to an upper limit of 4 times the number of cores per node; it is up to the programmer to limit oversubscription.
  max-active-levels-var   1023
  def-sched-var           static, 0       The chunk size is rounded up to improve alignment for vectorized loops.

Dynamic Adjustment of Threads

The ICV dyn-var is enabled by default. Threads may be dynamically created up to an upper limit of 4 times the number of cores per node; it is up to the programmer to limit oversubscription.

If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the runtime system can supply, the program terminates. The number of physical processors actually hosting the threads at any given time is fixed at program startup and is specified by the aprun -d depth option. The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
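
For example, a sketch of enabling nested parallelism (two threads are requested explicitly in each team):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);               /* or: export OMP_NESTED=true */
        #pragma omp parallel num_threads(2)
        {
            #pragma omp parallel num_threads(2)
            {
                #pragma omp critical
                printf("outer thread %d, inner thread %d\n",
                       omp_get_ancestor_thread_num(1),
                       omp_get_thread_num());
            }
        }
        return 0;
    }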

Tasks

There are no untied tasks in this implementation of OpenMP. There are also no implementation-defined task scheduling points.

Directives and Clauses

  • atomic directive
    • When supported by the target architecture, atomic directives are lowered into hardware atomic instructions. Otherwise, atomicity is guaranteed with a lock. OpenMP atomic directives are compatible with C11 and C++11 atomic operations, as well as GNU atomic builtins.
  • for directive
    • For the schedule(guided,chunk) clause, the size of the initial chunk for the master thread and other team members is approximately equal to the trip count divided by the number of threads.
    • For the schedule(runtime) clause, the schedule type and chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the schedule type and chunk size default to static and 0, respectively (see the example following this list).
    • In the absence of the schedule clause, the default schedule is static and the default chunk size is approximately the number of iterations divided by the number of threads.
  • parallel directive
    • If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the runtime system can supply, the program terminates.
    • The number of physical processors actually hosting the threads at any given time is fixed at program startup and is specified by the aprun -d depth option.
    • The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
  • loop directive
    • The integer type or kind used to compute the iteration count of a collapsed loop is a signed 64-bit integer, regardless of how the original induction variables and loop bounds are defined. If a runtime schedule is specified and run-sched-var is auto, the Cray implementation generates a static schedule.
  • private clause
    • If a variable is declared as private, the variable is referenced in the definition of a statement function, and the statement function is used within the lexical extent of the directive construct, then the statement function references the private version of the variable.
  • sections construct
    • Multiple structured blocks within a single sections construct are scheduled in lexical order, and an individual block is assigned to the first thread that reaches it. A different thread may execute each section block, or a single thread may execute multiple section blocks. There is no guaranteed order of execution of structured blocks within a sections construct.
  • single directive
    • A single block is assigned to the first thread in the team to reach the block; this thread may or may not be the master thread.
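
The following sketch illustrates the schedule(runtime) behavior described under the for directive; the work function and array are hypothetical.

    #include <omp.h>

    static double work(double x) { return x * x; }   /* hypothetical workload */

    void process(double *a, int n)
    {
        /* schedule(runtime): the schedule type and chunk size are read from
         * OMP_SCHEDULE at run time; if it is unset, they default to static
         * and 0, as described above. */
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < n; i++)
            a[i] = work(a[i]);
    }

Setting, for example, OMP_SCHEDULE="guided,4" before launch then selects a guided schedule with chunk size 4 without recompiling.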

Library Routines

  • omp_set_num_threads
    • Sets nthreads-var to a positive integer. If the argument is less than 1, nthreads-var is set to 1 (see the example following this list).
  • omp_set_schedule
    • Sets the schedule type as defined by the current specification. There are no implementation defined schedule types.
  • omp_set_max_active_levels
    • Sets the max-active-levels-var ICV. Defaults to 1023. If the argument is less than 1, the ICV is set to 1.
  • omp_set_dynamic()
    • The omp_set_dynamic() routine enables or disables dynamic adjustment of the number of threads available for the execution of subsequent parallel regions by setting the value of the dyn-var ICV. The default is on.
  • omp_set_nested()
    • The omp_set_nested() routine enables or disables nested parallelism, by setting the nest-var internal control variable (ICV). The default is false.
  • omp_get_max_active_levels
    • There is a single max-active-levels-var ICV for the entire runtime system. Thus, a call to omp_get_max_active_levels will bind to all threads, regardless of which thread calls it.
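
For example, a minimal sketch using the thread-count routines described above:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_dynamic(0);       /* disable dynamic adjustment (dyn-var) */
        omp_set_num_threads(4);   /* sets nthreads-var; arguments < 1 become 1 */
        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }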

Environment Variables

OMP_SCHEDULE
The default values for this environment variable are static for type and 0 for chunk. For the schedule(guided,chunk) clause, the size of the initial chunk for the master thread and other team members is approximately equal to the trip count divided by the number of threads. For the schedule(runtime) clause, the schedule type and chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the schedule type and chunk size default to static and 0, respectively. In the absence of the schedule clause, the default schedule is static and the default chunk size is approximately the number of iterations divided by the number of threads.
OMP_NUM_THREADS
If this environment variable is not set and the omp_set_num_threads() routine is not used to set the number of OpenMP threads, the default is 1 thread. The maximum number of threads per compute node is 4 times the number of allocated processors. If the requested value of OMP_NUM_THREADS exceeds the number of threads the implementation can support, the behavior of the program depends on the value of the OMP_DYNAMIC environment variable: if OMP_DYNAMIC is false, the program terminates; if OMP_DYNAMIC is true, the program can use up to 4 times the number of allocated processors. For example, on an 8-core Cray XE system, the program can use up to 32 threads per compute node.
OMP_DYNAMIC
The default value is true.
OMP_NESTED
The default value is false.
OMP_STACKSIZE
The default value for this environment variable is 128 MB.
OMP_WAIT_POLICY
Provides a hint to an OpenMP implementation about the desired behavior of waiting threads by setting the wait-policy-var ICV. A compliant OpenMP implementation may or may not abide by the setting of the environment variable. The default value for this environment variable is active.
OMP_MAX_ACTIVE_LEVELS
The default value is 1023.
OMP_THREAD_LIMIT
Sets the number of OpenMP threads to use for the entire OpenMP program by setting the thread-limit-var ICV. The Cray implementation defaults to 4 times the number of available processors.

Cray-specific OpenMP API

This section describes the OpenMP API extensions specific to Cray.

void cray_omp_set_wait_policy( const char *policy );
This routine allows dynamic modification of the wait-policy-var ICV value, which corresponds to the OMP_WAIT_POLICY environment variable. The policy argument provides a hint to the OpenMP runtime library environment about the desired behavior of waiting threads; acceptable values are ACTIVE or PASSIVE (case insensitive). It is an error to call this routine in an active parallel region. The OpenMP runtime library supports a "wait policy" and a "contention policy," both of which can be set with the following environment variables:
OMP_WAIT_POLICY=(ACTIVE|PASSIVE)
CRAY_OMP_CONTENTION_POLICY=(Automatic|Standard|MonitorMwait)
These environment variables allow the policies to be set once at program launch for the entire execution. However, in some circumstances it would be useful for the programmer to explicitly change the policy at various points during a program's execution. This Cray-specific routine allows the programmer to dynamically change the wait policy (and potentially the contention policy). This addresses the situation when an application needs OpenMP for the first part of program execution, but there is a clear point after which OpenMP is no longer used. Unfortunately, the idle OpenMP threads still consume resources since they are waiting for more work, resulting in performance degradation for the remainder of the application. A passive-waiting policy might eliminate the performance degradation after OpenMP is no longer needed, but the developer may still want an active-waiting policy for the OpenMP-intensive region of the application. This routine notifies all threads of the policy change at the same time, regardless of whether they are idle or active (to avoid deadlock from waiting and signaling threads using different policies).
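
For example, a sketch of switching to a passive wait policy once the OpenMP-intensive phase of a program is complete (the prototype repeats the declaration above; the surrounding function is hypothetical):

    void cray_omp_set_wait_policy(const char *policy);   /* Cray-specific routine */

    void end_of_openmp_phase(void)       /* hypothetical program phase */
    {
        /* It is an error to call this inside an active parallel region. */
        cray_omp_set_wait_policy("passive");   /* case insensitive */
    }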

CRAY_OMP_CHECK_AFFINITY

Set the CRAY_OMP_CHECK_AFFINITY variable to TRUE at execution time to display affinity binding for each OpenMP thread. The messages contain hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding.
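
For example (a sketch; the executable name and aprun flags are illustrative):

    export CRAY_OMP_CHECK_AFFINITY=TRUE
    aprun -n 1 -d 8 ./hello       # one binding message per OpenMP thread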

OpenMP Accelerator Support

The OpenMP 4.5 target directives are supported for targeting NVIDIA GPUs or the current CPU target. An appropriate accelerator target module must be loaded to use target directives.