Unified Parallel C (UPC)

This chapter describes the Cray-specific UPC functionality available in CCE and the features of the specification that are implementation defined.

This release supports the UPC Language Specification, Version 1.3. The UPC 1.3 standard is discussed on the UPC specification website, http://code.google.com/p/upc-specification.

For more information, see the intro_pgas(7) man page or the appropriate UPC man page.

This chapter assumes familiarity with UPC and an understanding of the differences between the published UPC Introduction and Language Specification paper and the current UPC specification. For an introduction to UPC, refer to the UPC home page at http://upc.gwu.edu. Under the Publications link, select the Introduction to UPC and Language Specification paper. This paper is slightly outdated but contains valuable information about understanding and using UPC. The UPC home page also contains, under the Documentation link, the UPC Language Specification paper.

UPC expresses parallelism explicitly through language syntax rather than through library functions such as those used in MPI and SHMEM: the memory of other processes is read and written with simple assignment statements. Program synchronization occurs only when explicitly programmed; there is no implied synchronization.

UPC is a dialect of the C language. It is not available in C++.

UPC presents a program as a collection of threads operating in a common global address space, without the burden of the details of how parallelism is implemented on the machine (for example, as shared memory or as a collection of physically distributed memories).

UPC data objects are either private to a single thread or shared among all threads of execution. Each thread has a unique memory space that holds its private data objects, and has access to a globally shared memory space that is distributed across the threads. Thus, every part of a shared data object has affinity to a single thread.
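
As an illustration of affinity, the UPC language definition distributes the elements of a shared array across threads round-robin in blocks: element i of an array declared with block size B has affinity to thread floor(i / B) mod THREADS, and the default block size is 1. The following plain C sketch mirrors that rule so that it can be checked without a UPC compiler. The function name affinity_thread is hypothetical and is not part of UPC or CCE.

```c
#include <stddef.h>

/*
 * Thread that has affinity to element `index` of a shared array that is
 * distributed with block size `block` over `threads` threads.
 * Hypothetical helper for illustration only; in UPC itself,
 * upc_threadof() reports the affinity of a shared object.
 * Requires block > 0 and threads > 0.
 */
size_t affinity_thread(size_t index, size_t block, size_t threads)
{
    return (index / block) % threads;
}
```

With 4 threads and the default block size of 1, elements 0 through 4 map to threads 0, 1, 2, 3, 0. With a block size of 2, the same elements map to threads 0, 0, 1, 1, 2.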

Cray UPC is compatible with MPI. While it may work in some cases, mixing language-based PGAS with SHMEM is not officially supported.

UPC 1.3 supports a parallel I/O model which provides control over file synchronization. However, a program that continues to use the regular C I/O routines must supply its own controls as needed to remove race conditions. File I/O under UPC is very similar to standard C: one thread opens a file and shares the file handle, and multiple threads may then read from or write to the same file.

Cray UPC supports GASP instrumentation. GASP instrumentation enables the use of external performance tools, such as the Parallel Performance Wizard (PPW) from the University of Florida. For more information on GASP and PPW, see http://gasp.hcs.ufl.edu and http://ppw.hcs.ufl.edu. To instrument for GASP, refer to the command line option -h [no]gasp[=opt[:opt]].

Predefined Macros

The following UPC 1.3 preprocessor macros are supported, with the values shown:
  • __UPC__: 1
  • __UPC_VERSION__: 201309 (corresponds to the publication date of the 1.3 specification)
  • UPC_MAX_BLOCK_SIZE: 1073741823
  • __UPC_DYNAMIC_THREADS__: 1 (if compiling for dynamic threads, otherwise undefined)
  • __UPC_STATIC_THREADS__: 1 (if compiling for static threads, otherwise undefined)
  • __UPC_COLLECTIVE__: 1
  • __UPC_TICK__: 1
  • __UPC_CASTABLE__: 1
  • __UPC_IO__: 1
  • __UPC_NB__: 1
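
The macros above can be tested with ordinary preprocessor conditionals so that one source file builds both with and without -hupc. The following is a minimal sketch; the names build_mode, detect_build_mode, and has_nonblocking_copies are hypothetical and chosen for illustration. When compiled as plain C, none of the UPC macros is defined and the fallback branches are taken.

```c
/* Pull in the optional UPC 1.3 non-blocking copy interface only when
 * the compiler advertises it. Skipped entirely in a plain C build. */
#ifdef __UPC_NB__
#include <upc_nb.h>
#endif

/* Hypothetical classification of how this file was compiled. */
enum build_mode {
    BUILD_NOT_UPC = 0,         /* neither thread-mode macro is defined */
    BUILD_STATIC_THREADS = 1,  /* __UPC_STATIC_THREADS__ is defined */
    BUILD_DYNAMIC_THREADS = 2  /* __UPC_DYNAMIC_THREADS__ is defined */
};

enum build_mode detect_build_mode(void)
{
#if defined(__UPC_STATIC_THREADS__)
    return BUILD_STATIC_THREADS;
#elif defined(__UPC_DYNAMIC_THREADS__)
    return BUILD_DYNAMIC_THREADS;
#else
    return BUILD_NOT_UPC;
#endif
}

/* Returns 1 when the non-blocking copy library is available, else 0. */
int has_nonblocking_copies(void)
{
#ifdef __UPC_NB__
    return 1;
#else
    return 0;
#endif
}
```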

False Sharing

There is a false sharing hazard when referencing shared char and short integers.

If two PEs store a char or short to the same 64-bit word in memory without synchronization, incorrect results can occur. It is possible for one PE's store to be lost. This is because these stores are implemented by reading the entire 64-bit word, inserting the char or short value and writing the entire word back to memory.

The following trace shows two PEs writing two different characters into the same word in memory without synchronization:
                 Register     Memory
Initial Value                 0x0000
PE 0 Reads         0x0000     0x0000
PE 1 Reads         0x0000     0x0000
PE 0 Inserts 3     0x3000     0x0000
PE 1 Inserts 7     0x0700     0x0000
PE 0 Writes        0x3000     0x3000
PE 1 Writes        0x0700     0x0700

Notice that the value stored by PE 0 has been lost. The final value intended was 0x3700. This situation is referred to as false sharing. It is the result of supporting data types that are smaller than the smallest type that can be individually read or written by the hardware. UPC programmers must take care when storing to shared char and short data that this situation does not occur.
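
The interleaving in the trace can be replayed deterministically in plain C. The sketch below follows the trace's simplified notation (a 16-bit word holding 4-bit values) rather than the 64-bit word and char or short values used by the hardware, and it models the two PEs as sequential steps in one function. All function names are hypothetical.

```c
#include <stdint.h>

/* Insert a 4-bit value into position `pos` of a 16-bit word,
 * where position 0 is the most significant hex digit. */
uint16_t insert_nibble(uint16_t word, unsigned pos, unsigned value)
{
    unsigned shift = 12u - 4u * pos;
    uint16_t mask = (uint16_t)(0xFu << shift);
    return (uint16_t)((word & ~mask) | ((value & 0xFu) << shift));
}

/* Both PEs read the word before either one writes it back. */
uint16_t replay_unsynchronized(void)
{
    uint16_t memory = 0x0000;
    uint16_t reg0 = memory;            /* PE 0 reads            */
    uint16_t reg1 = memory;            /* PE 1 reads            */
    reg0 = insert_nibble(reg0, 0, 3);  /* PE 0 inserts 3: 0x3000 */
    reg1 = insert_nibble(reg1, 1, 7);  /* PE 1 inserts 7: 0x0700 */
    memory = reg0;                     /* PE 0 writes           */
    memory = reg1;                     /* PE 1 writes, overwriting PE 0 */
    return memory;                     /* 0x0700: PE 0's store is lost */
}

/* PE 1 starts its read-modify-write only after PE 0 has finished,
 * as it would if the two stores were separated by synchronization. */
uint16_t replay_serialized(void)
{
    uint16_t memory = 0x0000;
    uint16_t reg0 = memory;
    reg0 = insert_nibble(reg0, 0, 3);
    memory = reg0;                     /* memory is now 0x3000 */
    uint16_t reg1 = memory;
    reg1 = insert_nibble(reg1, 1, 7);
    memory = reg1;
    return memory;                     /* 0x3700: both stores survive */
}
```

The only difference between the two functions is when PE 1 reads the word. Reading it after PE 0's write completes is what preserves both values.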

Compile and Link UPC Code

Compiling a PGAS application (UPC, Fortran 2008) requires the PrgEnv-cray module to be loaded.

The -hupc option is required to enable recognition of UPC syntax because it is not part of the standard C language.

The -X npes option can be used to define the number of threads and statically set the value of the THREADS constant. See the description of the -X npes option for the requirements regarding its use.

The following command creates an executable file:
% cc -hupc hello.c -o hello

An executable can be created by linking together various object files that were generated from source code written in standard C, UPC, and Fortran. Either cc or ftn can be used to link the object files:
% cc -hupc x.o y.o z.o
% ftn x.o y.o z.o

For dynamic linking, add the -dynamic option. For information about linking PGAS applications to use huge pages, see the intro_hugepages(1) man page. The Cray implementation of UPC supports adding GASP instrumentation to UPC codes. To instrument for GASP, refer to the command line option -h gasp[=opt[:opt]].

Launch a UPC Application

After compiling the UPC code, run the program using the aprun command.

Launch the application using 128 PEs:
% aprun -n 128 ./hello

If the -X npes compiler option was used, the same number of threads must be specified in the aprun command. The processing elements specified by npes are compute node cores/PEs.

By default, each PE reserves 64 MB of symmetric heap space. To increase or decrease this amount, set the XT_SYMMETRIC_HEAP_SIZE environment variable to the desired number of bytes. The suffixes K, M, and G are permitted to simplify requests for large values:
% export XT_SYMMETRIC_HEAP_SIZE=512M
% aprun -n 128 ./hello
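
To make the suffix notation concrete, the following sketch converts a value such as 512M into a byte count. It is not the parser used by the Cray runtime: the function name parse_heap_size is hypothetical, and the sketch assumes that K, M, and G denote binary multiples (2^10, 2^20, and 2^30 bytes), which the text above does not state.

```c
#include <stdlib.h>

/*
 * Convert a size string such as "512M" to a number of bytes.
 * ASSUMPTION: K, M, and G are binary multiples (1024-based).
 * Returns 0 for input that is not a number with an optional
 * single K, M, or G suffix. Overflow is not checked.
 */
unsigned long long parse_heap_size(const char *text)
{
    char *end = NULL;
    unsigned long long value = strtoull(text, &end, 10);

    if (end == text)
        return 0;                      /* no digits were found */

    switch (*end) {
    case 'K': value <<= 10; end++; break;
    case 'M': value <<= 20; end++; break;
    case 'G': value <<= 30; end++; break;
    default:  break;                   /* no suffix: plain bytes */
    }

    return (*end == '\0') ? value : 0; /* reject trailing characters */
}
```

Under that assumption, 512M corresponds to 536870912 bytes and the 64 MB default corresponds to 67108864 bytes.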

The UPC runtime system uses GNI and DMAPP (low-level libraries) to implement a logically shared, distributed memory programming model. The symmetric heap is mapped onto huge pages by DMAPP. It is advisable to also map the static data and/or private heap onto huge pages. See the intro_hugepages(1) man page.

Cray Extensions

Cray extensions to UPC that are not part of the UPC Language Specification 1.3 are listed here.

A number of former extensions to UPC 1.2 have been standardized in UPC 1.3, including non-blocking bulk copies (upc_nb.h), privatizability (upc_castable.h) and timing (upc_tick.h) interfaces. These interfaces have been removed from the upc_cray.h header and moved into new headers as required by the UPC 1.3 specification. Additionally, some of the semantics and interfaces have been changed slightly, so existing users of these interfaces may need to update their applications.

The following interfaces, declared in upc_collective_cray.h, provide common collective operations on a subset (team) of threads. These are loosely based on the UPC Collectives Library 2.0 proposal, with changes to argument ordering to better match existing practice in UPC and no explicit initialization.
  • CRAY_UPC_TEAM_ALL
  • CRAY_UPC_TEAM_NODE
  • cray_upc_op_create(3c)
  • cray_upc_op_free(3c)
  • cray_upc_type_size(3c)
  • cray_upc_team_rank(3c)
  • cray_upc_team_size(3c)
  • cray_upc_team_split(3c)
  • cray_upc_team_free(3c)
  • cray_upc_team_barrier(3c)
  • cray_upc_team_allreduce(3c)
  • cray_upc_team_reduce(3c)

Include upc_cray.h to use the following extensions:
upc_nodeof()
Returns the index of the node of the thread that has affinity to the shared object pointed to by ptr. Similar to upc_threadof().
NODES
NODES is an expression with a value of type int; it specifies the number of nodes and has the same value on every thread in the job.
Similar to THREADS, but evaluates to the number of nodes used by the application, equal to the ceiling of the aprun -n value divided by the -N value.
MYNODE
MYNODE is an expression with a value of type int; it specifies the unique node index associated with the current thread and has the same value on all threads that are located on the same node.
Similar to MYTHREAD, but evaluates to a node number in the range 0 to NODES - 1, inclusive.
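
The relationship between NODES and the aprun arguments described above is an integer ceiling division. The following plain C sketch computes it; the function name nodes_for_job is hypothetical, and within a UPC program the NODES expression should be used instead.

```c
/*
 * Number of nodes used by a job: the ceiling of the aprun -n value
 * (total PEs) divided by the -N value (PEs per node).
 * Hypothetical helper; requires pes >= 0 and pes_per_node > 0.
 */
int nodes_for_job(int pes, int pes_per_node)
{
    return (pes + pes_per_node - 1) / pes_per_node;
}
```

For example, a job launched with -n 128 and -N 24 occupies 6 nodes, the last of which is only partly filled, while -n 128 with -N 32 occupies exactly 4 nodes.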