Nested Parallelism

The loop wrappers documented here abstract a hierarchical parallelism model, which allows finer-grained control over core-level versus thread- and vector-level parallelism and allows explicitly managed caches (scratch memory) for tightly nested loops. These wrappers provide a simplified interface to the hierarchical parallelism in Kokkos.

For an example of the nested parallel wrappers in use, see the unit test.

par_for_outer

par_for_outer abstracts the team or multicore parallelism for outer loops. Inside the loop body, i.e., in the lambda function provided to par_for_outer, synchronization and memory sharing between the threads of a single team are possible through the Kokkos member_type team member handle.

The number of bytes of scratch memory (cache) needed by a single team is specified via the scratch_size_in_bytes argument, which needs to be computed using ScratchPadXD::shmem_size.

The scratch_level argument defines where the scratch memory is allocated. For CUDA GPUs, scratch_level=0 allocates the cache in the faster but smaller shared memory, and scratch_level=1 allocates the cache in the slower but larger global memory (device RAM). For CPUs, scratch_level currently makes no difference.

Note that every thread within a team executes the code inside a par_for_outer but outside of a par_for_inner.
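
As a minimal sketch of this pattern: one team is launched per (k, j) index pair, and each team gets its own scratch pad. The bounds nk, nj, ni, the views in/out, and Real are placeholders assumed to be defined elsewhere, and the exact loop-pattern tag, type names (team_mbr_t, DevExecSpace, ScratchPad1D), and argument order should be checked against the unit test.

```c++
// Sketch only: one team per (k, j) pair, each with its own scratch pad.
// nk, nj, ni, in, out, and Real are assumed to be defined by the caller.
const int scratch_level = 0;  // faster but smaller memory on CUDA GPUs
const std::size_t scratch_size_in_bytes =
    parthenon::ScratchPad1D<Real>::shmem_size(ni);

parthenon::par_for_outer(
    parthenon::outer_loop_pattern_teams_tag, "example outer loop",
    parthenon::DevExecSpace(), scratch_size_in_bytes, scratch_level,
    0, nk - 1, 0, nj - 1,
    KOKKOS_LAMBDA(parthenon::team_mbr_t team_member, const int k, const int j) {
      // Every thread of the team executes this part of the body.
      parthenon::ScratchPad1D<Real> scratch(
          team_member.team_scratch(scratch_level), ni);
      // ... fill and use `scratch` via par_for_inner (see below) ...
    });
```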

par_for_inner

par_for_inner abstracts the thread- and vector-level parallelism of compute units within a single team or core. Work defined through a par_for_inner is distributed between the individual threads and vector lanes within the team.
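
Inside the outer lambda sketched above, the innermost loop can then be handed to the team. This is again a sketch; the overload without an explicit loop-pattern tag is assumed to pick the configured default, and the bounds are assumed to be inclusive.

```c++
// Sketch only: distribute the i-loop over the threads and vector lanes
// of this team.
parthenon::par_for_inner(team_member, 0, ni - 1, [&](const int i) {
  out(k, j, i) = 2.0 * in(k, j, i);
});
```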

ScratchPadXD

Data type for arrays in scratch pad (cache) memory. Use ScratchPadXD::shmem_size, which is documented in the Kokkos documentation, to determine the scratch pad memory needed before kernel launch.
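
For instance, a kernel that needs two 2D scratch pads of shape (nvar, ni) per team could size and allocate them as follows. This is a sketch; nvar, ni, and Real are placeholders, and ScratchPadXD is assumed to follow the usual Kokkos::View interface.

```c++
// Before the kernel launch: total per-team scratch requirement.
const std::size_t scratch_size_in_bytes =
    2 * parthenon::ScratchPad2D<Real>::shmem_size(nvar, ni);

// Inside the par_for_outer lambda: allocate the pads from team scratch.
parthenon::ScratchPad2D<Real> ql(team_member.team_scratch(scratch_level), nvar, ni);
parthenon::ScratchPad2D<Real> qr(team_member.team_scratch(scratch_level), nvar, ni);
```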

Important usage hints

To ensure that the individual threads of a team are synchronized, always call team_member.team_barrier(); after a par_for_inner if the following execution depends on the results of that par_for_inner. This pertains, for example, to filling a ScratchPadXD array in one par_for_inner and using the scratch array in the next one; see the unit test for sample usage.
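
A sketch of that pattern (scratch, in, out, and the bounds are placeholders):

```c++
// Fill the scratch pad ...
parthenon::par_for_inner(team_member, 0, ni - 1, [&](const int i) {
  scratch(i) = in(k, j, i);
});
// ... make sure all threads of the team are done filling it ...
team_member.team_barrier();
// ... before reading entries that another thread may have written.
parthenon::par_for_inner(team_member, 1, ni - 1, [&](const int i) {
  out(k, j, i) = scratch(i) - scratch(i - 1);
});
```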

In addition, entry to a par_for_inner does not imply a barrier, and not all threads of a team may even enter an inner parallel region (e.g., if there is not enough work, i.e., not enough indices, for all team members). This can lead to unintended side effects when all team members write to common variables; see this code for an example.

CMake Options

PAR_LOOP_INNER_LAYOUT controls how the inner loop is implemented.

PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP uses the Kokkos TeamVectorRange, which merges TeamThreadRange and ThreadVectorRange into one loop, to distribute work between threads. This is the only option supported for CUDA, since the Kokkos loops are required for parallelization on GPUs.
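
Conceptually, the inner loop then reduces to something like the following (a sketch of the underlying Kokkos call, not the exact implementation; the wrapper's bounds are assumed to be inclusive, hence the iu + 1):

```c++
// TVR_INNER_LOOP: one Kokkos range over threads and vector lanes.
Kokkos::parallel_for(Kokkos::TeamVectorRange(team_member, il, iu + 1),
                     [&](const int i) { /* loop body */ });
```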

PAR_LOOP_INNER_LAYOUT=SIMDFOR_INNER_LOOP uses a plain for loop with #pragma omp simd to vectorize the loop. This typically yields better vectorization than PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP on CPUs and is therefore the default on CPUs.
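
Conceptually, this corresponds to (again a sketch, assuming inclusive bounds):

```c++
// SIMDFOR_INNER_LOOP: a plain loop annotated for compiler vectorization.
#pragma omp simd
for (int i = il; i <= iu; i++) {
  /* loop body */
}
```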