Tasks
Parthenon’s tasking infrastructure is how downstream applications describe
and execute their work. Tasks are organized into a hierarchy of objects.
TaskCollections have one or more TaskRegions, TaskRegions have
one or more TaskLists, and TaskLists can have one or more sublists
(that are themselves TaskLists).
Task
Though downstream codes never have to interact with the Task object directly,
it's useful to describe it nonetheless. A Task object is essentially a functor
that stores the necessary data to invoke a downstream code's functions with
the desired arguments. Importantly, however, it also stores information that
relates it to other tasks, namely the tasks that must be complete before
it should execute and the tasks that may be available to run after it completes.
In other words, Tasks are nodes in a directed (possibly cyclic) graph, and
each stores the edges that connect into it and emerge from it.
TaskList
The TaskList class stores a vector of all the tasks and sublists (each a nested
TaskList) added to it. Additionally, it stores various bookkeeping
information that facilitates the more advanced features described below. Adding
tasks and sublists is the only way to interact with TaskList objects.
The basic call to AddTask takes the task's dependencies, the function to be
executed, and the arguments to that function as its arguments. AddTask returns
a TaskID object that can be used in subsequent calls to AddTask as a
dependency, either on its own or combined with other TaskIDs via the |
operator. Use of the | operator is historical and perhaps a bit misleading, as
it really acts as a logical and – that is, all tasks combined with | must be
complete before the dependencies are satisfied. An overload of AddTask takes
a TaskQualifier object as the first argument, which specifies certain special,
non-default behaviors. These will be described below. Note that the default
constructor of TaskID produces a special object that, when passed into
AddTask, signifies that the task has no dependencies.
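For instance, given a TaskList tl, a minimal sketch of building a small
dependency graph might look like the following (the function names
ComputeFluxes, AddSources, and UpdateVars and the argument md are hypothetical
stand-ins for a downstream code's task functions and data):
TaskID none; // default-constructed TaskID: no dependencies
// each call returns a TaskID that later tasks can depend on
auto flx = tl.AddTask(none, ComputeFluxes, md);
auto src = tl.AddTask(none, AddSources, md);
// flx | src requires *both* tasks to finish before UpdateVars runs
auto update = tl.AddTask(flx | src, UpdateVars, md);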
The AddSublist function adds a nested TaskList to the TaskList on
which it's called. The principal use case for this is to add iterative cycles
to the graph, allowing one to execute a series of tasks repeatedly until some
criteria are satisfied. The call takes as arguments the dependencies (via
TaskIDs combined with |) that must be complete before the sublist
executes and a std::pair<int, int> specifying the minimum
and maximum number of times the sublist should execute. Passing something like
{min_iters, max_iters} as the second argument should suffice, with {1, 1}
leading to a sublist that never cycles. AddSublist
returns a std::pair<TaskList&, TaskID> which is conveniently accessed via
a structured binding, e.g.
TaskID none;
auto [child_list, child_list_id] = parent_list.AddSublist(dependencies, {1,3});
auto task_id = child_list.AddTask(none, SomeFunction, arg1, arg2);
In the above example, passing none as the dependency for the task added to
child_list does not imply that this task can execute at any time since
child_list itself has dependencies that must be satisfied before any of its
tasks can be invoked.
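If the sublist should be able to stop cycling before max_iters, it typically
also contains a task marked with the completion qualifier described below. A
minimal sketch, continuing the example above with a hypothetical
CheckConvergence function that returns TaskStatus::complete once its criterion
is met and TaskStatus::iterate otherwise:
// hypothetical convergence check that controls the sublist's cycling
auto check_id = child_list.AddTask(TaskQualifier::completion, task_id,
                                   CheckConvergence, arg1, arg2);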
TaskRegion
Under the hood, a TaskRegion is a directed, possibly cyclic graph. The graph
is built up incrementally as tasks are added to the TaskLists within the
TaskRegion, and its construction is completed the first time it is
executed. TaskRegions can have one or more TaskLists. The primary reason
for this is to allow flexibility in how work is broken up into tasks (and
eventually kernels). A region with many lists will produce many small
tasks/kernels, but may expose more asynchrony (e.g. MPI communication). A region
with fewer lists will produce more work per kernel (which may be good for GPUs,
for example), but may limit asynchrony. Typically, each list is tied to a unique
partition of the mesh blocks owned by a rank. TaskRegion only provides a few
public-facing functions:
- TaskListStatus Execute(ThreadPool &pool): TaskRegions can be executed, requiring a
ThreadPool be provided by the caller. In practice, Execute is usually
called from the Execute member function of TaskCollection.
- TaskList& operator[](const int i): return a reference to the ith
TaskList in the region.
- size_t size(): return the number of TaskLists in the region.
TaskCollection
A TaskCollection contains a
std::vector<TaskRegion>, i.e. an ordered list of TaskRegions.
Importantly, each TaskRegion will be executed to completion before
subsequent TaskRegions, introducing a notion of sequential
execution and enabling flexibility in task granularity. For example, the
following code fragment uses the TaskCollection and TaskRegion
abstractions to express work that can be done asynchronously across
blocks, followed by a bulk synchronous task involving all blocks, and
finally another round of asynchronous work.
TaskCollection tc;
TaskRegion &tr1 = tc.AddRegion(nmb);
for (int i = 0; i < nmb; i++) {
  auto task_id = tr1[i].AddTask(dep, foo, args, blocks[i]);
}
TaskRegion &tr2 = tc.AddRegion(1);
auto sync_task = tr2[0].AddTask(dep, bar, args, blocks);
TaskRegion &tr3 = tc.AddRegion(nmb);
for (int i = 0; i < nmb; i++) {
  auto task_id = tr3[i].AddTask(dep, foo, args, blocks[i]);
}
TaskCollection provides a few
public-facing functions:
- TaskRegion& AddRegion(const int num_lists): Add and return a reference to
a new TaskRegion with the specified number of TaskLists.
- TaskListStatus Execute(ThreadPool &pool): Execute all regions in the
collection. Regions are executed completely, in the order they were added,
before moving on to the next region. Task execution will take advantage of
the provided ThreadPool to (possibly) execute tasks across TaskLists
in each region concurrently.
- TaskListStatus Execute(): Same as above, but execution will use an
internally generated ThreadPool with a single thread.
NOTE: Work remains to make the rest of
Parthenon thread-safe, so it is currently required to use a ThreadPool
with one thread.
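Putting these pieces together, a driver might run a collection like this (a
minimal sketch; tc is a TaskCollection built as in the example above, and the
status check assumes the TaskListStatus::complete enumerator):
// execute every region in order; the no-argument overload supplies its
// own single-thread ThreadPool internally
auto status = tc.Execute();
if (status != TaskListStatus::complete) {
  // execution did not finish cleanly; handle the error here
}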
Mesh data and packing over collections of mesh blocks
The most common way to interact with collections of mesh blocks is the
MeshData type, which collects MeshBlockData across a collection
of blocks. It is through MeshData that sparse packs can be built,
and multiple integrator stages, e.g., for a Runge-Kutta integrator,
may be exposed.
When building a set of task lists over a collection of mesh blocks, you may choose the number of blocks per task list by hand, but defaults can be set at runtime in one of two ways. The runtime parameter
<parthenon/mesh>
pack_size = 1
specifies the number of blocks per pack. For example, a value of 1 indicates that each task list, even in synchronous regions, covers only 1 block. A value of -1 indicates that each pack contains all blocks on the rank. You may choose -1 or any positive value.
Alternatively, you may set
<parthenon/mesh>
packs_per_rank = 1
which specifies the number of packs on a given rank. A value of 1 indicates that each task list contains all blocks on the rank. A value of 2 indicates that each list contains half of them, and so on.
If both of these parameters are set, then pack_size takes
precedence. If neither is set, the default is pack_size=-1, which
is equivalent to packs_per_rank=1.
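For example, on a rank that owns 16 blocks, the setting
<parthenon/mesh>
pack_size = 4
(or, equivalently, packs_per_rank = 4) produces four task lists of four blocks each.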
In a task list, you may access this information as, for example,
const int num_partitions = pmesh->DefaultNumPartitions();
which will be equal to the number of packs on the rank. Parthenon can
automatically partition your list of mesh blocks into smaller block lists and
wrap each in a MeshData object with the GetOrAdd method. For example:
TaskRegion &tr = tc.AddRegion(num_partitions);
for (int i = 0; i < num_partitions; ++i) {
  auto &tl = tr[i];
  // mbase now points to the ith partition of the full block list
  // and is the equivalent of the MeshData object named "base"
  auto &mbase = pmesh->mesh_data.GetOrAdd("base", i);
}
and GetOrAdd may be used on any “complete” MeshData object
you’ve created on the entire block list.
TaskQualifier
TaskQualifiers provide a mechanism for downstream codes to alter the default
behavior of specific tasks in certain ways. The qualifiers are described below:
- TaskQualifier::local_sync : Tasks marked with local_sync synchronize across
lists in a region on a given MPI rank. Tasks that depend on a local_sync
marked task gain dependencies from the corresponding task on all lists within
a region. A typical use for this qualifier is to do a rank-local reduction, for
example before initiating a global MPI reduction (which should be done only once
per rank, not once per TaskList). Note that Parthenon links tasks across
lists in the order they are added to each list, i.e. the nth local_sync task
in a list is assumed to be associated with the nth local_sync task in all
lists in the region.
- TaskQualifier::global_sync : Tasks marked with global_sync implicitly have
the same semantics as local_sync, but additionally do a global reduction on the
TaskStatus to determine if/when execution can proceed on to dependent tasks.
- TaskQualifier::completion : Tasks marked with completion can lead to exiting
execution of the owning TaskList. If these tasks return TaskStatus::complete
and the minimum number of iterations of the list have been completed, the remainder
of the task list will be skipped (or the iteration stopped). Returning
TaskStatus::iterate leads to continued execution/iteration, unless the maximum
number of iterations has been reached.
- TaskQualifier::once_per_region : Tasks with the once_per_region qualifier
will only execute once (per iteration, if relevant) regardless of the number of
TaskLists in the region. This can be useful when, for example, doing MPI
reductions, printing out some rank-wide state, or calling a completion task
that depends on some global condition where all lists would evaluate identical code.
TaskQualifiers can be combined via the | operator and all combinations are
supported. For example, you might mark a task global_sync | completion | once_per_region
if it were a task to determine whether an iteration should continue that depended
on some previously reduced quantity.
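As a sketch of that situation, assuming a solver sublist solver_list, a prior
task prev_id, and a hypothetical ContinueIteration function operating on data md:
// executes once per region, shares its status across ranks, and can
// terminate the iteration by returning TaskStatus::complete
auto stop_id = solver_list.AddTask(
    TaskQualifier::global_sync | TaskQualifier::completion | TaskQualifier::once_per_region,
    prev_id, ContinueIteration, md);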