.. _boundary_communication: Boundary communication-in-one concepts ====================================== On CPU, the naive implementation of ghost halos for meshblocks is to loop over each geometric sub-area of a ghost halo relevant for communicating to neighboring meshblocks. In practice, on accelerators, this is not performant, as launching a GPU kernel per sub-halo costs significant latency. To resolve this issue, we coalesce GPU kernels into hierarchically paraellel loops, where the outer loop is over sub-halos, and the inner loop is over the cells within a sub-halo. The information required to, e.g., send, receive, prolongate, restrict information in each sub-halo is precomputed and cached. Per sub-halo, there is a struct, ``BndInfo_t``, which contains the relevant information needed on device to perform loop operations. The caching is enabled by two structs, ``BvarsSubCache_t`` which contains all the arrays on host and device needed mostly containing these ``BndInfo_t`` objects, and ``BvarsCache_t`` which is essentially an array of all the kinds of caches needed for the various kinds of boundary operations performed. We note that this infrastructure is more general than just ghost halos. The same machinery is used for communicating the interiors of meshblocks that are restricted and/or prolongated between geometric multi-grid levels. Additionally, with the ownership model for non-face fields, the basic communication infrastructure could deal with flux correction as well (although currently flux correction uses a somewhat separate code path). Buffer subsets -------------- Sometimes it is desirable, for example for custom prolongation and restriction or to communicate on a single geometric multi-grid level, to loop over a subset of the ghost sub-halos, rather than all halos at once. This is enabled by the ``buffer_subsets`` and ``buffer_subsets_h`` arrays, which are contained in ``BvarsSubCache_t``. The ``buffer_subsets`` array is a matrix, where the rows index the subset, and the columns point to the indices in the ``bnd_info`` array containing the subset of sub-halos you wish to operate on. To communicate across a particular boundary type, the templated boundary communication routines (see :ref:`boundary_comm_tasks`) should be instantiated with the desired ``BoundaryType``, i.e. .. code:: cpp SendBoundBufs(md); The different ``BoundaryType``\ s are: - ``any``: Communications are performed between all leaf blocks (i.e. the standard Parthenon grid that does not include multi-grid related blocks). - ``local``: Communications are performed between all leaf blocks that are on the current rank. *Currently, this option should not be used because there are possibly associated bugs. This and nonlocal would only be used as a potential performance enhancement, calling both should result in the same communication as just calling BoundaryType::any.* - ``nonlocal``: Communications are performed between all leaf blocks that are on different ranks than the current rank. *Currently, this option should not be used because there are possibly associated bugs.* - ``flxcor_send`` and ``flxcor_recv``: Used for flux correction routines, see below. - ``gmg_same``: Communicates ghost halos between blocks in the same geometric multi-grid level. - ``gmg_restrict_send`` and ``gmg_restrict_recv``: For restricting block interiors between geometric multi-grid levels, i.e. inter-grid communication. *It is probably not necessary to have separate communicators for sending and receiving, but for now this is the way if was written* - ``gmg_prolongate_send`` and ``gmg_prolongate_recv``: For prolongating block interiors between geometric multi-grid levels, i.e. inter-grid communication. *It is probably not necessary to have separate communicators for sending and receiving, but for now this is the way if was written* .. _sparse boundary comm: Sparse boundary communication ============================= Communication between meshblocks at all stages is required to synchronize fields that include ghost zones. For each pair of meshblocks **a** and **b** that share a boundary, we define two *communication channel*\ s (**a**->\ **b** and **b**->\ **a**) for each field that communicates ghost zone data. This communication channel can be shared amongst all stages contained in the ``MeshBlock`` field ``DataCollection> meshblock_data``, since the communication during different stages doesn’t overlap. A communication channel needs to contain some state that indicates if information has been sent from block **a** but not received on block **b**, or if the information has been received. Additionally, the communication channel should contain storage if data is being communicated, which may not always be the case for sparse variables. When the sender is sparse and unallocated, no storage communication buffer storage is required since the ghost zones of the receiver should be filled with default values. When the sender is allocated, we always allocate buffer storage. For (potentially) sparse variables, there should be five possible states of the buffer: - ``sending``: The sender has filled the buffer with data meets the threshold for sparse allocation. - ``sending_null``: The sender is either unallocated, so that it just contains default data, or the sender is allocated and but all of the values in the communicated region fall below the sparse threshold. - ``received``: The receiver has successfully queried that information has been sent and the buffer contains data that is above the sparse threshold. - ``received_null``: The receiver has successfully queried that information has been sent and the buffer contains data that is above the sparse threshold. - ``stale``: The receiver no longer needs the information currently stored by the communication channel. Information about the indices of the data being copied from the sender and where they are copied to in the receiver is kept separate from the communication channel in the current implementation. The only information the communication channel has is the size of buffer required. Communication channels on the Mesh ---------------------------------- Communication channels on a mesh can be uniquely identified by the global id (gid) of the sending block, the global id (gid) of the receiving block, the variable for which data is being sent, and the geometric element of the sending block that is shared with the receiving block, with the geometric element being the volume, a face, an edge, or a node of a block. The geometric element is necessary for unique identification, since with periodic boundary conditions two blocks may share more than a single geometric element. Each block has 27 geometric elements which can each be represented with a vector extending from the center of the block to the center of the geometric element (e.g. the volume of the block corresponds to (0, 0, 0),the top face of the block would be represented by (0, 0, 1), the bottom face by (0,0,-1), some edge by (0, 1, -1), and a vertex to (1, 1, 1)). This vector can be flattened as ``geom_idx = (vec.x + 1) + 3 * (vec.y + 1) * 9 * (vec.z + 1)`` to assign a unique element to each index. Also note that if two blocks share an element (or at least part of an element in the case of a multi-level mesh), the corresponding geometric element index of the second element is obviously ``geom_idx = (vec.x - 1) + 3 * (vec.y - 1) * 9 * (vec.z - 1)`` when ``vec`` is the offset vector of the first element. This information is stored in a ``NeighborIndexes`` object. In practice, we denote each channel by a unique key ``std::tuple{send_gid, receive_gid, var_name, geometric_element_index_of_sender}``, so that the ``Mesh`` can contain a map from these keys to communication channels. Then, at each remesh, sending blocks and blocks that are receiving from blocks on a different rank can create new communication channels and register them in this map. MPI Communication IDs ~~~~~~~~~~~~~~~~~~~~~ For MPI commnunication, we need to uniquely identify MPI messages between an ordered pair of ranks. In an easy world, we would do all communication on ``MPI_COMM_WORLD`` and just create a unique tag from the elements of the unique key tuple by doing something like ``tag = num_blocks_total * nvars * 27 * send_gid + nvars * 27 * receive_gid + 27 * var_idx + geometric_element_index_of_sender``, but this creates sparsely populated tag numbers and quickly blows through the maximum tag limit of 32767 defined by the MPI standard. In practice, we create an MPI communicator associated with each variable (see ``Mesh::SetupMPIComms()`` and ``Mesh::GetMPIComm(std::string)``) and for each pair of ranks fill a ``std::map`` contained in the ``Mesh`` with zeros. Since ``std::map`` is always sorted and we define comparators for ``UnorderedPair``, we end up with the same ordering of the map on both ranks of the pair. We can then iterate through the map and assign a unique, densely populated tag to each pair in the map, which does not blow through the MPI tag limit. The same tags can obviously be re-used in each communicator. Utilities classes for boundary communication -------------------------------------------- ``class CommBuffer`` ~~~~~~~~~~~~~~~~~~~~~~~ This general idea is implemented in the ``CommBuffer`` class template, which contains a field ``T buf_`` and access to that field through the method ``T& buffer()``. ``T`` is assumed to be some type that can act as a buffer for storing the communicated data that has methods ``size()`` and ``data()``, the latter of which must give access to the underlying storage that can be used by the MPI communication calls. ``CommBuffer`` has the public methods: - ``T& buffer()``: Gives access to the the actual data buffer for filling and reading. - ``void Allocate()``: Which allocates storage for the buffer and can be called internally by ``TryReceive()`` when non-null data is being communicated by a block on a different rank. - ``void Send()``: If the underlying storage buffer is allocated, sets the state of the buffer to ``sending``. If the underlying storage buffer is unallocated, set the state of the buffer to ``sending_null``. Also starts asynchronous MPI send if sender and receiver are on separate ranks. - ``void SendNull()``: Sets the buffer state to ``sending_null``. Also starts asynchronous MPI send of a zero length buffer if sender and receiver are on separate ranks. - ``void TryStartReceive()``: If on same rank, this does nothing. If on different ranks and irecv_started flag is set, does nothing. If on different ranks and it is a receiving buffer and ``do_sparse_allocation = false`` for the buffer, posts an ``MPI_Irecv`` right away, allocates the buffer if it is not already allocated, and flags that ``MPI_Irecv`` has been called. If on different ranks and it is a receiving buffer and ``do_sparse_allocation = true``, calls ``MPI_Iprobe`` to see if a message is available. If there is a message, check it is sending data or sending null, allocates or deallocates the buffer as necessary, and then posts an ``MPI_Irecv`` and sets the ``MPI_Irecv`` flag. - ``bool TryReceive()``: If on same rank, checks if state is ``sending`` or ``sending_null`` and sets to ``received`` or ``received_null``, respectively, and returns ``true``. If on different ranks, first calls ``TryStartReceive()`` then, if the ``MPI_Irecv`` has been posted tests wether or not it has been completed. If it has, sets the buffer state to ``received`` or ``received_null`` depending on the size of the incoming message and returns ``true``. Otherwise returns ``false``. - ``Stale()``: Sets the state to ``stale``. as well as copy constructors, assignment operators, etc. The constructor of ``CommBuffer`` is called as .. code:: cpp CommBuffer(mpi_message_tag, sender_rank, receiver_rank, mpi_communicator, [...capture necessary stuff...](){ return ...allocated object of type T that has the desired size...; }, do_sparse_allocation); The lambda passed to the constructor is stored as a field in the class and is called when the internal storage buffer needs to be allocated (see ``BuildBoundaryBuffers`` in ``sparse_bvals_cc_in_one.cpp`` for an example usage). Aside from during construction, there should be no difference in useage between a same rank to same rank ``CommBuffer`` and a separate rank ``CommBuffer``. *Note that setting ``do_sparse_allocation = true`` minimizes the memory allocated for sparse variables but may result in slower MPI communication since ``MPI_Irecv`` can’t be posted until the incoming message size is known. In simple tests, it appears that this does not give a significant slow down, so all ``Metadata::Sparse`` variables use sparse allocation. If in the future there is a need to turn this on and off on a per variable there is a flag, ``Metadata::SparseCommunication``, that can be set for variables to make them use this memory minimizing communication pattern. This would also be required a change in ``BuildBoundaryBuffers`` switching how the flag for using sparse buffers is set.* ``class ObjectPool`` ~~~~~~~~~~~~~~~~~~~~~~~ An ``ObjectPool`` hands out reference counted objects that publicly inherit from ``T``, which is assumed to be something like a Kokkos view which has an assignment operator and copy ctor that perform shallow copies. Rather than creating a new object each time one is requested, an ``ObjectPool`` recycles previously created objects that have been released back to the pool to limit the number of times that objects need to allocated (in particular on device). When the class method ``Get()`` is called a pre-allocated, free resource is handed out if one is available, otherwise a new resource is created. An object pool has two types of objects it can hand out, .. code:: cpp class ObjectPool::weak_t : public T {...}; class ObjectPool::owner_t : public weak_t {...}; both of which contain a pointer to the object pool that handed them out and a key that identifies the unique resource in the pool that they reference. An ``owner_t`` object contributes to the reference count for a particular resource. When the reference count goes to zero in an ``owner_t`` dtor, the underlying resource is added to the stack of available objects and the key associated with the resource is invalidated, but a copy of it still exists (so, for instance, the reference count of an underlying ``Kokkos::View`` will not go to zero and its memory will not be released). A ``weak_t`` object does not contribute to the reference count and its underlying resource can become invalid. The validity of an object can be checked with the member method ``bool IsValid()``. The mechanism by which new pool objects are created is specified by a lambda passed to the ``ObjectPool`` constructor: .. code:: cpp template using dev_arr_t = typename Kokkos::View; int N = 100; Pool>([N](){ return dev_arr_t("test pool", N); }); On host a lot of this functionality could be replicated with ``shared_ptr`` I think, but it is somewhat useful for these objects to be able to exist on device (even though the reference counting doesn’t work there). Sparse boundary communication implementation -------------------------------------------- The tasks for sparse cell centered variable boundary communication pretty closely mirror the old ``bvals_in_one`` tasks but allow for allocation and deallocation of the communication buffers on the fly. The ``BndInfo`` class (which comes from the old bvals_in_one implementation) stores the index ranges and data arrays necessary for copying from the mesh to a buffer or vice versa. These are cached within a one dimensional par array so that loading and unloading of buffers can done “in one” kernel launch on device (in exactly the same way as the old ``bvals_in_one`` setup). The ``BndInfo`` objects are cached in each ``MeshData`` object within a ``BvarsCache_t`` object (which can contain a number of sub-caches ``BvarsSubCache_t`` that correspond to different subsets of communication channels) to limit the number of deep copies from host to device. We also add a map of object pools containing pools for various buffer sizes. .. code:: cpp template using buf_pool_t = ObjectPool> std::unordered_map> pool_map; As well as the map from communication channel keys to communication buffers associated with each channel .. code:: cpp using channel_key_t = std::tuple; std::unordered_map::owner_t>> boundary_comm_map; Note that every stage shares the same ``CommBuffer``\ s, but we keep separate buffers for boundary value communication and flux correction communication so these operations can occur concurrently if necessary. Send and Receive Ordering ~~~~~~~~~~~~~~~~~~~~~~~~~ In each cache, we build a ``std::vector*> send_buf_vec, recv_buf_vec`` which contains pointers to every communication channel associated with sending/receiving from a ``MeshData`` object. This is useful for a couple of reasons. First, this speeds up the code by reducing the number of times a ``std::unordered_map`` lookup from the ``boundary_comm_map`` is required. Second, we allow for any ordering of ``*_buf_vec`` (by including a secondary array for indexing between sequential index defined by the order of ``ForEachBoundary`` to the index in the buffer cache). The ordering of this vector determines the order in which ``MPI_Isend`` and ``MPI_Irecv`` calls are posted, which can impact the communication performance. *This is something that can be experimented with for optimal performance. Strangely, I have seen the best results on test problems for random ordering, but it is not clear if this generalizes to more realistic problems not being run with all ranks on the same node. See ``InitializeBufferCache(...)`` for how to choose the ordering.* .. _boundary_comm_tasks: Boundary Communication Tasks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. topic:: ``BuildBoundaryBuffers(std::shared_ptr>&)`` * Iterates over communication channels sending or receiving from blocks in ``md``. For every sending channel it creates a communication channel for each in the ``Mesh::boundary_comm_map``. For receiving channels where the blocks are on different ranks, it also creates a receiving channel in ``Mesh::boundary_comm_map`` since the sender will not add this channel on the current rank. Also creates new ``buf_pool_t``\ s for the required buffer sizes if they don’t already exist. Note that no memory is saved for the communication buffers at this point. * This is called during ``Mesh::Initialize(...)`` and during ``EvolutionDriver::InitializeBlockTimeStepsAndBoundaries()`` and before this task is called ``Mesh::boundary_comm_map`` is cleared. **This should not be called in downstream code.** .. topic:: ``SendBoundBufs(std::shared_ptr>&)`` * Iterates over boundaries of ``bound_type``, which supports ``any``, ``local`` which implies the communicating blocks are on the same rank, and ``nonlocal`` which implies differing ranks for the two blocks. Could have one task with ``any`` or split communication into separate ``local`` and ``nonlocal`` tasks. * ``SendBoundaryBuffers`` is just an alias for ``SendBoundBufs`` to ensure backward compatibility. * Allocates buffers if necessary based on allocation status of block fields and checks if ``MeshData::send_bnd_info`` objects are stale. * Rebuilds the ``MeshData::send_bnd_info`` objects if they are stale * Restricts where necessary * Launches kernels to load data from fields into buffers, checks whether any of the data is above the sparse allocation threshold. * Calls ``Send()`` or ``SendNull()`` from all of the boundary buffers depending on their status. .. topic:: ``StartReceiveBoundBufs(std::shared_ptr>&)`` * Iterates over boundaries of ``bound_type``, which supports ``any``, ``local`` which implies the communicating blocks are on the same rank, and ``nonlocal`` which implies differing ranks for the two blocks. This is a no-op for ``local`` boundaries. * Posts/tries to post an ``MPI_Irecv`` for receiving buffers. For performance, it is often necessary to call this early in the task list before the rest of the communication routines get called. Codes will produce correct results without ever calling this task though. .. topic:: ``ReceiveBoundBufs(std::shared_ptr> &)`` * Iterates over boundaries of ``bound_type``, which supports ``any``, ``local`` which implies the communicating blocks are on the same rank, and ``nonlocal`` which implies differing ranks for the two blocks. Could have one task with ``any`` or split communication into separate ``local`` and ``nonlocal`` tasks. * ``ReceiveBoundaryBuffers`` is just an alias for ``ReceiveBoundBufs`` to ensure backward compatibility. * Tries to receive from each of the receive channels associated with ``md`` of the chosen boundary type. * If the receive is succesful, and allocation of the associated field is required, allocate it. .. topic:: ``SetBounds(std::shared_ptr>& md)`` * Iterates over boundaries of ``bound_type``, which supports ``any``, ``local`` which implies the communicating blocks are on the same rank, and ``nonlocal`` which implies differing ranks for the two blocks. Could have one task with ``any`` or split communication into separate ``local`` and ``nonlocal`` tasks. * ``SetBoundaries`` is just an alias for ``SetBounds`` to ensure backward compatibility. * Check if ``MeshData::recv_bnd_info`` needs to be rebuilt because of changed allocation status. * Rebuild ``MeshData::recv_bnd_info`` if necessary. * Launch kernels to copy from buffers into fields or copy default data into fields if sending null. * Stale the communication buffers. * Restrict ghost regions where necessary to fill prolongation stencils. Flux Correction Tasks ~~~~~~~~~~~~~~~~~~~~~ Flux correction is required to ensure conservation at fine-coarse boundaries, as the sum of fluxes on a fine block corresponding to a single flux on a coarse neighbor block is not guaranteed to be equal to the coarse flux. To ensure conservation, coarse fluxes at fine-coarse boundaries are replaced with the sums of fine fluxes in an interblock communication step separate from filling ghost zones. Nevertheless, Parthenon uses the same machinery for flux correction as ghosts communication since they are essentially the same operation, just on different fields and different index ranges. Flux correction can be implemented in a task list via the calls - ``StartReceiveBoundBufs(std::shared_ptr>&)`` - ``SendBoundBufs(std::shared_ptr>&)`` - ``ReceiveBoundBufs(std::shared_ptr>&)`` - ``SetBoundBufs(std::shared_ptr>&)`` which cause fields with `Metadata::Flux` set to restrict and then communicate only shared elements only on fine to coarse boundaries (as was done in legacy Parthenon versions of flux correction that only worked for cell variable flux correction). Notice that the send operation requires a different ``BoundaryType`` than the receive operation, unlike regular boundary communication. All of these tasks can be added to a task list in one call using ``AddFluxCorrectionTasks``. Flux correction for sparse variables and dense variables is very similar, the only difference being that for sparse variables if either the fine or the coarse block is unallocated no flux correction occurs. Flux correction communication cannot trigger allocation. For backwards compatibility, we keep the aliases - ``StartReceiveFluxCorrections`` = ``StartReceiveBoundBufs`` - ``LoadAndSendFluxCorrections`` = ``SendBoundBufs`` - ``ReceiveFluxCorrections`` = ``ReceiveBoundBufs`` - ``SetFluxCorrections`` = ``SetBoundBufs``