C++

RapidsMPF exposes a full C++ API for building high-performance distributed GPU workloads without a Python runtime. The C++ layer is the foundation on which the Python bindings are built.

The C++ API reference is available at docs.rapids.ai/api/librapidsmpf/nightly.

Coverage

The C++ API provides access to all core RapidsMPF subsystems:

  • Communicator — MPI and UCXX backends for inter-process communication.

  • Shuffler — Out-of-core, distributed payload shuffle service.

  • Streaming Engine — Asynchronous multi-GPU pipeline with Channels, Actors, and Messages.

  • Memory — BufferResource, spilling, pinned memory, and packed data utilities.

  • Config — Configuration options and environment-variable parsing.

Shuffle Service

See Shuffle Architecture for an in-depth explanation of the shuffle design.

rrun: Distributed Launcher

RapidsMPF includes rrun, a lightweight launcher that eliminates the MPI dependency for multi-GPU workloads. See Streaming execution for more on the programming model.

Build rrun

cd cpp/build
cmake --build . --target rrun

Single-Node Launch

# Launch 2 ranks on the local node
./tools/rrun -n 2 ./benchmarks/bench_comm -C ucxx -O all-to-all

# With verbose output and specific GPUs
./tools/rrun -v -n 4 -g 0,1,2,3 ./benchmarks/bench_comm -C ucxx
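When scripting launches, it can help to derive the rank count from the GPU list so the two stay in sync. The sketch below is a dry-run helper that only prints an rrun command line. The function name build_rrun_cmd is hypothetical, it uses only the -n and -g flags shown above, and it assumes one rank per listed GPU, which matches the second example but is not something rrun is known to require.

```shell
#!/bin/sh
# Dry-run helper: prints an rrun command line instead of executing it.
# Only the -n and -g flags from the examples above are used.

# build_rrun_cmd GPU_LIST PROGRAM [ARGS...]   (hypothetical helper)
# GPU_LIST is a comma-separated list such as "0,1,2,3".
build_rrun_cmd() {
    gpus=$1
    shift
    # Keep only the commas; the entry count is commas + 1.
    commas=$(printf '%s' "$gpus" | tr -cd ',')
    nranks=$(( ${#commas} + 1 ))
    # Assumption: one rank per listed GPU.
    printf './tools/rrun -n %d -g %s' "$nranks" "$gpus"
    # Append the program and its arguments unchanged.
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

build_rrun_cmd 0,1,2,3 ./benchmarks/bench_comm -C ucxx
# prints: ./tools/rrun -n 4 -g 0,1,2,3 ./benchmarks/bench_comm -C ucxx
```

To launch for real, the printed line can be reviewed and then run from the build directory, as in the examples above.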