
Faster MPI Integration in Dagger

  • Written by Yan Guimarães

Hello everyone! I hope you're all doing well as you continue developing your Julia code. I am writing this post to share my progress with the Google Summer of Code project, specifically regarding the MPI integration in Dagger. I would also like to share our expectations for the project as we approach the end of summer.

I would like to begin by expressing my genuine gratitude to my incredible mentors, @fda-tome and @jpsamaroo. They consistently dedicated time to meeting with me every week and to discussing various topics over Slack. Without their support, I wouldn't have been able to make the progress on my Julia code that we achieved together.

I look forward to continuing to work with them. Thank you for this opportunity and your mentorship!

Development so far

Send/Recv in-place

Initially, the MPI implementation in Dagger relied on sending and receiving serialized objects. To improve performance, we established a foundation for in-place MPI operations. We implemented the function supports_inplace_mpi(value), which checks whether the value is a DenseArray and whether its element type is an isbitstype. This ensures that the data is stored contiguously in memory and can be passed to MPI directly as a buffer.
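As a rough illustration of that check (a minimal sketch, not necessarily Dagger's exact code):

```julia
# Sketch: dense, contiguous storage plus isbits elements means the array
# can be handed to MPI as a raw buffer, without serialization.
supports_inplace_mpi(value) = false
supports_inplace_mpi(value::DenseArray{T}) where {T} = isbitstype(T)
```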

To leverage contiguous memory for improved performance, we created a new function, send_yield!, which takes an inplace argument. When it is set to true, the function uses the supports_inplace_mpi check to determine whether the data can be sent directly with MPI.Isend; when it is false, the data is serialized and sent with MPI.isend.
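A sketch of what such a send path can look like, assuming the supports_inplace_mpi helper above (the argument names and details are illustrative, not Dagger's actual send_yield!):

```julia
using MPI

function send_yield!(value, comm::MPI.Comm, dest::Integer, tag::Integer; inplace::Bool=true)
    if inplace && supports_inplace_mpi(value)
        req = MPI.Isend(value, comm; dest, tag)   # raw contiguous buffer, no serialization
    else
        req = MPI.isend(value, comm; dest, tag)   # serialize an arbitrary Julia object
    end
    while !MPI.Test(req)
        yield()                                   # let other tasks run while the send completes
    end
    return value
end
```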

A significant challenge for in-place receiving was identifying the incoming data's type before the actual buffer arrives. To address this, we introduced the recv_yield! function, which takes a buffer argument. This lets the receiver predefine the buffer and verify whether it can be used in-place, mirroring the send side.
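The receive side can mirror the send side, roughly as follows (again a sketch under the same assumptions):

```julia
using MPI

function recv_yield!(buffer, comm::MPI.Comm, src::Integer, tag::Integer)
    if supports_inplace_mpi(buffer)
        req = MPI.Irecv!(buffer, comm; source=src, tag)   # receive directly into the buffer
        while !MPI.Test(req)
            yield()
        end
        return buffer
    else
        return MPI.recv(comm; source=src, tag)            # fall back to a serialized object
    end
end
```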

All the code mentioned above can be found in commit e252c41, along with demo tests for the MPI changes.

Bringing in-place logic to out-of-place send/recv

We enhanced Dagger's communication mechanisms by integrating the in-place logic for Arrays and SparseArrays. This work produced PR#624, in which we developed a new struct that encapsulates the message's metadata. The struct is serialized and sent first; upon reception, the receiving end uses the metadata to allocate a buffer of the correct type and size, enabling direct, in-place reception of the main message.
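Conceptually, the handshake looks something like the sketch below; the struct name and fields are made up for illustration, and the real PR uses Dagger's own types:

```julia
using MPI

# Hypothetical metadata sent ahead of the payload (illustrative only).
struct BufferMetadata
    eltype::DataType
    dims::Dims
end

# Sender: ship the (small, serialized) metadata, then the contiguous payload in-place.
function send_with_metadata(A::DenseArray, comm, dest, tag)
    MPI.send(BufferMetadata(eltype(A), size(A)), comm; dest, tag)
    MPI.Send(A, comm; dest, tag)
end

# Receiver: use the metadata to allocate a matching buffer, then receive in-place.
function recv_with_metadata(comm, src, tag)
    meta = MPI.recv(comm; source=src, tag)
    buf = Array{meta.eltype}(undef, meta.dims...)
    MPI.Recv!(buf, comm; source=src, tag)
    return buf
end
```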

A quick benchmark of a Dagger FFT implementation on yg/faster-mpi showed roughly a 2× performance improvement over jps/dev2025, reducing runtime from 2 seconds to under 1 second. This is a strong indication that the in-place logic is already making a meaningful impact on real workloads.

Next milestones

Building on the significant performance improvements from in-place MPI communication, our next focus is to set up a benchmarking environment on a larger parallel machine, so we can test with a much larger number of threads and much larger data.

Additionally, we aim to complete and rigorously test the RMA windows implementation for broadcast by the end of this week. The implementation uses one window to hold the metadata and another for the data payload. This first step will serve as a crucial introduction to managing windows within Dagger's communication layer.
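As a very rough sketch of the two-window idea using MPI.jl's one-sided API (everything below is purely illustrative; the exact calls and structure in Dagger may end up quite different):

```julia
using MPI

function rma_broadcast(A::Vector{Float64}, comm::MPI.Comm; root::Integer=0)
    rank = MPI.Comm_rank(comm)

    # Window 1: metadata (here, just the payload length).
    meta = rank == root ? Int[length(A)] : Int[0]
    meta_win = MPI.Win_create(meta, comm)
    MPI.Win_fence(0, meta_win)
    rank != root && MPI.Get!(meta, meta_win; rank=root)
    MPI.Win_fence(0, meta_win)

    # Window 2: the data payload, fetched in-place into a matching buffer.
    buf = rank == root ? A : Vector{Float64}(undef, meta[1])
    data_win = MPI.Win_create(buf, comm)
    MPI.Win_fence(0, data_win)
    rank != root && MPI.Get!(buf, data_win; rank=root)
    MPI.Win_fence(0, data_win)

    MPI.free(meta_win); MPI.free(data_win)
    return buf
end
```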

Next, we plan to extend RMA windows to general Dagger task execution. The goal is to apply the same strategy used for broadcast to task creation via @spawn and spawn. This is more of an idea that we will test to see if it is worthwhile, as it may not be beneficial to perform a collective operation for every task to allocate windows.

Finally, good documentation is a priority. Once base operations are ready to merge, we will create clear documentation to support usage, future contributions, and further development.
