Parallel Minimum Spanning Tree Library (Online)
Linxuan Ma, Joseph Wan
Summary
We are going to implement a parallel graph processing library in OpenMP for computing minimum spanning trees dynamically with low-latency graph updates. By parallelizing algorithms like Prim's and Boruvka's, we aim to trade strict optimality of the generated MST for performance. As an extension, we plan to implement a CUDA-based feature to generate diagnostic and analytical information on the graph objects.
Background
Computing a minimum spanning tree (MST) is a fundamental problem crucial for tasks like network routing, clustering, and pathfinding. In many real-world applications, graphs are dynamic and experience frequent updates. It is computationally expensive to recompute the MST for every update, yet popular MST algorithms like Prim’s assume a static graph as input. Worse, most existing MST algorithms are sequential and cannot take advantage of a parallel system. To resolve this problem, our project aims to create a parallel library that handles dynamic graph updates while maintaining a close-to-optimal MST. We will accomplish this by parallelizing standard MST algorithms such as Prim's and Boruvka's using OpenMP.
The two primary strategies for extracting parallelism from this problem are parallel vertex operations and graph partitioning. In our project, we explore both options by implementing these techniques as generic parallel primitives, which we then specialize into parallelized versions of MST algorithms.
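To make the vertex-operation primitive concrete, here is a minimal sketch of one Boruvka-style step: each component finds its minimum-weight outgoing edge, with threads scanning disjoint slices of the edge list into thread-local minima. All names (`Edge`, `min_edge_per_component`) are illustrative, not part of any existing API; the pragmas compile to sequential code if OpenMP is disabled.

```cpp
#include <vector>
#include <cstddef>

struct Edge { int u, v; double w; };

// comp[v] = component id of vertex v. For each component, return the index
// of its minimum-weight outgoing edge (-1 if none). Threads accumulate into
// thread-local minima, then merge them under a critical section.
std::vector<int> min_edge_per_component(const std::vector<Edge>& edges,
                                        const std::vector<int>& comp,
                                        int num_components) {
    std::vector<int> best(num_components, -1);
    #pragma omp parallel
    {
        std::vector<int> local(num_components, -1);
        #pragma omp for nowait
        for (long i = 0; i < (long)edges.size(); ++i) {
            int cu = comp[edges[i].u], cv = comp[edges[i].v];
            if (cu == cv) continue;  // internal edge: skip
            auto relax = [&](int c) {
                if (local[c] == -1 || edges[i].w < edges[local[c]].w)
                    local[c] = (int)i;
            };
            relax(cu); relax(cv);
        }
        #pragma omp critical  // merge this thread's minima into the result
        for (int c = 0; c < num_components; ++c)
            if (local[c] != -1 &&
                (best[c] == -1 || edges[local[c]].w < edges[best[c]].w))
                best[c] = local[c];
    }
    return best;
}
```

The thread-local buffers avoid fine-grained synchronization on the shared `best` array at the cost of one merge pass per thread; a compare-and-swap scheme would be an alternative design point.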
The Challenge
- Workload: Graph traversals typically have poor spatial locality because vertices can have different degrees and an edge can connect two vertices that are scattered in memory. Moreover, graphs have uneven connectivity and complex dependencies, which makes parallelization difficult.
- Constraints: When using a graph partitioning strategy to compute local sub-solutions, combining these local structures on request will likely result in a high communication to computation ratio.
- Implementation: Choosing the right granularity for distributing work over the graph is difficult (too small: solution quality suffers, since the MST will be too local and fail to capture global structure; too big: resources are underutilized). Another challenge is finding a partitioning or scheduling heuristic that enables parallelism without degrading solution quality too much.
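One common mitigation for the locality problem above is to store adjacency lists contiguously in a compressed-sparse-row (CSR) layout, so that traversing a vertex's neighbors is a sequential scan. This is a sketch under our own naming (`CSRGraph`, `build_csr`), not a fixed design decision:

```cpp
#include <vector>
#include <utility>

struct CSRGraph {
    std::vector<int> offsets;   // offsets[v]..offsets[v+1]: neighbor range of v
    std::vector<int> neighbors; // all adjacency lists, stored back-to-back
};

// Build a CSR graph from an undirected edge list of (u, v) pairs.
CSRGraph build_csr(int n, const std::vector<std::pair<int,int>>& edges) {
    CSRGraph g;
    g.offsets.assign(n + 1, 0);
    for (auto& e : edges) {            // count degree of each vertex
        ++g.offsets[e.first + 1];
        ++g.offsets[e.second + 1];
    }
    for (int v = 0; v < n; ++v)        // prefix sum -> range starts
        g.offsets[v + 1] += g.offsets[v];
    g.neighbors.resize(g.offsets[n]);
    std::vector<int> cursor(g.offsets.begin(), g.offsets.end() - 1);
    for (auto& e : edges) {            // fill both directions
        g.neighbors[cursor[e.first]++]  = e.second;
        g.neighbors[cursor[e.second]++] = e.first;
    }
    return g;
}
```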
Resources
We will use the GHC and PSC machines to test our shared-memory OpenMP implementations on different numbers of threads. We plan to build the core graph processing library and MST algorithms from scratch using C++.
We are planning to reference the following resources:
Goals (Plan to Achieve)
- Implement a vertex-parallel strategy: use OpenMP to parallelize over the vertices of the graph, processing multiple vertex contractions and edge relaxations concurrently.
- Implement a graph partitioning strategy: partition the graph into subgraphs and process each in parallel. We will obtain a local solution/data structure for each subgraph, then combine them on request to form the overall solution.
- Achieve 4x speedup on the GHC machines with 8 threads (on the baseline graph, a hypothetical graph with uniform connection patterns).
- Achieve speedup over the sequential implementation on ANY graph (including adversarial inputs).
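The partitioning goal above can be sketched with a filter-and-merge scheme over the edge list: each chunk keeps only its local spanning forest (any edge rejected locally is the heaviest on some cycle, so by the cycle property it cannot be in the global MST), and a final Kruskal pass over the survivors finishes the job. All names here (`WEdge`, `partitioned_mst_weight`) are our illustrative choices, assuming a static connected graph:

```cpp
#include <vector>
#include <algorithm>
#include <numeric>

struct WEdge { int u, v; double w; };

struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool unite(int a, int b) {          // returns false if already connected
        a = find(a); b = find(b);
        if (a == b) return false;
        parent[a] = b;
        return true;
    }
};

// Kruskal restricted to `edges`; returns the spanning-forest edges.
std::vector<WEdge> local_forest(std::vector<WEdge> edges, int n) {
    std::sort(edges.begin(), edges.end(),
              [](const WEdge& a, const WEdge& b) { return a.w < b.w; });
    UnionFind uf(n);
    std::vector<WEdge> kept;
    for (auto& e : edges)
        if (uf.unite(e.u, e.v)) kept.push_back(e);
    return kept;
}

double partitioned_mst_weight(const std::vector<WEdge>& edges, int n, int parts) {
    // Phase 1: filter each chunk independently (parallel across chunks).
    std::vector<std::vector<WEdge>> survivors(parts);
    long chunk = ((long)edges.size() + parts - 1) / parts;
    #pragma omp parallel for
    for (int p = 0; p < parts; ++p) {
        long lo = p * chunk, hi = std::min<long>(lo + chunk, edges.size());
        if (lo < hi)
            survivors[p] = local_forest(
                std::vector<WEdge>(edges.begin() + lo, edges.begin() + hi), n);
    }
    // Phase 2: merge survivors and finish with one more Kruskal pass.
    std::vector<WEdge> merged;
    for (auto& s : survivors) merged.insert(merged.end(), s.begin(), s.end());
    double total = 0;
    for (auto& e : local_forest(merged, n)) total += e.w;
    return total;
}
```

On a static graph this filter is exact; the interesting open question for our project is how much of the filtered state can be reused across dynamic updates rather than recomputed.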
Goals (Hope to Achieve)
- CUDA extension to generate diagnostic/analytical information on graph objects in our library.
- Match the MST-generation performance reported in the “A Shared-Memory Algorithm…” paper using our implementation built on parallel graph primitives.
- The library should automatically choose the best heuristic based on analytical information on the graph.
Platform of Choice
We plan to choose OpenMP as our platform. It is a good fit because graph processing requires cheap communication and heterogeneous control flow, which OpenMP’s lightweight threads handle well. We plan on using a shared-memory machine for low communication cost and cache friendliness.
Schedule
- By April 3: vertex-based parallelism, baseline & specialized graph data for benchmarking
- By April 6: graph partitioning primitives, graph analytics with CUDA
- By April 12: fully implement Prim’s and Boruvka's, start on milestone report
- By April 18: refinement & optimization, benchmark baseline for final report
- By April 22: optimize, run on GHC & collect speedup & cache data, start on final report
- By April 27: run on PSC, improve scaling, finish final report