NVIDIA RTX 5090 P2P Support Enabled by Patched Driver

  • Updated December 12, 2025
  • Tilda Thiel
  • 34 comments

A patched fork of NVIDIA's open GPU kernel modules driver has successfully enabled peer-to-peer (P2P) communication between multiple RTX 5090 graphics cards, a capability the earlier tinygrad patch failed to deliver on these cards. After switching to the new fork, a system with two RTX 5090s running at PCIe 5.0 x8 showed a significant uplift in P2P bandwidth and latency tests. With P2P enabled, unidirectional bandwidth between the cards increased from approximately 24 GB/s to 28 GB/s, while bidirectional bandwidth surged from around 30 GB/s to nearly 56 GB/s. Most notably, latency plummeted from 14 microseconds to just 0.48 microseconds.
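As a quick sanity check on those figures, the deltas work out to roughly a 17% unidirectional gain, an 87% bidirectional gain, and a ~29x latency reduction (a minimal recomputation from the numbers above):

```python
# Quick arithmetic on the article's dual-RTX-5090 figures (GB/s and microseconds).
uni_before, uni_after = 24.0, 28.0
bi_before, bi_after = 30.0, 56.0
lat_before, lat_after = 14.0, 0.48

uni_gain = (uni_after / uni_before - 1) * 100   # ~16.7% more unidirectional bandwidth
bi_gain = (bi_after / bi_before - 1) * 100      # ~86.7% more bidirectional bandwidth
lat_factor = lat_before / lat_after             # ~29x lower latency

print(f"uni +{uni_gain:.0f}%, bi +{bi_gain:.0f}%, latency {lat_factor:.0f}x lower")
```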

The driver’s P2P support extends beyond the new GPUs. In a more complex test system containing seven total GPUs—including two RTX 5090s, two RTX 4090s, two RTX 3090s, and an RTX A6000—P2P functionality was confirmed between the paired 4090s, the paired 3090s, and between the 3090s and the A6000. While the 5090s and 4090s were connected directly to CPU PCIe lanes, the 3090s and A6000 shared chipset lanes, which introduced a performance penalty. Even with this constraint, enabling P2P provided substantial latency improvements, such as reducing inter-GPU latency between 4090s from 24 microseconds to 5 microseconds. Preliminary inference benchmarks with models like Mistral Large and GLM 4.5 show tangible throughput gains when Tensor Parallelism is combined with the new P2P driver, particularly in configurations that avoid PCIe bottlenecks.
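The working pairings in the seven-GPU test can be summarized as a small lookup table. The sketch below transcribes only the pairs the test explicitly confirmed; the device labels are illustrative, not real device identifiers:

```python
# GPU pairs with confirmed P2P in the seven-GPU test system described above.
# Labels like "4090-a" are illustrative names, not actual device IDs.
P2P_PAIRS = {
    frozenset({"4090-a", "4090-b"}),
    frozenset({"3090-a", "3090-b"}),
    frozenset({"3090-a", "A6000"}),
    frozenset({"3090-b", "A6000"}),
}

def can_p2p(a: str, b: str) -> bool:
    """True if the test reported working P2P between these two cards."""
    return frozenset({a, b}) in P2P_PAIRS

print(can_p2p("4090-a", "4090-b"))  # paired 4090s: confirmed
print(can_p2p("4090-a", "3090-a"))  # cross-generation: not reported working
```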

34 Comments

  1. Wow, that latency drop from 14 microseconds to under half a microsecond is absolutely game-changing for multi-GPU compute tasks. I’ve been wrestling with communication overhead between my 4090s for simulation work, and this kind of optimization is exactly what we need. Makes me wonder if this patched driver could eventually trickle down to benefit older GPU pairings in professional workloads too.

    1. Absolutely, that drop to sub-microsecond latency is indeed a game-changer for simulations and other tightly coupled multi-GPU tasks. The patched driver’s P2P support does extend to other GPUs, as noted in the article’s more complex seven-GPU test, so it’s worth checking the project’s repository for compatibility lists and testing with your 4090s. I’d love to hear if you give it a try and what performance gains you see with your specific workload.

  2. That latency drop from 14 microseconds to 0.48 is absolutely wild—it’s the kind of improvement that could completely change how we design multi-GPU compute tasks. I’ve been wrestling with P2P bottlenecks on my 4090 setup for distributed rendering, so seeing a patched driver unlock this for the 5090s gives me hope for future updates on older cards. Has anyone tried this fork with a mixed 30/40 series setup yet?

    1. I completely agree—that drop to 0.48 microseconds is a game-changer for multi-GPU workloads like your distributed rendering. While the article specifically highlights success with the RTX 5090s, the patched driver’s P2P support is noted to extend to other GPUs, so testing a mixed 30/40 series setup is a logical next step. I’d recommend checking the fork’s repository or community discussions for any user reports on compatibility with older architectures, and please do share what you find if you give it a try!

  3. That latency drop from 14 microseconds to 0.48 is absolutely wild—it makes me wonder how much this could speed up multi-GPU rendering workflows I’ve struggled with in the past. I’m definitely going to keep an eye on this driver fork for my own dual 4090 setup. Has anyone tried this patch with a mixed-generation stack like the article’s test system?

    1. That latency drop really is staggering, and it’s exciting to think about the impact on multi-GPU rendering. The article notes the patched driver’s P2P support extends beyond the 5090s, as it was validated in a complex test system with seven total GPUs of mixed generations, so your dual 4090 setup is a prime candidate. I’d recommend checking the fork’s repository or community discussions for specific user reports on 40-series performance, and please do share what you find if you give it a try!


  4. The patch appears to disable NVLink functionality entirely. It also fails to verify ReBAR availability, incorrectly enabling P2P for my 2080 Ti, which doesn’t support it. This implementation is not very clean.

    1. The intended use is for servers with multiple RTX 4090 or RTX 5090 configurations. Having a 2080 Ti in the system has never been tested, so it behaving strangely is not surprising.

      1. This is a server, so I can also install other cards. However, there may be issues when crossing the QPI, which I haven’t tested. I simply used CUDA_VISIBLE_DEVICES to limit the visible GPUs.

    1. Performance gains generally range from 10 to 50%, or up to near 90% with a good parallelism implementation compared to no P2P. The lower gains likely apply to training, while the higher end is more relevant for inference workloads like vLLM or exl3.

      I could test this on vLLM, but I’m unsure of the flag to disable P2P once it’s enabled at the driver level.
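One knob worth trying here: NCCL honors the `NCCL_P2P_DISABLE` environment variable, which forces it to fall back to its shared-memory transport even when the driver advertises P2P. Whether this also covers vLLM's non-NCCL communication paths is an open question; a minimal sketch:

```python
import os

# NCCL environment knob: NCCL_P2P_DISABLE=1 makes NCCL skip its P2P
# transport and fall back to shared memory through the host. It must be
# set before the NCCL communicator is created, i.e. before launching
# the inference job, not after.
os.environ["NCCL_P2P_DISABLE"] = "1"

print(os.environ["NCCL_P2P_DISABLE"])
```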

    2. I’ve added some results to the post.

      For Mistral Large 2411 at 3.5bpw, using just the two RTX 5090s at 10K context with native and NCCL tensor parallelism (TP):
      * TP disabled: 16 tokens/second
      * TP enabled, no P2P: 16 t/s
      * TP enabled (native), with P2P: 20 t/s
      * TP enabled (NCCL), with P2P: 21 t/s

      For GLM 4.5 at 4bpw, using all seven GPUs at 32K context with native TP (note: this runs slow due to a PCIe bandwidth bottleneck):
      * TP disabled: 16 t/s
      * TP enabled, no P2P: 11 t/s (a performance penalty)
      * TP enabled, with P2P: 16 t/s

For GLM, a model with fewer active parameters spread across many GPUs on PCIe 4.0 x4, tensor parallelism carries a performance penalty unless P2P is enabled.

      In these specific cases, enabling P2P yields a performance improvement between 20% and 45%.
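For reference, recomputing the per-configuration speedups from the tokens/second figures above gives roughly +25%, +31%, and +45%:

```python
# Speedups from enabling P2P, per the benchmark numbers above (tokens/second).
cases = {
    "Mistral Large, native TP": (16, 20),  # no-P2P -> P2P
    "Mistral Large, NCCL TP": (16, 21),
    "GLM 4.5, native TP": (11, 16),
}
for name, (before, after) in cases.items():
    print(f"{name}: +{(after / before - 1) * 100:.0f}%")
```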

1. What is TP? Tensor parallelism? That's what vLLM uses (I haven't run it myself, though). Do you think you might be saturating the PCIe bus?

        1. This is a similar but distinct implementation.

          When using only the RTX 5090s, P2P saturates the bus and provides about 25-30% more performance compared to not using it.

          When using all GPUs together, P2P only works between identical models, not across all cards. This creates a PCIe bottleneck.

  5. It’s odd that my fork works for you when theirs doesn’t. I only rebased their changes to version 580 to support CUDA 13 and added more detailed instructions for enabling IOMMU passthrough mode. I haven’t made any major changes to the patch itself. You might have been on a branch without RTX 5090 support.

    Since exl3 now uses NCCL, it can automatically use P2P functionality, as will vLLM, sglang, TensorRT-LLM, and similar tools.

    I’m curious whether we can get P2P working between GPUs from different generations. That appears to be the only part not functioning yet.

    1. The RTX 3090 does not support NVLink for this configuration. On my system, I can establish P2P between cards within the PLX, with an NVLink bridge across the PLX. Currently, the bandwidth is divided, but it’s still an improvement over having no connection.

    2. I was on the correct tinygrad branch, but the RTX 5090s never worked. Perhaps the missing step was IOMMU passthrough.

      In any case, thank you for the fork. Having the latest version with P2P support is a big help.
