Review

NVIDIA RTX 5090 P2P support enabled by patched driver

  • Updated December 12, 2025
  • Tilda Thiel
  • 34 comments

A patched fork of the open GPU kernel modules successfully enables peer-to-peer (P2P) functionality for multiple NVIDIA RTX 5090 graphics cards, a feature that was problematic in the previous "tinygrad" version. After switching to the new fork, a system with two RTX 5090s running at PCIe 5.0 x8 showed a substantial improvement in P2P bandwidth and latency tests. With P2P enabled, unidirectional bandwidth between the cards rose from about 24 GB/s to 28 GB/s, while bidirectional bandwidth climbed from about 30 GB/s to nearly 56 GB/s. Most striking was the drop in latency from 14 microseconds to just 0.48 microseconds.
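As a back-of-envelope sanity check (my own arithmetic, not from the post), the measured 28 GB/s is close to the practical ceiling of a PCIe 5.0 x8 link: 32 GT/s per lane with 128b/130b line coding gives roughly 31.5 GB/s per direction before protocol overhead.

```python
def pcie_theoretical_gbps(gen_gts, lanes):
    """Per-direction theoretical bandwidth in GB/s for a PCIe link.

    gen_gts: raw transfer rate per lane in GT/s (PCIe 5.0 = 32 GT/s).
    PCIe 3.0+ uses 128b/130b encoding, so ~1.5% of the raw rate is
    line-code overhead; TLP/protocol overhead is ignored here.
    """
    return gen_gts * lanes * 128 / 130 / 8  # GT/s -> GB/s per direction

# PCIe 5.0 x8, as used by each RTX 5090 in the test system
ceiling = pcie_theoretical_gbps(32, 8)  # ~31.5 GB/s per direction
measured_uni = 28.0                     # GB/s with P2P enabled
print(f"theoretical: {ceiling:.1f} GB/s, link efficiency: {measured_uni / ceiling:.0%}")
```

So the P2P path is running at roughly 89% of the wire-level maximum, which is about as good as PCIe copies get.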

The driver's P2P support extends beyond the new GPUs. In a more complex test system with seven GPUs in total (two RTX 5090s, two RTX 4090s, two RTX 3090s, and an RTX A6000), P2P functionality was confirmed between the pair of 4090s, the pair of 3090s, and between the 3090s and the A6000. While the 5090s and 4090s were attached directly to the CPU's PCIe lanes, the 3090s and the A6000 shared chipset lanes, which incurred a performance penalty. Even under that constraint, enabling P2P brought substantial latency improvements, for example cutting inter-GPU latency between the 4090s from 24 microseconds to 5 microseconds. Preliminary inference benchmarks with models such as Mistral Large and GLM 4.5 show measurable throughput gains when tensor parallelism is combined with the new P2P driver, particularly in configurations that avoid PCIe bottlenecks.


34 Comments


  1. The patch appears to disable NVLink functionality entirely. It also fails to verify ReBAR availability, incorrectly enabling P2P for my 2080 Ti, which doesn’t support it. This implementation is not very clean.

    1. The intended use is servers with multiple RTX 4090s or RTX 5090s. A 2080 Ti in the system has never been tested, so it's not surprising that it behaves strangely.

      1. This is a server, so I can also install other cards. However, there may be issues when crossing the QPI, which I haven’t tested. I simply used CUDA_VISIBLE_DEVICES to limit the visible GPUs.

    1. Performance gains generally range from 10% to 50%, or up to nearly 90% with a good parallelism implementation, compared to no P2P. The lower gains likely apply to training, while the higher end is more relevant for inference workloads like vLLM or exl3.

      I could test this on vLLM, but I’m unsure of the flag to disable P2P once it’s enabled at the driver level.
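For NCCL-based stacks, the documented NCCL_P2P_DISABLE environment variable is the usual switch; it forces NCCL off the P2P transport (falling back to shared-memory/host-staged copies) even when the driver exposes P2P. Whether vLLM's own custom all-reduce also honors it is a separate question I have not verified. A minimal sketch:

```python
import os

# NCCL reads this at communicator-init time, so it must be set before
# the framework (vLLM, PyTorch, etc.) first initializes NCCL.
os.environ["NCCL_P2P_DISABLE"] = "1"  # disable NCCL's P2P transport
# ... then start the engine / torch.distributed as usual ...
print(os.environ["NCCL_P2P_DISABLE"])
```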

    2. I’ve added some results to the post.

      For Mistral Large 2411 at 3.5bpw, using just the two RTX 5090s at 10K context with native and NCCL tensor parallelism (TP):
      * TP disabled: 16 tokens/second
      * TP enabled, no P2P: 16 t/s
      * TP enabled (native), with P2P: 20 t/s
      * TP enabled (NCCL), with P2P: 21 t/s

      For GLM 4.5 at 4bpw, using all seven GPUs at 32K context with native TP (note: this runs slowly due to a PCIe bandwidth bottleneck):
      * TP disabled: 16 t/s
      * TP enabled, no P2P: 11 t/s (a performance penalty)
      * TP enabled, with P2P: 16 t/s

      For GLM, which is a model with fewer active parameters and is spread across many GPUs on PCIe 4.0 x4, there is a performance penalty without P2P.

      In these specific cases, enabling P2P yields a performance improvement between 20% and 45%.
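The quoted 20–45% range falls directly out of the tokens-per-second figures above:

```python
def speedup_pct(no_p2p_tps, p2p_tps):
    """Percent throughput gain of the P2P run over the no-P2P run."""
    return (p2p_tps / no_p2p_tps - 1) * 100

# (no-P2P t/s, P2P t/s) pairs quoted above, TP enabled in both runs
cases = {
    "Mistral Large, native TP": (16, 20),  # +25%
    "Mistral Large, NCCL TP":   (16, 21),  # +31%
    "GLM 4.5, native TP":       (11, 16),  # +45%
}
for name, (no_p2p, p2p) in cases.items():
    print(f"{name}: +{speedup_pct(no_p2p, p2p):.0f}%")
```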

      1. What is TP? It refers to tensor parallelism, which is what vLLM uses (I haven’t run it myself, though). Do you think you might be saturating the PCIe bus?

        1. This is a similar but distinct implementation.

          When using only the RTX 5090s, P2P saturates the bus and provides about 25-30% more performance compared to not using it.

          When using all GPUs together, P2P only works between identical models, not across all cards. This creates a PCIe bottleneck.

  2. It’s odd that my fork works for you when theirs doesn’t. I only rebased their changes to version 580 to support CUDA 13 and added more detailed instructions for enabling IOMMU passthrough mode. I haven’t made any major changes to the patch itself. You might have been on a branch without RTX 5090 support.

    Since exl3 now uses NCCL, it can automatically use P2P functionality, as will vLLM, sglang, TensorRT-LLM, and similar tools.

    I’m curious whether we can get P2P working between GPUs from different generations. That appears to be the only part not functioning yet.

    1. The RTX 3090 does not support NVLink for this configuration. On my system, I can establish P2P between cards within the PLX, with an NVLink bridge across the PLX. Currently, the bandwidth is divided, but it’s still an improvement over having no connection.

    2. I was on the correct tinygrad branch, but the RTX 5090s never worked. Perhaps the missing step was IOMMU passthrough.

      In any case, thank you for the fork. Having the latest version with P2P support is a big help.
