We Benchmarked H100 vs L40S for ZK Proofs — Here’s What Scales (and What Doesn’t)
Zero-knowledge (ZK) proof systems are becoming a cornerstone technology for privacy-preserving and scalable computation in blockchain and cryptographic applications. As proof complexity and throughput demands grow, optimizing hardware utilization becomes essential to maintain performance and cost-efficiency — particularly in GPU-accelerated proving pipelines.
We at P2P.org have operated prover hardware for most of the major ZK protocols. With Ethereum moving toward enshrining ZK in the protocol, and L2Beat-style overview projects appearing (https://ethproofs.org/), we wanted to share with the community an example of our research, based on the knowledge we have gathered on the subject.
This study examines GPU utilization strategies for generating ZK proofs, comparing two leading GPU architectures: the NVIDIA H100 and L40S. The main objective is to evaluate whether allocating multiple GPUs to a single proof improves performance more effectively than generating multiple proofs in parallel, each using a single GPU.
Our benchmark is based on Scroll’s open-source ZK prover implementation, deployed on two high-performance hardware platforms. Below are the technical specifications for each setup:
Hardware Specifications
- L40S System:
  - CPU: 2× AMD EPYC 9354 (3.80 GHz)
  - RAM: 2 TB
  - GPU: 8× NVIDIA L40S 48 GB
  - Storage: 4× 4 TB NVMe SSD
  - Network: 2× 10 Gbit NICs
- H100 System:
  - CPU: Intel Xeon 8481C (2.7 GHz, 208 cores)
  - RAM: 1.8 TB
  - GPU: 8× NVIDIA H100 80 GB
  - Storage: 12× 400 GB NVMe SSD
  - Network: 1× 100 Gbit NIC
Using a fixed 8-GPU configuration, we tested two modes: (1) increasing the number of GPUs per proof to measure time reduction, and (2) running multiple proofs concurrently to assess total throughput. This section sets the foundation for analyzing the performance trade-offs, CPU/GPU bottlenecks, and real-world cost-effectiveness of ZK proof generation at scale.
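The two modes can be sketched as launch plans that differ only in how GPUs are assigned to prover processes. This is a minimal illustration, assuming GPU selection via `CUDA_VISIBLE_DEVICES` and a placeholder `./prover` binary standing in for the actual Scroll prover invocation:

```python
def plan_runs(strategy: str, n_gpus: int = 8, prover_cmd: str = "./prover"):
    """Return the (env, command) pairs each strategy would launch.

    `./prover` is a hypothetical placeholder for the real prover binary;
    CUDA_VISIBLE_DEVICES is the standard way to pin a process to GPUs.
    """
    if strategy == "A":  # all GPUs cooperate on one proof
        gpus = ",".join(str(i) for i in range(n_gpus))
        return [({"CUDA_VISIBLE_DEVICES": gpus}, prover_cmd)]
    if strategy == "B":  # one independent proof per GPU, run concurrently
        return [({"CUDA_VISIBLE_DEVICES": str(i)}, prover_cmd)
                for i in range(n_gpus)]
    raise ValueError(f"unknown strategy: {strategy}")

# Strategy B launches 8 concurrent single-GPU provers
for env, cmd in plan_runs("B"):
    print(env["CUDA_VISIBLE_DEVICES"], cmd)
```

In practice each planned run would be started with `subprocess.Popen(cmd, env={**os.environ, **env})` and the processes in strategy B would run side by side.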
Benchmarking ZK Prover Performance: Parallelization vs Dedicated GPUs
To evaluate GPU utilization efficiency in zero-knowledge proof generation, we conducted a series of controlled benchmarks on both hardware setups — L40S and H100 — using 8 GPUs in each case. The goal was to compare two strategies:
- Strategy A: Increasing the number of GPUs used for generating a single proof.
- Strategy B: Running multiple proofs in parallel, with one GPU assigned per proof.
The Scroll open-source prover was used as the testing framework across both systems. Each configuration was run with fixed parameters and measured for prover time, proof throughput (proofs per day), and system resource utilization (CPU, GPU memory, RAM). Below are the summarized results:
L40S Results
| Configuration | Prover Time (s) | Proofs per Day |
|---|---|---|
| 1 GPU on 1 proof | 792 | 109 |
| 2 GPUs on 1 proof | 705 | 122 |
| 4 GPUs on 1 proof | 672 | 128 |
| 8 GPUs on 1 proof | 688 | 125 |
| 8 GPUs, 8 proofs in parallel | 1420 per batch of 8 | 486 total (60.8 per GPU) |
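The throughput column follows directly from the measured prover times. A quick sanity check, reproducing the table's proofs-per-day figures:

```python
SECONDS_PER_DAY = 86_400

def proofs_per_day(prover_time_s: float, concurrent_proofs: int = 1) -> int:
    """Proofs completed per day implied by a measured prover time."""
    return int(concurrent_proofs * SECONDS_PER_DAY / prover_time_s)

# Single-proof L40S configurations from the table above
print(proofs_per_day(792))      # 109 (1 GPU on 1 proof)
print(proofs_per_day(688))      # 125 (8 GPUs on 1 proof)
# Parallel mode: 8 proofs complete every 1420 s
print(proofs_per_day(1420, 8))  # 486 total, i.e. ~60.8 per GPU
```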
H100 Results
| Configuration | Prover Time (s) | Proofs per Day |
|---|---|---|
| 1 GPU on 1 proof | 1047 | 82 |
| 2 GPUs on 1 proof | 892 | 97 |
| 4 GPUs on 1 proof | 824 | 105 |
| 8 GPUs on 1 proof | 803 | 108 |
| 8 GPUs, 8 proofs in parallel | 2400 per batch of 8 | 288 total (36 per GPU) |
These results demonstrate that assigning a single GPU to each proof and executing them in parallel yields significantly higher overall throughput, especially on the L40S system. Surprisingly, H100 performance gains from parallelization were underwhelming, despite its raw power advantage, suggesting suboptimal software utilization or architectural bottlenecks in the current prover setup.
The graph illustrates this: the green line shows the linear scaling we expected from adding GPUs, the blue line shows the result we actually measured as GPUs were added to a single proof's generation, and the red dot marks the throughput of generating 8 ZK proofs simultaneously on the same 8-GPU unit.
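The gap between the green and blue lines can be quantified as parallel efficiency, i.e. the fraction of the ideal n-fold speedup actually achieved. A small sketch using the L40S prover times from the table above:

```python
def speedup(t_1: float, t_n: float) -> float:
    """Observed speedup over the single-GPU baseline."""
    return t_1 / t_n

def efficiency(t_1: float, t_n: float, n: int) -> float:
    """Fraction of the ideal n-fold (green-line) speedup achieved."""
    return speedup(t_1, t_n) / n

# prover time (s) per GPU count, from the L40S table
l40s = {1: 792, 2: 705, 4: 672, 8: 688}
for n, t in l40s.items():
    print(f"{n} GPUs: {speedup(792, t):.2f}x, efficiency {efficiency(792, t, n):.0%}")
```

Efficiency drops from 56% at 2 GPUs to about 14% at 8 GPUs, which is exactly why spreading one proof across the whole machine wastes most of its capacity.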
System Resource Utilization During Proof Generation
In addition to measuring prover time and throughput, we monitored system-level resource usage to better understand the efficiency and scaling behavior of each GPU configuration. Metrics recorded include peak CPU utilization, maximum GPU memory usage, and RAM consumption across different levels of parallelism.
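Peak GPU memory figures like those below are typically obtained by sampling `nvidia-smi` during the run and keeping the maximum per device. A minimal parser for such samples, assuming the common `--query-gpu=index,memory.used --format=csv,noheader,nounits` output shape:

```python
def peak_gpu_memory(csv_samples: list[str]) -> dict[int, int]:
    """Peak memory.used (MiB) per GPU index across repeated
    nvidia-smi CSV samples (index, memory.used per line)."""
    peaks: dict[int, int] = {}
    for sample in csv_samples:
        for line in sample.strip().splitlines():
            idx, used = (int(field) for field in line.split(","))
            peaks[idx] = max(peaks.get(idx, 0), used)
    return peaks

# Two hypothetical polling snapshots of a 2-GPU run
samples = ["0, 18000\n1, 21000", "0, 24576\n1, 20480"]
print(peak_gpu_memory(samples))  # {0: 24576, 1: 21000}
```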
L40S - Resource Metrics
- 1 GPU on 1 proof: 792s — CPU: 45%, GPU Memory: 24 GB, RAM: 180 GB
- 2 GPUs on 1 proof: 705s — CPU: 60%, GPU Memory: 24 GB, RAM: 180 GB
- 4 GPUs on 1 proof: 672s — CPU: 60%, GPU Memory: 24 GB, RAM: 180 GB
- 8 GPUs on 1 proof: 688s — CPU: 45%, GPU Memory: 12 GB, RAM: 180 GB
- 8 GPUs on 8 proofs (parallel): 1420s — CPU: 100%, GPU Memory: 24 GB, RAM: 1300 GB
H100 - Resource Metrics
- 1 GPU on 1 proof: 1047s — CPU: 45%, GPU Memory: 46 GB, RAM: 180 GB
- 2 GPUs on 1 proof: 892s — CPU: 60%, GPU Memory: 46 GB, RAM: 180 GB
- 4 GPUs on 1 proof: 824s — CPU: 60%, GPU Memory: 24 GB, RAM: 180 GB
- 8 GPUs on 1 proof: 803s — CPU: 60%, GPU Memory: 12 GB, RAM: 180 GB
- 8 GPUs on 8 proofs (parallel): 2400s — CPU: 100%, GPU Memory: 46 GB, RAM: 1300 GB
The results indicate that running proofs in parallel leads to near full CPU saturation and significantly increased RAM consumption. This suggests that CPU becomes a limiting factor under heavy GPU parallelism unless paired with a properly scaled memory and compute environment.
Aggregate GPU memory usage scales linearly with the number of concurrent proofs, since each GPU holds a full proof state (24 GB per GPU on the L40S, 46 GB on the H100). Per-proof system RAM also becomes substantial with 8 parallel jobs, at roughly 160 GB per proof on both platforms.
The RAM usage remains constant at 180 GB across all configurations (1, 2, 4, and 8 GPUs). This suggests that the memory allocation for the proof generation process is not dependent on the number of GPUs involved.
It is likely that the proving software either preallocates the required system memory at the start of the process or that the computational workload is primarily offloaded to the GPU, resulting in negligible variation in RAM consumption.
This behavior indicates that system RAM is not a limiting factor in the scaling of proof generation on the H100 hardware — at least when generating a single proof, regardless of GPU count.
When analyzing GPU memory usage on the H100 for single-proof generation, a clear trend emerges: GPU memory consumption decreases as more GPUs are allocated to the task.
With 1 GPU, the memory usage peaks at 46 GB, but as the workload is distributed across 2, 4, and eventually 8 GPUs, the consumption per GPU drops to 12 GB in the 8-GPU configuration.
This behavior is consistent with the expectation that dividing the computation across more GPUs reduces per-device memory pressure, as intermediate states and computational graphs are split and processed concurrently.
However, despite the lower memory usage, the overall proving time did not improve significantly, suggesting that GPU memory was not the bottleneck. This reinforces the observation that parallel GPU allocation alone is not sufficient to accelerate ZK proof generation without corresponding improvements in software or CPU coordination.
Conclusion
This benchmark study evaluated the performance and hardware efficiency of generating zero-knowledge proofs using two enterprise-grade GPU configurations: the NVIDIA H100 and NVIDIA L40S. The analysis was conducted using Scroll's open-source prover, with a focus on two key strategies: scaling a single proof across multiple GPUs versus running multiple proofs in parallel.
The results demonstrate that parallel generation of proofs using individual GPUs yields significantly better throughput than assigning all GPUs to a single proof process. This effect is especially visible on the L40S platform, where parallel execution nearly quadrupled the number of proofs generated per day compared to the single-proof setup.
Surprisingly, the H100 — despite its superior hardware specs — underperformed in this scenario. Its single-proof generation times were longer than L40S in all configurations, and parallel execution on H100 also delivered lower throughput, indicating that software bottlenecks or suboptimal utilization patterns may limit its current viability for ZK workloads.
Additionally, we found that system RAM and GPU memory were not primary limiting factors in most configurations. RAM usage remained constant during single-proof runs, while GPU memory usage decreased as GPU count increased. Instead, CPU saturation and parallel processing coordination appear to be more critical for maximizing performance in proof generation.
In conclusion, GPU parallelism for a single proof does not scale efficiently beyond a certain point. ZK infrastructure teams aiming to improve throughput should prioritize software optimization, better CPU/GPU coordination, and parallelization across proofs rather than within a single one.