<p>Zero-knowledge (ZK) proof systems are becoming a cornerstone technology for privacy-preserving and scalable computation in blockchain and cryptographic applications. As proof complexity and throughput demands grow, optimizing hardware utilization becomes essential to maintain performance and cost-efficiency — particularly in GPU-accelerated proving pipelines.</p><p>We at <a href="http://p2p.org/?ref=p2p.org"><u>P2P.org</u></a> have participated in most of the major ZK protocols, operating a variety of ZK prover hardware along the way. With Ethereum moving towards enshrining ZK in the protocol, and L2Beat-like overview projects such as <a href="https://ethproofs.org/">ethproofs.org</a> popping up, we wanted to share with the community an example of our research, based on the knowledge we have gathered on the subject.</p><p>This study examines GPU utilization strategies for generating ZK proofs, comparing two leading GPU architectures: the <strong>NVIDIA H100</strong> and <strong>L40S</strong>. The main objective is to evaluate whether allocating multiple GPUs to a <em>single</em> proof improves performance more effectively than generating <em>multiple proofs in parallel</em>, each using a single GPU.</p><p>Our benchmark is based on Scroll’s open-source ZK prover implementation, deployed on two high-performance hardware platforms. Below are the technical specifications for each setup:</p><h3 id="hardware-specifications"><strong>Hardware Specifications</strong></h3><ul><li><strong>L40S System:</strong><ul><li>CPU: 2× AMD EPYC 9354 (3.80 GHz)</li><li>RAM: 2 TB</li><li>GPU: 8× NVIDIA L40S 48 GB</li><li>Storage: 4× 4 TB NVMe SSD</li><li>Network: 2× 10 Gbit NICs</li></ul></li><li><strong>H100 System:</strong><ul><li>CPU: Intel Xeon 8481C (2.7 GHz, 208 cores)</li><li>RAM: 1.8 TB</li><li>GPU: 8× NVIDIA H100 80 GB</li><li>Storage: 12× 400 GB NVMe SSD</li><li>Network: 1× 100 Gbit NIC</li></ul></li></ul><p>Using a fixed 8-GPU configuration, we tested two modes: (1) increasing the number of GPUs per proof to measure time reduction, and (2) running multiple proofs concurrently to assess total throughput. This section sets the foundation for analyzing the performance trade-offs, CPU/GPU bottlenecks, and real-world cost-effectiveness of ZK proof generation at scale.</p> <!--kg-card-begin: html--> <section id="experiment"> <h2>Benchmarking ZK Prover Performance: Parallelization vs Dedicated GPUs</h2> <p> To evaluate GPU utilization efficiency in zero-knowledge proof generation, we conducted a series of controlled benchmarks on both hardware setups — L40S and H100 — using 8 GPUs in each case. The goal was to compare two strategies: </p> <ul> <li><strong>Strategy A:</strong> Increasing the number of GPUs used for generating a single proof.</li> <li><strong>Strategy B:</strong> Running multiple proofs in parallel, with one GPU assigned per proof.</li> </ul> <p> The Scroll open-source prover was used as the testing framework across both systems. Each configuration was run with fixed parameters and measured for prover time, proof throughput (proofs per day), and system resource utilization (CPU, GPU memory, RAM). </p>
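<p> For illustration, the sketch below (Python) shows the scheduling pattern behind Strategy B: one prover process per GPU, pinned with <code>CUDA_VISIBLE_DEVICES</code>. The <code>./prover</code> command and its arguments are placeholders rather than the real Scroll prover invocation; the point is only how the eight independent jobs are launched and timed. </p>
<pre><code class="language-python">import os
import subprocess
import time

# Placeholder command: substitute the actual prover binary and task file.
PROVER_CMD = ["./prover", "--task", "chunk_task.json"]
NUM_GPUS = 8

def launch(gpu_id: int) -> subprocess.Popen:
    """Start one prover process that sees only a single GPU."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin this process to one GPU
    return subprocess.Popen(PROVER_CMD, env=env)

start = time.time()
procs = [launch(gpu) for gpu in range(NUM_GPUS)]  # Strategy B: 8 proofs, 1 GPU each
for p in procs:
    p.wait()
elapsed = time.time() - start
print(f"8 concurrent proofs finished in {elapsed:.0f} s, "
      f"i.e. {8 * 86400 / elapsed:.0f} proofs/day on this machine")</code></pre>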
<p> Below are the summarized results: </p> <h3>L40S Results</h3> <table border="1" cellpadding="8" cellspacing="0"> <thead> <tr> <th>Configuration</th> <th>Prover Time (s)</th> <th>Proofs per Day</th> </tr> </thead> <tbody> <tr> <td>1 GPU on 1 proof</td> <td>792</td> <td>109</td> </tr> <tr> <td>2 GPUs on 1 proof</td> <td>705</td> <td>122</td> </tr> <tr> <td>4 GPUs on 1 proof</td> <td>672</td> <td>128</td> </tr> <tr> <td>8 GPUs on 1 proof</td> <td>688</td> <td>125</td> </tr> <tr> <td>8 GPUs, 8 proofs in parallel</td> <td>1420 (wall-clock for 8 concurrent proofs)</td> <td>486 total (60.8 per GPU)</td> </tr> </tbody> </table> <h3>H100 Results</h3> <table border="1" cellpadding="8" cellspacing="0"> <thead> <tr> <th>Configuration</th> <th>Prover Time (s)</th> <th>Proofs per Day</th> </tr> </thead> <tbody> <tr> <td>1 GPU on 1 proof</td> <td>1047</td> <td>82</td> </tr> <tr> <td>2 GPUs on 1 proof</td> <td>892</td> <td>97</td> </tr> <tr> <td>4 GPUs on 1 proof</td> <td>824</td> <td>105</td> </tr> <tr> <td>8 GPUs on 1 proof</td> <td>803</td> <td>108</td> </tr> <tr> <td>8 GPUs, 8 proofs in parallel</td> <td>2400 (wall-clock for 8 concurrent proofs)</td> <td>288 total (36 per GPU)</td> </tr> </tbody> </table> <p> These results demonstrate that assigning a single GPU to each proof and executing them in parallel yields significantly higher overall throughput, especially on the L40S system. Surprisingly, H100 performance gains from parallelization were underwhelming, despite its raw power advantage, suggesting suboptimal software utilization or architectural bottlenecks in the current prover setup. </p> <p>On the graph below, the green line shows the efficiency we expected to gain by adding GPUs, the blue line shows the results we actually obtained as GPUs were added to a single proof, and the red dot marks the generation of 8 ZK proofs simultaneously on the same 8-GPU machine.</p> </section> <!--kg-card-end: html--> <figure class="kg-card kg-image-card"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcoVS1Xsgm1Lq97e3t799HETaxKKGMdmtPpq7uiVSZ3UJY6GjBlpVk5ywKtq6O-k-O6XFB_4o8cM7Qdp5qozTAVkqlFxl1ixvrS6TWXCZf44CFSzmW_MZgZjsmyijWedj_ds_sP?key=-01FgYzuJbeNBpcm1OIScA" class="kg-image" alt loading="lazy" width="1600" height="954"></figure><h2 id="system-resource-utilization-during-proof-generation"><strong>System Resource Utilization During Proof Generation</strong></h2><p>In addition to measuring prover time and throughput, we monitored system-level resource usage to better understand the efficiency and scaling behavior of each GPU configuration. Metrics recorded include peak CPU utilization, maximum GPU memory usage, and RAM consumption across different levels of parallelism.</p>
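<p>As a reference point, below is a minimal sketch of how such peaks can be captured, assuming <code>psutil</code> and the standard <code>nvidia-smi</code> query interface are available; it is illustrative only and is not the harness used to collect the numbers that follow.</p>
<pre><code class="language-python">import subprocess

import psutil  # third-party: pip install psutil

def gpu_mem_used_mib():
    """Per-GPU memory usage in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True)
    return [int(line) for line in out.strip().splitlines()]

peak_cpu = peak_ram_gb = peak_gpu_gb = 0.0
for _ in range(60):  # sample once per second; in practice, for the whole prover run
    peak_cpu = max(peak_cpu, psutil.cpu_percent(interval=1))  # blocks for 1 s
    peak_ram_gb = max(peak_ram_gb, psutil.virtual_memory().used / 2**30)
    peak_gpu_gb = max(peak_gpu_gb, max(gpu_mem_used_mib()) / 1024)

print(f"peak CPU {peak_cpu:.0f}%, peak RAM {peak_ram_gb:.0f} GB, "
      f"peak GPU memory {peak_gpu_gb:.0f} GB per device")</code></pre>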
<h3 id="l40sresource-metrics"><strong>L40S - Resource Metrics</strong></h3><ul><li><strong>1 GPU on 1 proof:</strong> 792s — CPU: 45%, GPU Memory: 24 GB, RAM: 180 GB</li><li><strong>2 GPUs on 1 proof:</strong> 705s — CPU: 60%, GPU Memory: 24 GB, RAM: 180 GB</li><li><strong>4 GPUs on 1 proof:</strong> 672s — CPU: 60%, GPU Memory: 24 GB, RAM: 180 GB</li><li><strong>8 GPUs on 1 proof:</strong> 688s — CPU: 45%, GPU Memory: 12 GB, RAM: 180 GB</li><li><strong>8 GPUs on 8 proofs (parallel):</strong> 1420s — CPU: 100%, GPU Memory: 24 GB, RAM: 1300 GB</li></ul><h3 id="h100resource-metrics"><strong>H100 - Resource Metrics</strong></h3><ul><li><strong>1 GPU on 1 proof:</strong> 1047s — CPU: 45%, GPU Memory: 46 GB, RAM: 180 GB</li><li><strong>2 GPUs on 1 proof:</strong> 892s — CPU: 60%, GPU Memory: 46 GB, RAM: 180 GB</li><li><strong>4 GPUs on 1 proof:</strong> 824s — CPU: 60%, GPU Memory: 24 GB, RAM: 180 GB</li><li><strong>8 GPUs on 1 proof:</strong> 803s — CPU: 60%, GPU Memory: 12 GB, RAM: 180 GB</li><li><strong>8 GPUs on 8 proofs (parallel):</strong> 2400s — CPU: 100%, GPU Memory: 46 GB, RAM: 1300 GB</li></ul><p>The results indicate that running proofs in parallel leads to near full CPU saturation and significantly increased RAM consumption. This suggests that CPU becomes a limiting factor under heavy GPU parallelism unless paired with a properly scaled memory and compute environment.</p><p>While total GPU memory usage scales linearly with the number of concurrent proofs, aggregate RAM consumption also becomes substantial (around 1.3 TB) when 8 parallel jobs are running, and the per-proof GPU memory footprint is especially high on H100 hardware (46 GB versus 24 GB on the L40S).</p><figure class="kg-card kg-image-card"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfEU47ZweGZ8HObSGwZGOiSyfHcc8nLrFFwCuUFLXpvJVczCMo3ZW3xa4gbR-hIpRlZ84lN26DjwGlvGRdU24Oz82T0ZoeTsbn3vaJfO6zFLDxMyKEPmxKNa18WEDTow6mv3Z3FVg?key=-01FgYzuJbeNBpcm1OIScA" class="kg-image" alt loading="lazy" width="1580" height="980"></figure><p>The RAM usage remains constant at <strong>180 GB</strong> across all configurations (1, 2, 4, and 8 GPUs).
This suggests that the memory allocation for the proof generation process is not dependent on the number of GPUs involved.</p><p>It is likely either that the proving software <strong>preallocates the required system memory</strong> at the start of the process, or that the <strong>computational workload is primarily offloaded to the GPU</strong>, resulting in negligible variation in RAM consumption.</p><p>This behavior indicates that <strong>system RAM is not a limiting factor</strong> in the scaling of proof generation on the H100 hardware — at least when generating a single proof, regardless of GPU count.</p><figure class="kg-card kg-image-card"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeCAFQTGwDsBk0j9joRf-E6y1-yTtW9JfwQmjqVto88GnYU5W9g1ctGUR5sJq7uX2I_qBsTGnR6T6fUpghOFvMQE-uA1nwh5Pnuet68ef2mtyCQMVswpY-oxQsVsZQU5qe8xJIrEQ?key=-01FgYzuJbeNBpcm1OIScA" class="kg-image" alt loading="lazy" width="1576" height="980"></figure><p>When analyzing GPU memory usage on the H100 for single-proof generation, a clear trend emerges: <strong>GPU memory consumption decreases as more GPUs are allocated to the task</strong>.</p><p>With 1 GPU, the memory usage peaks at <strong>46 GB</strong>, but as the workload is distributed across 2, 4, and eventually 8 GPUs, the consumption per GPU drops to <strong>12 GB</strong> in the 8-GPU configuration.</p><p>This behavior is consistent with the expectation that dividing the computation across more GPUs reduces per-device memory pressure, as intermediate states and computational graphs are split and processed concurrently.</p><p>However, despite the lower memory usage, the overall proving time did not improve significantly, suggesting that GPU memory was not the bottleneck. This reinforces the observation that <strong>parallel GPU allocation alone is not sufficient to accelerate ZK proof generation</strong> without corresponding improvements in software or CPU coordination.</p><h2 id="conclusion"><strong>Conclusion</strong></h2><p>This benchmark study evaluated the performance and hardware efficiency of generating zero-knowledge proofs using two enterprise-grade GPU configurations: the <strong>NVIDIA H100</strong> and <strong>NVIDIA L40S</strong>. The analysis was conducted using Scroll's open-source prover, with a focus on two key strategies: scaling a single proof across multiple GPUs versus running multiple proofs in parallel.</p><p>The results demonstrate that <strong>parallel generation of proofs using individual GPUs</strong> yields significantly better throughput than assigning all GPUs to a single proof process. This effect is especially visible on the L40S platform, where parallel execution nearly quadrupled the number of proofs generated per day compared to the single-proof setup.</p><p>Surprisingly, the H100 — despite its superior hardware specs — underperformed in this scenario. Its single-proof generation times were longer than those of the L40S in all configurations, and parallel execution on H100 also delivered lower throughput, indicating that software bottlenecks or suboptimal utilization patterns may limit its current viability for ZK workloads.</p><p>Additionally, we found that <strong>system RAM and GPU memory were not primary limiting factors</strong> in most configurations. RAM usage remained constant during single-proof runs, while GPU memory usage decreased as GPU count increased.
Instead, CPU saturation and parallel processing coordination appear to be more critical for maximizing performance in proof generation.</p><p>In conclusion, <strong>GPU parallelism for a single proof does not scale efficiently</strong> beyond a certain point. ZK infrastructure teams aiming to improve throughput should prioritize software optimization, better CPU/GPU coordination, and parallelization across proofs rather than within a single one.</p>
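<p>To make the scaling picture concrete, the short calculation below (a sketch using the measured L40S numbers from the tables above) contrasts the speedup and parallel efficiency of adding GPUs to a single proof with the throughput of eight independent single-GPU proofs.</p>
<pre><code class="language-python">SECONDS_PER_DAY = 86_400

# Measured L40S prover times from the tables above (seconds per proof)
single_proof = {1: 792, 2: 705, 4: 672, 8: 688}  # N GPUs working on one proof
parallel_time = 1420                              # 8 proofs at once, 1 GPU each

for gpus, t in single_proof.items():
    speedup = single_proof[1] / t
    efficiency = speedup / gpus
    print(f"{gpus} GPU(s) on one proof: {speedup:.2f}x speedup, "
          f"{efficiency:.0%} parallel efficiency, {SECONDS_PER_DAY // t} proofs/day")

# Strategy B: eight independent proofs sharing the same machine
total_per_day = 8 * SECONDS_PER_DAY // parallel_time
print(f"8 parallel proofs: {total_per_day} proofs/day "
      f"(vs {SECONDS_PER_DAY // single_proof[8]} with all 8 GPUs on one proof)")</code></pre>
<p>Adding seven extra GPUs to one proof buys roughly a 15% speedup (about 14% parallel efficiency), while the same eight GPUs running independent proofs deliver close to four times the daily throughput, which is the core takeaway of this study.</p>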