Available Benchmarks
SiliconMark supports various benchmarks for GPU performance testing:
- QuickMark - Comprehensive single-node GPU compute and memory performance test
- Cluster Network - Multi-node network connectivity and bandwidth testing
- Inference Benchmark - Multi-engine LLM inference performance (NVIDIA and AMD, using vLLM)
- Llama 3 Inference - Single-node LLM inference performance using NVIDIA NIM
- Llama 3 Fine-Tuning - Single-node LLM fine-tuning performance using NVIDIA NeMo
QuickMark Benchmark
Overview
| Field | Value |
|---|---|
| Benchmark ID | quick_mark |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Comprehensive GPU compute, memory, and interconnect performance test |
Configuration
No configuration required — uses defaults.
Result Structure
Results include one entry per GPU in `test_results` and a combined `aggregate_results`. For single-GPU systems, `aggregate_results` mirrors the single GPU result.
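A minimal parsing sketch, assuming the result is delivered as JSON with the `test_results` array and `aggregate_results` object described above (the file name and envelope shape are assumptions, not a documented format):

```python
import json

# Hypothetical parsing sketch: assumes a JSON payload with one entry
# per GPU under "test_results" and a combined "aggregate_results".
with open("quick_mark_result.json") as f:  # file name is made up
    result = json.load(f)

for gpu in result["test_results"]:
    # Field names come from the Field Metadata table below.
    print(gpu.get("fp16_tflops"), gpu.get("memory_bandwidth_gbs"))

# On single-GPU systems this mirrors the single per-GPU entry.
print(result["aggregate_results"].get("fp16_tflops"))
```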
Field Metadata
| Field | Display Name | Unit | Notes |
|---|---|---|---|
| `fp32_tflops` | FP32 Performance | TFLOPS | Tensor Core (TF32) |
| `fp32_cuda_core_tflops` | FP32 CUDA Core Performance | TFLOPS | CUDA cores only |
| `fp16_tflops` | FP16 Performance | TFLOPS | |
| `bf16_tflops` | BF16 Performance | TFLOPS | |
| `fp8_tflops` | FP8 Performance | TFLOPS | Where supported |
| `mixed_precision_tflops` | Mixed Precision Performance | TFLOPS | FP16 compute, FP32 accumulate |
| `memory_bandwidth_gbs` | Memory Bandwidth | GB/s | HBM bandwidth |
| `l2_bandwidth_gbs` | L2 Cache Bandwidth | GB/s | |
| `host_to_device_bandwidth_gbs` | Host to Device Bandwidth | GB/s | PCIe |
| `device_to_host_bandwidth_gbs` | Device to Host Bandwidth | GB/s | PCIe |
| `kernel_launch_overhead_us` | Kernel Launch Overhead | μs | |
| `allreduce_bandwidth_gbs` | AllReduce Bandwidth | GB/s | Multi-GPU only |
| `broadcast_bandwidth_gbs` | Broadcast Bandwidth | GB/s | Multi-GPU only |
| `gpu_bandwidth_matrix` | GPU Bandwidth Matrix | GB/s | Per-pair simplex/duplex with connection type |
| `fp32_tflops_per_peak_watt` | FP32 TFLOPS per Peak Watt | TFLOPS/W | |
| `fp16_tflops_per_peak_watt` | FP16 TFLOPS per Peak Watt | TFLOPS/W | |
| `fp8_tflops_per_peak_watt` | FP8 TFLOPS per Peak Watt | TFLOPS/W | Where supported |
| `energy_consumption_wh` | Energy Consumption | Wh | Total for benchmark run |
| `power_consumption_watts` | Power Consumption | W | Peak power draw |
| `temperature_centigrade` | GPU Temperature | °C | Peak temperature |
| `total_vram_mib` | Total VRAM | MiB | |
| `gpu_clocks` | GPU Clock Speeds | MHz | Compute and memory base/max/application/boost |
| `sample_frequency_s` | Monitoring Sample Frequency | s | |
| `measurements_temp` | Temperature Timeseries | °C | One sample per interval |
| `measurements_power_draw` | Power Draw Timeseries | W | One sample per interval |
| `core_utilization_percent` | Core Utilization Timeseries | % | One sample per interval |
| `memory_utilization_percent` | Memory Utilization Timeseries | % | One sample per interval |
| `core_clock_mhz` | Core Clock Timeseries | MHz | One sample per interval |
| `memory_clock_mhz` | Memory Clock Timeseries | MHz | One sample per interval |
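The timeseries fields pair with `sample_frequency_s`: sample i was taken roughly i × `sample_frequency_s` seconds after monitoring began. A small sketch, assuming each timeseries is a flat list of numbers:

```python
def with_timestamps(samples, sample_frequency_s):
    """Pair each monitoring sample with its offset from the start of the run."""
    return [(i * sample_frequency_s, v) for i, v in enumerate(samples)]

# Made-up power-draw samples at a 1-second sample frequency.
for t, watts in with_timestamps([312.0, 640.5, 655.2, 410.8], 1.0):
    print(f"t={t:4.1f}s  power={watts:.1f} W")
```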
Cluster Network Benchmark
Overview
| Field | Value |
|---|---|
| Benchmark ID | cluster_network |
| Type | Multi-node |
| Min Nodes | 2 |
| Description | Tests network throughput and latency between all cluster nodes |
Result Structure
One measurement per directed node pair.
Field Metadata
| Field | Display Name | Unit |
|---|---|---|
| `throughput_gbps` | Throughput | Gbps |
| `throughput_mbps` | Throughput | Mbps |
| `latency_ms` | Latency (RTT) | ms |
| `measurement_count` | Total Links Tested | |
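Because measurements are per directed pair, a cluster of N nodes yields N × (N − 1) links. A small illustration (node names are made up):

```python
from itertools import permutations

# Any cluster of N nodes yields N * (N - 1) directed
# (source, destination) links.
nodes = ["node-0", "node-1", "node-2", "node-3"]
links = list(permutations(nodes, 2))

print(len(links))  # 12, presumably what measurement_count reports
for src, dst in links:
    print(f"{src} -> {dst}")
```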
Inference Benchmark — vLLM
This benchmark measures LLM inference serving performance using vLLM. It supports both NVIDIA (CUDA) and AMD (ROCm) GPUs and runs without requiring an NGC API key.
Overview
| Field | Value |
|---|---|
| Benchmark ID | inference_benchmark |
| Type | Single-node |
| Min Nodes | 1 |
| GPU Support | NVIDIA and AMD |
| Description | Multi-engine LLM inference performance benchmark using vLLM |
Configuration
| Field | Type | Required | Description | Default |
|---|---|---|---|---|
| `inference_engine` | string | No | Inference engine | "vllm" |
| `model` | string | No | Model name/path | "openai/gpt-oss-120b" |
| `tp` | int | No | Tensor parallel size | 1 |
| `concurrency` | int | Yes | Concurrent requests | — |
| `isl` | int | Yes | Input sequence length (1–131072) | — |
| `osl` | int | Yes | Output sequence length (1–131072) | — |
| `random_range_ratio` | float | No | Prompt length variation | 0.0 |
| `num_prompts` | int | No | Number of prompts (0 = concurrency × 10) | 0 |
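A hedged configuration example, written as a Python dict since this section does not pin down the submission format; field names and defaults come from the table above:

```python
# Illustrative inference_benchmark configuration. Only concurrency,
# isl, and osl are required; omitted fields fall back to the defaults.
config = {
    "inference_engine": "vllm",       # default
    "model": "openai/gpt-oss-120b",   # default
    "tp": 2,                          # tensor-parallel across 2 GPUs
    "concurrency": 32,                # required
    "isl": 1024,                      # required, 1-131072
    "osl": 256,                       # required, 1-131072
    "random_range_ratio": 0.1,        # optional prompt length variation
    "num_prompts": 0,                 # 0 -> concurrency * 10 = 320 prompts
}
```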
Result Structure
Field Metadata
| Field | Display Name | Unit |
|---|---|---|
| `request_throughput` | Request Throughput | req/s |
| `output_token_throughput` | Output Token Throughput | tok/s |
| `total_token_throughput` | Total Token Throughput | tok/s |
| `total_input_tokens` | Total Input Tokens | tokens |
| `total_output_tokens` | Total Output Tokens | tokens |
| `mean_ttft_ms` | TTFT (Mean) | ms |
| `median_ttft_ms` | TTFT (Median) | ms |
| `p99_ttft_ms` | TTFT (P99) | ms |
| `mean_tpot_ms` | TPOT (Mean) | ms |
| `median_tpot_ms` | TPOT (Median) | ms |
| `p99_tpot_ms` | TPOT (P99) | ms |
| `mean_itl_ms` | Inter-Token Latency (Mean) | ms |
| `median_itl_ms` | Inter-Token Latency (Median) | ms |
| `p99_itl_ms` | Inter-Token Latency (P99) | ms |
| `mean_e2el_ms` | End-to-End Latency (Mean) | ms |
| `median_e2el_ms` | End-to-End Latency (Median) | ms |
| `p99_e2el_ms` | End-to-End Latency (P99) | ms |
| `duration` | Benchmark Duration | s |
Llama 3 Inference — NIM
This benchmark measures LLM inference serving performance using NVIDIA NIM containers, driven by GenAI-Perf.
Overview
| Field | Value |
|---|---|
| Benchmark ID | llama3_inf_single |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Llama 3 inference performance benchmark using NVIDIA NIM |
Requirements
- NGC API Key: Required (set as the `NGC_API_KEY` environment variable)
- Podman: Required to run the NIM and GenAI-Perf containers
Configuration
| Field | Type | Required | Description |
|---|---|---|---|
| `concurrency` | int | Yes | Concurrent requests (1–10000) |
| `isl` | int | Yes | Input sequence length (1–131072) |
| `osl` | int | Yes | Output sequence length (1–131072) |
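A similar hedged sketch for this benchmark (submission format assumed; all three fields are required):

```python
import os

# This benchmark requires an NGC API key in the environment.
assert os.environ.get("NGC_API_KEY"), "set NGC_API_KEY before running"

config = {
    "concurrency": 64,  # 1-10000
    "isl": 2048,        # input sequence length, 1-131072
    "osl": 512,         # output sequence length, 1-131072
}
```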
Result Structure
Each configuration produces a `BenchmarkMetrics` object. Each metric field contains a full statistical distribution: `avg`, `p25`, `p50`, `p75`, `p90`, `p95`, `p99`, `min`, `max`, `std`.
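A sketch of reading one such distribution, assuming each metric serializes as a mapping from statistic name to value (the shape and the numbers here are illustrative, not real output):

```python
# Illustrative fragment only: the structure and numbers are made up.
metrics = {
    "time_to_first_token": {
        "avg": 41.2, "p25": 33.0, "p50": 38.5, "p75": 46.1,
        "p90": 55.7, "p95": 61.9, "p99": 74.3,
        "min": 28.4, "max": 92.0, "std": 11.6,
    },
}

ttft = metrics["time_to_first_token"]
print(f"TTFT p99: {ttft['p99']} ms (median {ttft['p50']} ms)")
```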
Field Metadata
| Field | Display Name | Unit |
|---|---|---|
| `request_throughput` | Request Throughput | req/s |
| `request_latency` | Request Latency | ms |
| `time_to_first_token` | Time to First Token (TTFT) | ms |
| `time_to_second_token` | Time to Second Token | ms |
| `inter_token_latency` | Inter-Token Latency (ITL) | ms |
| `output_token_throughput` | Output Token Throughput | tok/s |
| `output_token_throughput_per_request` | Output Token Throughput per Request | tok/s |
| `output_sequence_length` | Output Sequence Length | tokens |
| `input_sequence_length` | Input Sequence Length | tokens |
Llama 3 Fine-Tuning — NeMo
This benchmark measures LLM fine-tuning performance using NVIDIA’s NeMo framework with automatic memory-aware parallelism configuration.
Overview
| Field | Value |
|---|---|
| Benchmark ID | llama3_ft_single |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Llama 3 fine-tuning performance benchmark using NVIDIA NeMo |
Configuration
| Field | Type | Required | Description | Default | Constraints |
|---|---|---|---|---|---|
| `model_size` | string | No | Model size | "8b" | "8b", "70b", "405b" |
| `dtype` | string | No | Data type | "fp8" | "fp8", "bf16" |
| `fine_tune_type` | string | No | Fine-tuning method | "lora" | "lora", "full" |
| `max_steps` | int | No | Training steps | 50 | |
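An illustrative configuration (format assumed; every field is optional and defaults come from the table above):

```python
# Illustrative llama3_ft_single configuration; every field is optional.
config = {
    "model_size": "70b",       # "8b" (default), "70b", or "405b"
    "dtype": "bf16",           # "fp8" (default) or "bf16"
    "fine_tune_type": "lora",  # "lora" (default) or "full"
    "max_steps": 50,           # default
}
```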
Fixed Parameters
- Sequence Length: 4096 tokens
- Micro Batch Size: 1 (optimized for packed sequences)
- Training Data: Synthetic (SquadDataModule)
Requirements
Software Requirements
- NeMo Container: `nvcr.io/nvidia/nemo:25.11.01` — downloaded automatically if not present
- HuggingFace Token: Required (set as the `HF_TOKEN` environment variable). Get your token from https://huggingface.co/settings/tokens
- Docker: Required to run the NeMo container
- Disk Space:
  - 8B model: ~75GB (55GB base + 20GB model)
  - 70B model: ~205GB (55GB base + 150GB model)
  - 405B model: ~905GB (55GB base + 850GB model)
Hardware Requirements
The benchmark automatically calculates memory requirements based on the model configuration:
| Model | FP8 Memory | BF16 Memory | LoRA reduction (~30%) |
|---|---|---|---|
| 8B | 20GB total | 35GB total | ~14GB / ~25GB |
| 70B | 85GB total | 160GB total | ~60GB / ~112GB |
| 405B | 450GB total | 850GB total | ~315GB / N/A |
- 8B LoRA: 1× GPU with ≥16GB VRAM
- 8B full: 1× GPU with ≥24GB VRAM
- 70B: ≥2 GPUs
- 405B: ≥8 GPUs (FP8 + LoRA only)
Parallelism Strategy
The benchmark automatically calculates optimal parallelism using a memory-aware strategy (see the sketch after this list):
- Tensor Parallelism (TP) = smallest power of 2 such that `total_memory / TP ≤ gpu_memory`
- Data Parallelism (DP) = `total_gpus / TP`
- Global Batch Size (GBS) = `min(DP × 2, model_cap)` — caps: 8B→64, 70B→32, 405B→16
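A minimal sketch of that selection. Which memory figure feeds in (with or without the LoRA reduction) is not pinned down in this section, so the sketch takes it as an explicit input:

```python
def plan_parallelism(total_memory_gb, gpu_memory_gb, total_gpus, model_cap):
    """Memory-aware TP/DP/GBS selection as described above."""
    # TP = smallest power of 2 with total_memory / TP <= gpu_memory.
    tp = 1
    while total_memory_gb / tp > gpu_memory_gb:
        tp *= 2
    dp = total_gpus // tp         # DP = total_gpus / TP
    gbs = min(dp * 2, model_cap)  # GBS = min(DP * 2, model cap)
    return tp, dp, gbs

# Roughly reproduces the 405B row of the example table below
# (450 GB total, 8x 80 GB GPUs, cap 16).
print(plan_parallelism(450, 80, 8, 16))  # (8, 1, 2)
```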
Example Configurations
| GPUs | Model | dtype | Fine-tune | TP | DP | GBS |
|---|---|---|---|---|---|---|
| 8× 80GB | 8B | fp8 | lora | 1 | 8 | 16 |
| 8× 80GB | 70B | fp8 | lora | 1 | 8 | 16 |
| 8× 80GB | 405B | fp8 | lora | 8 | 1 | 2 |
Result Structure
Metrics Calculation
- Tokens per Step = `global_batch_size × sequence_length`
- Tokens per Second = `tokens_per_step ÷ train_step_time_mean`
- Time to 1T Tokens = `10¹² ÷ (tokens_per_second × 86400)` days
- Step Time CV = `(train_step_time_std ÷ train_step_time_mean) × 100`
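The same formulas in runnable form, with made-up inputs for illustration:

```python
def fine_tuning_metrics(global_batch_size, sequence_length,
                        train_step_time_mean, train_step_time_std):
    tokens_per_step = global_batch_size * sequence_length
    tokens_per_second = tokens_per_step / train_step_time_mean
    time_to_1t_tokens_days = 1e12 / (tokens_per_second * 86400)
    step_time_cv_percent = train_step_time_std / train_step_time_mean * 100
    return (tokens_per_step, tokens_per_second,
            time_to_1t_tokens_days, step_time_cv_percent)

# Made-up example: GBS=16, 4096-token sequences, 2.0 s steps with 0.1 s std.
# -> 65536 tokens/step, 32768 tok/s, ~353.2 days to 1T tokens, CV 5.0%.
print(fine_tuning_metrics(16, 4096, 2.0, 0.1))
```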
Field Metadata
| Field | Display Name | Unit |
|---|---|---|
| `tokens_per_step` | Tokens per Step | tokens |
| `tokens_per_second` | Tokens per Second | tok/s |
| `train_step_time_mean` | Training Step Time (Mean) | s |
| `train_step_time_std` | Training Step Time (Std Dev) | s |
| `step_time_cv_percent` | Step Time Coefficient of Variation | % |
| `time_to_1t_tokens_days` | Time to 1T Tokens | days |
| `peak_memory_gb` | Peak GPU Memory Usage | GB |
| `memory_efficiency_percent` | GPU Memory Efficiency | % |