Available Benchmarks

SiliconMark supports the following benchmarks for GPU performance testing:
  1. QuickMark - Comprehensive single-node GPU compute and memory performance test
  2. Cluster Network - Multi-node network connectivity and bandwidth testing
  3. Inference Benchmark - Multi-engine LLM inference performance (NVIDIA and AMD, using vLLM)
  4. Llama 3 Inference - Single-node LLM inference performance using NVIDIA NIM
  5. Llama 3 Fine-Tuning - Single-node LLM fine-tuning performance using NVIDIA NeMo
Each benchmark section includes configuration options, execution actions, result structures, and field metadata for interpreting the performance metrics.

QuickMark Benchmark

Overview

| Field | Value |
| --- | --- |
| Benchmark ID | quick_mark |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Comprehensive GPU compute, memory, and interconnect performance test |

Configuration

No configuration required — uses defaults.

Result Structure

Results include one entry per GPU in test_results and a combined aggregate_results. For single-GPU systems, aggregate_results mirrors the single GPU result.
{
  "test_results": [
    {
      "gpu_id": "GPU-000...",
      "gpu_model": "NVIDIA H100 80GB HBM3",
      "fp32_tflops": 367.5,
      "fp32_cuda_core_tflops": 53.6,
      "fp16_tflops": 684.6,
      "bf16_tflops": 729.6,
      "fp8_tflops": 1456.2,
      "mixed_precision_tflops": 648.9,
      "memory_bandwidth_gbs": 3025.0,
      "l2_bandwidth_gbs": 415.5,
      "host_to_device_bandwidth_gbs": 27.7,
      "device_to_host_bandwidth_gbs": 28.5,
      "kernel_launch_overhead_us": 7.6,
      "power_consumption_watts": 654.3,
      "temperature_centigrade": 59,
      "fp32_tflops_per_peak_watt": 0.564,
      "fp16_tflops_per_peak_watt": 1.082,
      "fp8_tflops_per_peak_watt": 2.163,
      "energy_consumption_wh": 45.2,
      "total_vram_mib": 81559.0,
      "gpu_clocks": {
        "compute": { "base_mhz": 1110, "max_mhz": 1980 },
        "memory":  { "base_mhz": 2619, "max_mhz": 2619 }
      },
      "sample_frequency_s": 5.0,
      "measurements_temp": [58, 59, 60, 59],
      "measurements_power_draw": [640.1, 654.3, 651.0, 648.2],
      "core_utilization_percent": [98.0, 99.0, 98.5, 99.0],
      "memory_utilization_percent": [95.0, 96.0, 95.5, 96.0],
      "core_clock_mhz": [1965, 1980, 1975, 1980],
      "memory_clock_mhz": [2619, 2619, 2619, 2619]
    }
  ],
  "aggregate_results": {
    "gpu_id": "aggregate",
    "fp32_tflops": 2863.0,
    "fp32_cuda_core_tflops": 424.2,
    "fp16_tflops": 5492.8,
    "bf16_tflops": 5755.8,
    "mixed_precision_tflops": 4222.2,
    "memory_bandwidth_gbs": 22717.3,
    "allreduce_bandwidth_gbs": 275.2,
    "broadcast_bandwidth_gbs": 392.0,
    "host_to_device_bandwidth_gbs": 110.6,
    "device_to_host_bandwidth_gbs": 112.9,
    "power_consumption_watts": 5078.5,
    "temperature_centigrade": 68,
    "fp32_tflops_per_peak_watt": 0.564,
    "fp16_tflops_per_peak_watt": 1.082,
    "gpu_bandwidth_matrix": {
      "gpu0_to_gpu1": {
        "connection_type": "nvlink_v4_18x",
        "simplex_gbs": 388.1,
        "duplex_gbs": 389.2
      },
      "gpu0_to_gpu2": {
        "connection_type": "nvlink_v4_18x",
        "simplex_gbs": 389.2,
        "duplex_gbs": 389.9
      }
    }
  },
  "timestamp": "YYYY-MM-DDT20:03:06Z"
}
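
A minimal sketch of consuming this payload, assuming it has been saved to a file named results.json (the filename is an assumption; field access follows the structure above):

import json

# Load a saved QuickMark result payload (filename is an assumption).
with open("results.json") as f:
    results = json.load(f)

# One-line summary per GPU.
for gpu in results["test_results"]:
    print(
        f'{gpu["gpu_id"]} ({gpu["gpu_model"]}): '
        f'FP16 {gpu["fp16_tflops"]} TFLOPS, '
        f'HBM {gpu["memory_bandwidth_gbs"]} GB/s, '
        f'peak {gpu["power_consumption_watts"]} W'
    )

# Combined view; on single-GPU systems this mirrors the lone GPU entry.
agg = results["aggregate_results"]
print(f'Aggregate FP16: {agg["fp16_tflops"]} TFLOPS')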

Field Metadata

| Field | Display Name | Unit | Notes |
| --- | --- | --- | --- |
| fp32_tflops | FP32 Performance | TFLOPS | Tensor Core (TF32) |
| fp32_cuda_core_tflops | FP32 CUDA Core Performance | TFLOPS | CUDA cores only |
| fp16_tflops | FP16 Performance | TFLOPS | |
| bf16_tflops | BF16 Performance | TFLOPS | |
| fp8_tflops | FP8 Performance | TFLOPS | Where supported |
| mixed_precision_tflops | Mixed Precision Performance | TFLOPS | FP16 compute, FP32 accumulate |
| memory_bandwidth_gbs | Memory Bandwidth | GB/s | HBM bandwidth |
| l2_bandwidth_gbs | L2 Cache Bandwidth | GB/s | |
| host_to_device_bandwidth_gbs | Host to Device Bandwidth | GB/s | PCIe |
| device_to_host_bandwidth_gbs | Device to Host Bandwidth | GB/s | PCIe |
| kernel_launch_overhead_us | Kernel Launch Overhead | μs | |
| allreduce_bandwidth_gbs | AllReduce Bandwidth | GB/s | Multi-GPU only |
| broadcast_bandwidth_gbs | Broadcast Bandwidth | GB/s | Multi-GPU only |
| gpu_bandwidth_matrix | GPU Bandwidth Matrix | GB/s | Per-pair simplex/duplex with connection type |
| fp32_tflops_per_peak_watt | FP32 TFLOPS per Peak Watt | TFLOPS/W | |
| fp16_tflops_per_peak_watt | FP16 TFLOPS per Peak Watt | TFLOPS/W | |
| fp8_tflops_per_peak_watt | FP8 TFLOPS per Peak Watt | TFLOPS/W | Where supported |
| energy_consumption_wh | Energy Consumption | Wh | Total for benchmark run |
| power_consumption_watts | Power Consumption | W | Peak power draw |
| temperature_centigrade | GPU Temperature | °C | Peak temperature |
| total_vram_mib | Total VRAM | MiB | |
| gpu_clocks | GPU Clock Speeds | MHz | Compute and memory base/max/application/boost |
| sample_frequency_s | Monitoring Sample Frequency | s | |
| measurements_temp | Temperature Timeseries | °C | One sample per interval |
| measurements_power_draw | Power Draw Timeseries | W | One sample per interval |
| core_utilization_percent | Core Utilization Timeseries | % | One sample per interval |
| memory_utilization_percent | Memory Utilization Timeseries | % | One sample per interval |
| core_clock_mhz | Core Clock Timeseries | MHz | One sample per interval |
| memory_clock_mhz | Memory Clock Timeseries | MHz | One sample per interval |

Cluster Network Benchmark

Overview

| Field | Value |
| --- | --- |
| Benchmark ID | cluster_network |
| Type | Multi-node |
| Min Nodes | 2 |
| Description | Tests network throughput and latency between all cluster nodes |

Result Structure

Results contain one measurement per directed node pair, so an n-node cluster produces n × (n − 1) measurements.
{
  "network_results": [
    {
      "host_ip": "192.168.1.10",
      "dest_ip": "192.168.1.11",
      "throughput_mbps": 45200.0,
      "throughput_gbps": 45.2,
      "latency_ms": 0.75
    }
  ],
  "measurement_count": 12
}
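
A short sketch of a summary pass over this payload (the filename and the n × (n − 1) check are assumptions based on the directed-pair description above):

import json

with open("network_results.json") as f:  # filename is an assumption
    data = json.load(f)

links = data["network_results"]
assert len(links) == data["measurement_count"]

# The slowest directed link often bounds collective performance.
worst = min(links, key=lambda l: l["throughput_gbps"])
print(f'{worst["host_ip"]} -> {worst["dest_ip"]}: '
      f'{worst["throughput_gbps"]} Gbps, {worst["latency_ms"]} ms RTT')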

Field Metadata

| Field | Display Name | Unit |
| --- | --- | --- |
| throughput_gbps | Throughput | Gbps |
| throughput_mbps | Throughput | Mbps |
| latency_ms | Latency (RTT) | ms |
| measurement_count | Total Links Tested | |

Inference Benchmark — vLLM

This benchmark measures LLM inference serving performance using vLLM. It supports both NVIDIA (CUDA) and AMD (ROCm) GPUs and runs without requiring an NGC API key.

Overview

| Field | Value |
| --- | --- |
| Benchmark ID | inference_benchmark |
| Type | Single-node |
| Min Nodes | 1 |
| GPU Support | NVIDIA and AMD |
| Description | Multi-engine LLM inference performance benchmark using vLLM |

Configuration

| Field | Type | Required | Description | Default |
| --- | --- | --- | --- | --- |
| inference_engine | string | No | Inference engine | "vllm" |
| model | string | No | Model name/path | "openai/gpt-oss-120b" |
| tp | int | No | Tensor parallel size | 1 |
| concurrency | int | Yes | Concurrent requests | |
| isl | int | Yes | Input sequence length (1–131072) | |
| osl | int | Yes | Output sequence length (1–131072) | |
| random_range_ratio | float | No | Prompt length variation | 0.0 |
| num_prompts | int | No | Number of prompts (0 = concurrency × 10) | 0 |
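
For example, a job exercising 32 concurrent requests with 1024-token prompts and 256-token completions might be configured as follows (values are illustrative; omitted optional fields fall back to the defaults above):
{
  "inference_engine": "vllm",
  "model": "openai/gpt-oss-120b",
  "tp": 1,
  "concurrency": 32,
  "isl": 1024,
  "osl": 256
}
With num_prompts left at its default of 0, such a run issues concurrency × 10 = 320 prompts.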

Result Structure

{
  "duration": 120.5,
  "total_input_tokens": 512000,
  "total_output_tokens": 128000,
  "request_throughput": 42.3,
  "output_token_throughput": 5890.4,
  "total_token_throughput": 8750.2,
  "mean_ttft_ms": 145.2,
  "median_ttft_ms": 138.7,
  "p99_ttft_ms": 312.4,
  "mean_tpot_ms": 18.4,
  "median_tpot_ms": 17.9,
  "p99_tpot_ms": 28.6,
  "mean_itl_ms": 18.4,
  "median_itl_ms": 17.9,
  "p99_itl_ms": 28.6,
  "mean_e2el_ms": 2340.5,
  "median_e2el_ms": 2180.3,
  "p99_e2el_ms": 4120.8
}
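
A brief sketch that pulls the headline latency figures out of a saved result (the filename is an assumption; keys match the structure above):

import json

with open("inference_result.json") as f:  # filename is an assumption
    r = json.load(f)

# Print mean / median / p99 for each latency family.
for metric in ("ttft", "tpot", "e2el"):
    print(f'{metric.upper()}: mean {r[f"mean_{metric}_ms"]} ms, '
          f'median {r[f"median_{metric}_ms"]} ms, '
          f'p99 {r[f"p99_{metric}_ms"]} ms')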

Field Metadata

| Field | Display Name | Unit |
| --- | --- | --- |
| request_throughput | Request Throughput | req/s |
| output_token_throughput | Output Token Throughput | tok/s |
| total_token_throughput | Total Token Throughput | tok/s |
| total_input_tokens | Total Input Tokens | tokens |
| total_output_tokens | Total Output Tokens | tokens |
| mean_ttft_ms | TTFT (Mean) | ms |
| median_ttft_ms | TTFT (Median) | ms |
| p99_ttft_ms | TTFT (P99) | ms |
| mean_tpot_ms | TPOT (Mean) | ms |
| median_tpot_ms | TPOT (Median) | ms |
| p99_tpot_ms | TPOT (P99) | ms |
| mean_itl_ms | Inter-Token Latency (Mean) | ms |
| median_itl_ms | Inter-Token Latency (Median) | ms |
| p99_itl_ms | Inter-Token Latency (P99) | ms |
| mean_e2el_ms | End-to-End Latency (Mean) | ms |
| median_e2el_ms | End-to-End Latency (Median) | ms |
| p99_e2el_ms | End-to-End Latency (P99) | ms |
| duration | Benchmark Duration | s |

Llama 3 Inference — NIM

This benchmark measures LLM inference serving performance using NVIDIA NIM containers, driven by GenAI-Perf.

Overview

| Field | Value |
| --- | --- |
| Benchmark ID | llama3_inf_single |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Llama 3 inference performance benchmark using NVIDIA NIM |

Requirements

  • NGC API Key: Required (set as NGC_API_KEY environment variable)
  • Podman: Required to run NIM and GenAI-Perf containers
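
For example (the key value is a placeholder):
export NGC_API_KEY="xxxxxxxxxxxxxxxxxxxx"  # Required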

Configuration

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| concurrency | int | Yes | Concurrent requests (1–10000) |
| isl | int | Yes | Input sequence length (1–131072) |
| osl | int | Yes | Output sequence length (1–131072) |

Multiple configurations can be submitted in a single job run.

Result Structure

Each configuration produces a BenchmarkMetrics object. Each metric field contains a full statistical distribution.
{
  "request_throughput":     { "unit": "req/s", "avg": 42.3, "p50": 41.8, "p90": 45.1, "p99": 47.2, "min": 38.0, "max": 48.5 },
  "request_latency":        { "unit": "ms",    "avg": 2340.5, "p50": 2180.3, "p90": 3800.1, "p99": 4120.8 },
  "time_to_first_token":    { "unit": "ms",    "avg": 145.2, "p50": 138.7, "p90": 290.4, "p99": 312.4 },
  "time_to_second_token":   { "unit": "ms",    "avg": 163.6, "p50": 156.8 },
  "inter_token_latency":    { "unit": "ms",    "avg": 18.4,  "p50": 17.9,  "p90": 25.1,  "p99": 28.6 },
  "output_token_throughput":           { "unit": "tok/s", "avg": 5890.4 },
  "output_token_throughput_per_request": { "unit": "tok/s", "avg": 139.1 },
  "output_sequence_length": { "unit": "tokens", "avg": 512.0 },
  "input_sequence_length":  { "unit": "tokens", "avg": 256.0 }
}
Each metric object may include: avg, p25, p50, p75, p90, p95, p99, min, max, std.
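
A minimal sketch of flattening one BenchmarkMetrics object into readable rows (the filename is an assumption; only the statistics actually present in each metric are printed):

import json

with open("nim_metrics.json") as f:  # filename is an assumption
    metrics = json.load(f)

for name, stats in metrics.items():
    unit = stats.pop("unit", "")
    # Remaining keys are whichever statistics this metric reports.
    row = ", ".join(f"{k}={v}" for k, v in stats.items())
    print(f"{name} [{unit}]: {row}")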

Field Metadata

| Field | Display Name | Unit |
| --- | --- | --- |
| request_throughput | Request Throughput | req/s |
| request_latency | Request Latency | ms |
| time_to_first_token | Time to First Token (TTFT) | ms |
| time_to_second_token | Time to Second Token | ms |
| inter_token_latency | Inter-Token Latency (ITL) | ms |
| output_token_throughput | Output Token Throughput | tok/s |
| output_token_throughput_per_request | Output Token Throughput per Request | tok/s |
| output_sequence_length | Output Sequence Length | tokens |
| input_sequence_length | Input Sequence Length | tokens |

Llama 3 Fine-Tuning — NeMo

This benchmark measures LLM fine-tuning performance using NVIDIA’s NeMo framework with automatic memory-aware parallelism configuration.

Overview

| Field | Value |
| --- | --- |
| Benchmark ID | llama3_ft_single |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Llama 3 fine-tuning performance benchmark using NVIDIA NeMo |

Configuration

| Field | Type | Required | Description | Default | Constraints |
| --- | --- | --- | --- | --- | --- |
| model_size | string | No | Model size | "8b" | "8b", "70b", "405b" |
| dtype | string | No | Data type | "fp8" | "fp8", "bf16" |
| fine_tune_type | string | No | Fine-tuning method | "lora" | "lora", "full" |
| max_steps | int | No | Training steps | 50 | |

Fixed Parameters

  • Sequence Length: 4096 tokens
  • Micro Batch Size: 1 (optimized for packed sequences)
  • Training Data: Synthetic (SquadDataModule)

Requirements

Software Requirements

  • NeMo Container: nvcr.io/nvidia/nemo:25.11.01 — downloaded automatically if not present
  • HuggingFace Token: Required (set as HF_TOKEN environment variable). Get your token from https://huggingface.co/settings/tokens
  • Docker: Required to run the NeMo container
  • Disk Space:
    • 8B model: ~75GB (55GB base + 20GB model)
    • 70B model: ~205GB (55GB base + 150GB model)
    • 405B model: ~905GB (55GB base + 850GB model)

Example environment setup:
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"  # Required
export STAGE_PATH="$HOME/workspace/benchmark_stage"  # Optional, defaults to ~/workspace/benchmark_stage

Hardware Requirements

The benchmark automatically calculates memory requirements based on model configuration:
| Model | FP8 Memory | BF16 Memory | LoRA Memory, FP8 / BF16 (~30% reduction) |
| --- | --- | --- | --- |
| 8B | 20GB total | 35GB total | ~14GB / ~25GB |
| 70B | 85GB total | 160GB total | ~60GB / ~112GB |
| 405B | 450GB total | 850GB total | ~315GB / N/A |
Minimum GPU requirements:
  • 8B LoRA: 1× GPU with ≥16GB VRAM
  • 8B full: 1× GPU with ≥24GB VRAM
  • 70B: ≥2 GPUs
  • 405B: ≥8 GPUs (FP8 + LoRA only)

Parallelism Strategy

The benchmark automatically calculates optimal parallelism using a memory-aware strategy (a code sketch follows the list):
  1. Tensor Parallelism (TP) = smallest power of 2 such that total_memory / TP ≤ gpu_memory
  2. Data Parallelism (DP) = total_gpus / TP
  3. Global Batch Size (GBS) = min(DP × 2, model_cap) — caps: 8B→64, 70B→32, 405B→16
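
A direct transcription of these three rules into code, as a sketch (memory figures come from the Hardware Requirements table; the real implementation may apply extra headroom, so its choices can differ):

def parallelism_plan(total_memory_gb, gpu_memory_gb, total_gpus, model_size):
    """Sketch of the memory-aware parallelism rules described above."""
    caps = {"8b": 64, "70b": 32, "405b": 16}  # per-model GBS caps

    # 1. TP = smallest power of 2 with total_memory / TP <= gpu_memory.
    tp = 1
    while total_memory_gb / tp > gpu_memory_gb and tp < total_gpus:
        tp *= 2

    # 2. DP = total GPUs / TP.
    dp = total_gpus // tp

    # 3. GBS = min(DP * 2, model cap).
    gbs = min(dp * 2, caps[model_size])
    return tp, dp, gbs

# Example: 8x 80GB GPUs, 8B model at ~14 GB total (fp8 + LoRA).
print(parallelism_plan(14, 80, 8, "8b"))  # -> (1, 8, 16)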

Example Configurations

| GPUs | Model | dtype | Fine-tune | TP | DP | GBS |
| --- | --- | --- | --- | --- | --- | --- |
| 8× 80GB | 8B | fp8 | lora | 1 | 8 | 16 |
| 8× 80GB | 70B | fp8 | lora | 1 | 8 | 16 |
| 8× 80GB | 405B | fp8 | lora | 8 | 1 | 2 |

Result Structure

{
  "tokens_per_step": 65536,
  "tokens_per_second": 48617.2,
  "train_step_time_mean": 1.348,
  "train_step_time_std": 0.003,
  "step_time_cv_percent": 0.223,
  "time_to_1t_tokens_days": 238.1,
  "peak_memory_gb": 68.4,
  "memory_efficiency_percent": 85.5
}

Metrics Calculation

  • Tokens per Step = global_batch_size × sequence_length
  • Tokens per Second = tokens_per_step ÷ train_step_time_mean
  • Time to 1T Tokens = 10¹² ÷ (tokens_per_second × 86400) days
  • Step Time CV = (train_step_time_std ÷ train_step_time_mean) × 100
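
The same formulas as a runnable sketch (the step-time samples, batch size, and sequence length below are illustrative):

import statistics

def ft_metrics(step_times_s, global_batch_size=16, sequence_length=4096):
    """Derived fine-tuning metrics, per the formulas above."""
    mean = statistics.mean(step_times_s)
    std = statistics.stdev(step_times_s)
    tokens_per_step = global_batch_size * sequence_length
    tokens_per_second = tokens_per_step / mean
    return {
        "tokens_per_step": tokens_per_step,
        "tokens_per_second": round(tokens_per_second, 1),
        "train_step_time_mean": round(mean, 3),
        "train_step_time_std": round(std, 3),
        "step_time_cv_percent": round(std / mean * 100, 3),
        "time_to_1t_tokens_days": round(1e12 / (tokens_per_second * 86_400), 1),
    }

print(ft_metrics([1.345, 1.348, 1.351, 1.348]))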

Field Metadata

| Field | Display Name | Unit |
| --- | --- | --- |
| tokens_per_step | Tokens per Step | tokens |
| tokens_per_second | Tokens per Second | tok/s |
| train_step_time_mean | Training Step Time (Mean) | s |
| train_step_time_std | Training Step Time (Std Dev) | s |
| step_time_cv_percent | Step Time Coefficient of Variation | % |
| time_to_1t_tokens_days | Time to 1T Tokens | days |
| peak_memory_gb | Peak GPU Memory Usage | GB |
| memory_efficiency_percent | GPU Memory Efficiency | % |