
Available Benchmarks

SiliconMark supports various benchmarks for GPU performance testing.
  1. QuickMark - A comprehensive single-node GPU performance test that always runs first
  2. Cluster Network benchmark - Multi-node network connectivity and bandwidth testing
  3. LLM Fine-Tuning benchmark - Single-node LLM fine-tuning performance benchmark
Each benchmark section includes configuration options, execution actions, result structures, and field metadata for interpreting the performance metrics.

QuickMark Benchmark

Overview

| Field        | Value                              |
|--------------|------------------------------------|
| Benchmark ID | quick_mark                         |
| Type         | Single-node                        |
| Min Nodes    | 1                                  |
| Description  | Comprehensive GPU performance test |

Configuration

No configuration required - uses defaults.

Result Structure

{
  "test_results": [
    {
      "gpu_id": "GPU-000...",
      "bf16_tflops": 156.8,
      "fp16_tflops": 78.5,
      "fp32_tflops": 39.2,
      "fp32_cuda_core_tflops": 19.5,
      "mixed_precision_tflops": 312.4,
      "l2_bandwidth_gbs": 3890.5,
      "memory_bandwidth_gbs": 1935.4,
      "temperature_centigrade": 65,
      "power_consumption_watts": 350,
      "kernel_launch_overhead_us": 12.5,
      "device_to_host_bandwidth_gbs": 24.8,
      "host_to_device_bandwidth_gbs": 25.2
    },
    {
       "gpu_id": "GPU-111...",
       ...
    }
  ],
  "aggregate_results": {
    "total_fp16_tflops": 628.0,
    "total_fp32_tflops": 313.6,
    "avg_temperature": 67.2,
    "allreduce_bandwidth_gbs": 180.5,
    "broadcast_bandwidth_gbs": 175.3,
    "fp16_tflops_per_peak_watt": 1.79,
    "fp32_tflops_per_peak_watt": 0.89,
    "gpu_bandwidth_matrix": [
      {
        "source_gpu": 0,
        "target_gpu": 1,
        "duplex_gbs": 450.2,
        "simplex_gbs": 225.1,
        "connection_type": ...
      }
    ]
  }
}
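
As a rough illustration, the per-GPU entries can be cross-checked against the aggregate figures. The sketch below assumes the result JSON has been saved to a local file; the file name and the 5 °C temperature threshold are illustrative choices, not part of SiliconMark.

import json

# Load a saved QuickMark result (file name is illustrative).
with open("quick_mark_result.json") as f:
    result = json.load(f)

per_gpu = result["test_results"]
agg = result["aggregate_results"]

# Sum the per-GPU FP16 figures and compare against the reported aggregate.
summed_fp16 = sum(gpu["fp16_tflops"] for gpu in per_gpu)
print(f"per-GPU FP16 sum: {summed_fp16:.1f} TFLOPS "
      f"(reported total: {agg['total_fp16_tflops']:.1f} TFLOPS)")

# Flag any GPU running noticeably hotter than the reported average
# (the 5 °C margin is an arbitrary example threshold).
for gpu in per_gpu:
    if gpu["temperature_centigrade"] > agg["avg_temperature"] + 5:
        print(f"{gpu['gpu_id']}: {gpu['temperature_centigrade']} °C, "
              f"cluster average {agg['avg_temperature']:.1f} °C")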

Field Metadata

| Field | Display Name | Unit |
|-------|--------------|------|
| bf16_tflops | BF16 Performance | TFLOPS |
| fp16_tflops | FP16 Performance | TFLOPS |
| fp32_tflops | FP32 Performance | TFLOPS |
| fp32_cuda_core_tflops | FP32 CUDA Core Performance | TFLOPS |
| mixed_precision_tflops | Mixed Precision Performance | TFLOPS |
| l2_bandwidth_gbs | L2 Cache Bandwidth | GB/s |
| memory_bandwidth_gbs | Memory Bandwidth | GB/s |
| temperature_centigrade | GPU Temperature | °C |
| power_consumption_watts | Power Consumption | W |
| kernel_launch_overhead_us | Kernel Launch Overhead | μs |
| device_to_host_bandwidth_gbs | Device to Host Bandwidth | GB/s |
| host_to_device_bandwidth_gbs | Host to Device Bandwidth | GB/s |
| allreduce_bandwidth_gbs | AllReduce Bandwidth | GB/s |
| broadcast_bandwidth_gbs | Broadcast Bandwidth | GB/s |
| fp16_tflops_per_peak_watt | FP16 TFLOPS per Peak Watt | TFLOPS/W |
| fp32_tflops_per_peak_watt | FP32 TFLOPS per Peak Watt | TFLOPS/W |
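
One way to use this metadata is to drive report formatting. The mapping below copies a few rows from the table above; the helper function itself is an illustrative sketch, not part of SiliconMark.

# Display names and units taken from the Field Metadata table above.
QUICK_MARK_FIELDS = {
    "fp16_tflops": ("FP16 Performance", "TFLOPS"),
    "memory_bandwidth_gbs": ("Memory Bandwidth", "GB/s"),
    "temperature_centigrade": ("GPU Temperature", "°C"),
    "kernel_launch_overhead_us": ("Kernel Launch Overhead", "μs"),
}

def format_metric(field: str, value: float) -> str:
    """Render a raw result field as a human-readable line."""
    display_name, unit = QUICK_MARK_FIELDS[field]
    return f"{display_name}: {value} {unit}"

print(format_metric("fp16_tflops", 78.5))  # FP16 Performance: 78.5 TFLOPS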

Cluster Network Benchmark

Overview

| Field        | Value                                                           |
|--------------|-----------------------------------------------------------------|
| Benchmark ID | cluster_network                                                 |
| Type         | Multi-node                                                      |
| Min Nodes    | 2                                                               |
| Description  | Tests network connectivity and bandwidth between cluster nodes |

Result Structure (Cluster-level)

{
  "avg_bandwidth_gbps": 45.2,
  "avg_latency": 0.8,
  "min_bandwidth_gbps": 42.1,
  "total_links_tested": 12,
  "node_count": 4,
  "measurements": [
    {
      "host_ip": "192.168.1.10",
      "dest_ip": "192.168.1.11",
      "throughput_mbps": 45200,
      "throughput_gbps": 45.2,
      "latency_ms": 0.75
    }
  ]
}
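
A short post-processing sketch for the cluster-level result: it finds the slowest measured link and, assuming every ordered pair of nodes is tested once, checks the link count against node_count (4 nodes give 12 directed links, which matches the example). The file name and the pairwise-testing assumption are illustrative, not stated by SiliconMark.

import json

# Load a saved cluster network result (file name is illustrative).
with open("cluster_network_result.json") as f:
    result = json.load(f)

# Identify the slowest measured link; this should agree with the
# cluster-level min_bandwidth_gbps field.
slowest = min(result["measurements"], key=lambda m: m["throughput_gbps"])
print(f"slowest link: {slowest['host_ip']} -> {slowest['dest_ip']} "
      f"at {slowest['throughput_gbps']} Gbps "
      f"(reported minimum: {result['min_bandwidth_gbps']} Gbps)")

# Assuming each ordered node pair is tested once, n nodes yield
# n * (n - 1) directed links; for the 4-node example that is 12.
n = result["node_count"]
print(f"expected directed links for {n} nodes: {n * (n - 1)}")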

Field Metadata

| Field | Display Name | Unit |
|-------|--------------|------|
| throughput_gbps | Throughput | Gbps |
| throughput_mbps | Throughput | Mbps |
| latency_ms | Latency | ms |
| avg_bandwidth_gbps | Average Bandwidth | Gbps |
| min_bandwidth_gbps | Minimum Bandwidth | Gbps |
| total_links_tested | Total Links Tested | count |
| node_count | Total Node Count | count |

Llama3 Fine-tuning Single Node

This benchmark requires a Hugging Face token and follows the NVIDIA DGX benchmarking methodology using the NeMo container. Set the HF_TOKEN environment variable to your Hugging Face token before running the benchmark.
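
A minimal pre-flight check in Python, assuming only that the token is read from the HF_TOKEN environment variable as described above; how SiliconMark consumes the token internally is not shown here.

import os

# The benchmark expects the Hugging Face token in HF_TOKEN; fail early
# if it is missing so the run does not start without credentials.
if not os.environ.get("HF_TOKEN"):
    raise RuntimeError(
        "HF_TOKEN is not set; export your Hugging Face token "
        "before running the llama3_ft_single benchmark."
    )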

Overview

| Field        | Value                                          |
|--------------|------------------------------------------------|
| Benchmark ID | llama3_ft_single                               |
| Type         | Single-node                                    |
| Min Nodes    | 1                                              |
| Description  | Llama3 model fine-tuning performance benchmark |

Configuration

| Field | Type | Required | Description | Default | Constraints |
|-------|------|----------|-------------|---------|-------------|
| model_size | string | No | Model size | "8b" | "8b", "70b", "405b" |
| dtype | string | No | Data type | "fp16" | "fp8", "fp16", "bf16" |
| fine_tune_type | string | No | Fine-tuning method | "lora" | "lora", "sft" |
| global_batch_size | integer | No | Global batch size | 8 | [1..128] |
| max_steps | integer | No | Maximum steps | 50 | [1..100] |

Configuration Example

{
  "model_size": "8b",
  "dtype": "fp16",
  "fine_tune_type": "lora",
  "global_batch_size": 8,
  "max_steps": 50
}

Configuration Constraints

  • For 405b model: maximum batch size is 32
  • Batch size must be between 1 and 128
  • Max steps must be between 1 and 100
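
These constraints can be checked before submitting a run. The validator below is an illustrative sketch of the rules listed above (dtype and fine_tune_type value checks are omitted for brevity); the function name and error messages are not part of SiliconMark.

# Illustrative validator for the Llama3 fine-tuning configuration rules.
def validate_llama3_ft_config(cfg: dict) -> None:
    model_size = cfg.get("model_size", "8b")
    batch_size = cfg.get("global_batch_size", 8)
    max_steps = cfg.get("max_steps", 50)

    if model_size not in ("8b", "70b", "405b"):
        raise ValueError(f"unsupported model_size: {model_size}")
    if not 1 <= batch_size <= 128:
        raise ValueError("global_batch_size must be between 1 and 128")
    if model_size == "405b" and batch_size > 32:
        raise ValueError("405b model allows a maximum batch size of 32")
    if not 1 <= max_steps <= 100:
        raise ValueError("max_steps must be between 1 and 100")

try:
    validate_llama3_ft_config({"model_size": "405b", "global_batch_size": 64})
except ValueError as err:
    print(err)  # 405b model allows a maximum batch size of 32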

Result Structure

{
  "tokens_per_step": 16384,
  "tokens_per_second": 125000,
  "train_step_time_mean": 0.328,
  "train_step_time_std": 0.012,
  "step_time_cv_percent": 3.66,
  "time_to_1t_tokens_days": 92.6
}
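
The time_to_1t_tokens_days figure appears to be a straight extrapolation from the measured throughput: with the example numbers, one trillion tokens at 125,000 tokens/s works out to roughly 92.6 days. A quick check of that arithmetic:

# Extrapolate time to process one trillion tokens from tokens_per_second.
tokens_per_second = 125_000
seconds_for_1t = 1e12 / tokens_per_second   # ~8.0e6 seconds
print(seconds_for_1t / 86_400)              # ~92.6 days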

Field Metadata

| Field | Display Name | Unit |
|-------|--------------|------|
| tokens_per_step | Tokens per Step | tokens |
| tokens_per_second | Tokens per Second | tokens/s |
| train_step_time_mean | Training Step Time (Mean) | s |
| train_step_time_std | Training Step Time (Std Dev) | s |
| step_time_cv_percent | Step Time CV | % |
| time_to_1t_tokens_days | Time to 1T Tokens | days |
