Available Benchmarks

SiliconMark supports the following benchmarks for GPU performance testing:
  1. Quick Mark - A comprehensive single-node GPU performance test that always runs first
  2. Cluster Network benchmark - Multi-node network connectivity and bandwidth testing
  3. LLM Fine-Tuning benchmark - Single-node LLM fine-tuning performance benchmarks
Each benchmark section includes configuration options, execution actions, result structures, and field metadata for interpreting the performance metrics.

QuickMark Benchmark

Overview

| Field | Value |
| --- | --- |
| Benchmark ID | quick_mark |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Comprehensive GPU performance test |

Configuration

No configuration required - uses defaults.

Result Structure

{
  "test_results": [
    {
      "gpu_id": "GPU-000...",
      "bf16_tflops": 156.8,
      "fp16_tflops": 78.5,
      "fp32_tflops": 39.2,
      "fp32_cuda_core_tflops": 19.5,
      "mixed_precision_tflops": 312.4,
      "l2_bandwidth_gbs": 3890.5,
      "memory_bandwidth_gbs": 1935.4,
      "temperature_centigrade": 65,
      "power_consumption_watts": 350,
      "kernel_launch_overhead_us": 12.5,
      "device_to_host_bandwidth_gbs": 24.8,
      "host_to_device_bandwidth_gbs": 25.2
    },
    {
       "gpu_id": "GPU-111...",
       ...
    }
  ],
  "aggregate_results": {
    "total_fp16_tflops": 628.0,
    "total_fp32_tflops": 313.6,
    "avg_temperature": 67.2,
    "allreduce_bandwidth_gbs": 180.5,
    "broadcast_bandwidth_gbs": 175.3,
    "fp16_tflops_per_peak_watt": 1.79,
    "fp32_tflops_per_peak_watt": 0.89,
    "gpu_bandwidth_matrix": [
      {
        "source_gpu": 0,
        "target_gpu": 1,
        "duplex_gbs": 450.2,
        "simplex_gbs": 225.1,
        "connection_type": ...
      }
    ]
  }
}

Field Metadata

| Field | Display Name | Unit |
| --- | --- | --- |
| bf16_tflops | BF16 Performance | TFLOPS |
| fp16_tflops | FP16 Performance | TFLOPS |
| fp32_tflops | FP32 Performance | TFLOPS |
| fp32_cuda_core_tflops | FP32 CUDA Core Performance | TFLOPS |
| mixed_precision_tflops | Mixed Precision Performance | TFLOPS |
| l2_bandwidth_gbs | L2 Cache Bandwidth | GB/s |
| memory_bandwidth_gbs | Memory Bandwidth | GB/s |
| temperature_centigrade | GPU Temperature | °C |
| power_consumption_watts | Power Consumption | W |
| kernel_launch_overhead_us | Kernel Launch Overhead | μs |
| device_to_host_bandwidth_gbs | Device to Host Bandwidth | GB/s |
| host_to_device_bandwidth_gbs | Host to Device Bandwidth | GB/s |
| allreduce_bandwidth_gbs | AllReduce Bandwidth | GB/s |
| broadcast_bandwidth_gbs | Broadcast Bandwidth | GB/s |
| fp16_tflops_per_peak_watt | FP16 TFLOPS per Peak Watt | TFLOPS/W |
| fp32_tflops_per_peak_watt | FP32 TFLOPS per Peak Watt | TFLOPS/W |
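
As a quick illustration of working with this structure, the sketch below loads a saved QuickMark result and summarizes it per GPU. It is a minimal Python example under two assumptions: the result JSON has been written to a local file (the name quick_mark_results.json is hypothetical), and the aggregate totals and averages are plain sums and means of the per-GPU values.

```python
import json
from statistics import mean

# Hypothetical file name; adjust to wherever you store the benchmark output.
with open("quick_mark_results.json") as f:
    results = json.load(f)

# Per-GPU throughput and thermals from "test_results".
for gpu in results["test_results"]:
    print(f'{gpu["gpu_id"]}: '
          f'FP16 {gpu["fp16_tflops"]} TFLOPS, '
          f'memory bandwidth {gpu["memory_bandwidth_gbs"]} GB/s, '
          f'{gpu["temperature_centigrade"]} °C')

# Cross-check two aggregates, assuming they are simple sums/means of per-GPU values.
agg = results["aggregate_results"]
total_fp16 = sum(g["fp16_tflops"] for g in results["test_results"])
avg_temp = mean(g["temperature_centigrade"] for g in results["test_results"])
print(f'Total FP16: {total_fp16:.1f} TFLOPS (reported: {agg["total_fp16_tflops"]})')
print(f'Avg temperature: {avg_temp:.1f} °C (reported: {agg["avg_temperature"]})')
```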

Cluster Network Benchmark

Overview

| Field | Value |
| --- | --- |
| Benchmark ID | cluster_network |
| Type | Multi-node |
| Min Nodes | 2 |
| Description | Tests network connectivity and bandwidth between cluster nodes |

Result Structure (Cluster-level)

{
  "avg_bandwidth_gbps": 45.2,
  "avg_latency": 0.8,
  "min_bandwidth_gbps": 42.1,
  "total_links_tested": 12,
  "node_count": 4,
  "measurements": [
    {
      "host_ip": "192.168.1.10",
      "dest_ip": "192.168.1.11",
      "throughput_mbps": 45200,
      "throughput_gbps": 45.2,
      "latency_ms": 0.75
    }
  ]
}

Field Metadata

| Field | Display Name | Unit |
| --- | --- | --- |
| throughput_gbps | Throughput | Gbps |
| throughput_mbps | Throughput | Mbps |
| latency_ms | Latency | ms |
| avg_bandwidth_gbps | Average Bandwidth | Gbps |
| min_bandwidth_gbps | Minimum Bandwidth | Gbps |
| total_links_tested | Total Links Tested | count |
| node_count | Total Node Count | count |
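
For illustration, the sketch below shows one plausible relationship between the per-link measurements and the cluster-level summary fields. Treating avg_bandwidth_gbps, min_bandwidth_gbps, avg_latency, and total_links_tested as simple aggregates of the measurements list is an assumption, not documented behavior, and the measurement values here are made up.

```python
from statistics import mean

# Hypothetical per-link measurements in the shape shown above.
measurements = [
    {"host_ip": "192.168.1.10", "dest_ip": "192.168.1.11",
     "throughput_mbps": 45200, "throughput_gbps": 45.2, "latency_ms": 0.75},
    {"host_ip": "192.168.1.11", "dest_ip": "192.168.1.10",
     "throughput_mbps": 42100, "throughput_gbps": 42.1, "latency_ms": 0.85},
]

# Assumed aggregation: average/minimum bandwidth, average latency, link count.
summary = {
    "avg_bandwidth_gbps": mean(m["throughput_gbps"] for m in measurements),
    "min_bandwidth_gbps": min(m["throughput_gbps"] for m in measurements),
    "avg_latency": mean(m["latency_ms"] for m in measurements),
    "total_links_tested": len(measurements),
}
print(summary)
```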

Llama3 Fine-tuning Single Node

This benchmark measures LLM fine-tuning performance using NVIDIA’s NeMo framework with automatic memory-aware parallelism configuration.

Overview

| Field | Value |
| --- | --- |
| Benchmark ID | llama3_ft_single |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Llama3 model fine-tuning performance benchmark |

Configuration

| Field | Type | Required | Description | Default | Constraints |
| --- | --- | --- | --- | --- | --- |
| model_size | string | No | Model size | "8b" | "8b", "70b", "405b" |
| dtype | string | No | Data type | "fp16" | "fp8", "fp16", "bf16" |
| fine_tune_type | string | No | Fine-tuning method | "lora" | "lora", "sft" |

Fixed Parameters

  • Sequence Length: 4096 tokens
  • Micro Batch Size: 1 (optimized for packed sequences)
  • Max Steps: 50 (default, configurable)
  • Training Data: Synthetic (SquadDataModule)

Requirements

Software Requirements

  • NeMo Container: nvcr.io/nvidia/nemo:24.12 will be downloaded automatically.
  • HuggingFace Token: Required (set as HF_TOKEN environment variable). Get your token from https://huggingface.co/settings/tokens
  • Disk Space Requirements:
    • 8B model: ~75GB (55GB base + 20GB model)
    • 70B model: ~205GB (55GB base + 150GB model)
    • 405B model: ~905GB (55GB base + 850GB model)
Set environment variables:
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"  # Required
export STAGE_PATH="$HOME/workspace/benchmark_stage"  # Optional

Hardware Requirements

The benchmark automatically calculates memory requirements based on model configuration:

| Model | Type | FP8 Memory | BF16 Memory | LoRA Reduction |
| --- | --- | --- | --- | --- |
| 8B | Full | 20GB | 35GB | ~30% |
| 70B | Full | 85GB | 160GB | ~30% |
| 405B | Full | 450GB | 850GB | ~30% |

Example configurations:
  • 8B LoRA FP8: ~14GB total (fits on single 40GB GPU)
  • 70B LoRA FP8: ~60GB total (needs TP=2 on 40GB GPUs or single 80GB GPU)
  • 405B LoRA FP8: ~315GB total (needs TP=8 on 80GB GPUs)
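
These figures follow directly from the table above with the ~30% LoRA reduction applied. The helper below is a hypothetical illustration of that arithmetic, not part of SiliconMark:

```python
# Approximate full fine-tuning memory footprints (GB), taken from the table above.
FULL_FT_MEMORY_GB = {
    ("8b", "fp8"): 20, ("8b", "bf16"): 35,
    ("70b", "fp8"): 85, ("70b", "bf16"): 160,
    ("405b", "fp8"): 450, ("405b", "bf16"): 850,
}
LORA_REDUCTION = 0.30  # "~30%" reduction when fine_tune_type is "lora"

def estimate_memory_gb(model_size: str, dtype: str, fine_tune_type: str) -> float:
    """Rough memory estimate; hypothetical helper, not the benchmark's own code."""
    mem = FULL_FT_MEMORY_GB[(model_size, dtype)]
    if fine_tune_type == "lora":
        mem *= 1 - LORA_REDUCTION
    return mem

print(estimate_memory_gb("8b", "fp8", "lora"))    # 14.0  -> "~14GB total"
print(estimate_memory_gb("70b", "fp8", "lora"))   # 59.5  -> "~60GB total"
print(estimate_memory_gb("405b", "fp8", "lora"))  # 315.0 -> "~315GB total"
```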

Parallelism Strategy

The benchmark automatically calculates optimal parallelism using a memory-aware strategy.

Auto-Parallelism Algorithm

  1. Calculate Total Memory: Based on model size, dtype, and fine-tuning type
  2. Determine Tensor Parallelism (TP):
    min_TP = ceil(total_memory / gpu_memory)
    TP = nearest_power_of_2 ≥ min_TP
    
  3. Calculate Data Parallelism (DP):
    DP = total_gpus / TP
    
  4. Set Global Batch Size (GBS):
    GBS = DP * 2 (capped by model-specific limits)
    - 8B: max 64
    - 70B: max 32
    - 405B: max 16
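
The steps above can be sketched as follows (hypothetical Python, not the benchmark's actual implementation):

```python
import math

def auto_parallelism(total_memory_gb, gpu_memory_gb, total_gpus, model_size):
    """Memory-aware TP/DP/GBS calculation as described in steps 1-4 above."""
    # Step 2: smallest power of 2 that is >= ceil(total_memory / gpu_memory).
    min_tp = math.ceil(total_memory_gb / gpu_memory_gb)
    tp = 1
    while tp < min_tp:
        tp *= 2
    # Step 3: remaining GPUs go to data parallelism.
    dp = total_gpus // tp
    # Step 4: global batch size scales with DP, capped per model size.
    gbs_cap = {"8b": 64, "70b": 32, "405b": 16}[model_size]
    gbs = min(dp * 2, gbs_cap)
    return tp, dp, gbs

# 8B LoRA FP8 (~14 GB) on 8x80GB GPUs -> (1, 8, 16), matching the table below.
print(auto_parallelism(14, 80, 8, "8b"))
# 70B full fine-tune at 160 GB on 8x80GB GPUs -> (2, 4, 8).
print(auto_parallelism(160, 80, 8, "70b"))
```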
    

Parallelism Components

  1. Tensor Parallelism (TP)
    • Splits model layers across GPUs when model exceeds single GPU memory
    • Automatically calculated as smallest power of 2 that fits
    • Example: 70B model with FP16 (160GB) on 80GB GPUs requires TP=2
  2. Data Parallelism (DP)
    • Uses remaining GPUs for parallel batch processing: DP = gpu_count / TP
    • Each DP replica processes different samples simultaneously
    • Example: 8 GPUs with TP=2 gives DP=4
  3. Global Batch Size (GBS)
    • Total samples per training iteration
    • Auto-scaled based on DP with model-specific caps
    • Formula: GBS = min(DP * 2, model_cap)

Example Configurations

| GPUs | Model | Data Type | TP | DP | GBS |
| --- | --- | --- | --- | --- | --- |
| 8×A100-80GB | 8B | FP8 + LoRA | 1 | 8 | 16 |
| 8×H100-80GB | 70B | FP8 + LoRA | 1 | 8 | 16 |

Result Structure

{
  "tokens_per_step": 16384,
  "tokens_per_second": 125000,
  "train_step_time_mean": 0.328,
  "train_step_time_std": 0.012,
  "step_time_cv_percent": 3.66,
  "time_to_1t_tokens_days": 92.6
}

Metrics Calculation

  • Tokens per Step = global_batch_size × 4096
  • Tokens per Second = tokens_per_step ÷ train_step_time_mean
  • Time to 1T Tokens = 10^12 ÷ (tokens_per_second × 86400) days
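
For example, plugging hypothetical values (global batch size 16, mean step time 0.5 s) into these formulas:

```python
# Hypothetical inputs; sequence length is the fixed 4096 tokens noted above.
global_batch_size = 16
seq_len = 4096
train_step_time_mean = 0.5  # seconds

tokens_per_step = global_batch_size * seq_len                  # 65,536 tokens
tokens_per_second = tokens_per_step / train_step_time_mean     # 131,072 tokens/s
time_to_1t_tokens_days = 1e12 / (tokens_per_second * 86_400)   # ~88.3 days

print(tokens_per_step, tokens_per_second, round(time_to_1t_tokens_days, 1))
```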

Field Metadata

| Field | Display Name | Unit |
| --- | --- | --- |
| tokens_per_step | Tokens per Step | tokens |
| tokens_per_second | Tokens per Second | tokens/s |
| train_step_time_mean | Training Step Time (Mean) | s |
| train_step_time_std | Training Step Time (Std Dev) | s |
| step_time_cv_percent | Step Time CV | % |
| time_to_1t_tokens_days | Time to 1T Tokens | days |

Configuration Example

{
  "model_size": "8b",
  "dtype": "fp8",
  "fine_tune_type": "lora",
}