Available Benchmarks
SiliconMark supports several benchmarks for GPU performance testing:
- QuickMark - A comprehensive single-node GPU performance test that always runs first
- Cluster Network benchmark - Multi-node network connectivity and bandwidth testing
- LLM Fine-Tuning benchmark - Single-node LLM fine-tuning performance benchmarks
QuickMark Benchmark
Overview
Field | Value |
---|---|
Benchmark ID | quick_mark |
Type | Single-node |
Min Nodes | 1 |
Description | Comprehensive GPU performance test |
Configuration
No configuration required - uses defaults.
Result Structure
Field Metadata
Field | Display Name | Unit |
---|---|---|
bf16_tflops | BF16 Performance | TFLOPS |
fp16_tflops | FP16 Performance | TFLOPS |
fp32_tflops | FP32 Performance | TFLOPS |
fp32_cuda_core_tflops | FP32 CUDA Core Performance | TFLOPS |
mixed_precision_tflops | Mixed Precision Performance | TFLOPS |
l2_bandwidth_gbs | L2 Cache Bandwidth | GB/s |
memory_bandwidth_gbs | Memory Bandwidth | GB/s |
temperature_centigrade | GPU Temperature | °C |
power_consumption_watts | Power Consumption | W |
kernel_launch_overhead_us | Kernel Launch Overhead | μs |
device_to_host_bandwidth_gbs | Device to Host Bandwidth | GB/s |
host_to_device_bandwidth_gbs | Host to Device Bandwidth | GB/s |
allreduce_bandwidth_gbs | AllReduce Bandwidth | GB/s |
broadcast_bandwidth_gbs | Broadcast Bandwidth | GB/s |
fp16_tflops_per_peak_watt | FP16 TFLOPS per Peak Watt | TFLOPS/W |
fp32_tflops_per_peak_watt | FP32 TFLOPS per Peak Watt | TFLOPS/W |
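As an illustrative sketch of consuming a QuickMark result: the JSON-style envelope and the numeric values below are assumptions, while the field names, display names, and units come from the field metadata table above.

```python
# Sketch of rendering a QuickMark result for display. The result dict shape
# and values are illustrative assumptions; only the field names and units
# come from the documented field metadata.
quickmark_result = {
    "bf16_tflops": 312.0,
    "fp16_tflops": 310.5,
    "fp32_tflops": 19.5,
    "memory_bandwidth_gbs": 1935.0,
    "power_consumption_watts": 400.0,
    "temperature_centigrade": 65.0,
}

# Field name -> (display name, unit), taken from the table above.
FIELD_META = {
    "bf16_tflops": ("BF16 Performance", "TFLOPS"),
    "fp16_tflops": ("FP16 Performance", "TFLOPS"),
    "fp32_tflops": ("FP32 Performance", "TFLOPS"),
    "memory_bandwidth_gbs": ("Memory Bandwidth", "GB/s"),
    "power_consumption_watts": ("Power Consumption", "W"),
    "temperature_centigrade": ("GPU Temperature", "degC"),
}

def format_result(result: dict) -> list[str]:
    """Render each field as 'Display Name: value unit'."""
    lines = []
    for field, value in result.items():
        display, unit = FIELD_META.get(field, (field, ""))
        lines.append(f"{display}: {value} {unit}".rstrip())
    return lines

for line in format_result(quickmark_result):
    print(line)
```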
Cluster Network Benchmark
Overview
Field | Value |
---|---|
Benchmark ID | cluster_network |
Type | Multi-node |
Min Nodes | 2 |
Description | Tests network connectivity and bandwidth between cluster nodes |
Result Structure (Cluster-level)
Field Metadata
Field | Display Name | Unit |
---|---|---|
throughput_gbps | Throughput | Gbps |
throughput_mbps | Throughput | Mbps |
latency_ms | Latency | ms |
avg_bandwidth_gbps | Average Bandwidth | Gbps |
min_bandwidth_gbps | Minimum Bandwidth | Gbps |
total_links_tested | Total Links Tested | |
node_count | Total Node Count | |
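As a hedged sketch of how the cluster-level summary fields relate to per-link measurements: the per-link input shape and the `summarize_links` helper are assumptions for illustration, while the output field names match the table above.

```python
# Illustrative derivation of the cluster-level summary fields from per-link
# bandwidth measurements. The input format and this helper are assumptions;
# the output keys match the documented field metadata.
def summarize_links(link_bandwidths_gbps: list[float], node_count: int) -> dict:
    return {
        "avg_bandwidth_gbps": sum(link_bandwidths_gbps) / len(link_bandwidths_gbps),
        "min_bandwidth_gbps": min(link_bandwidths_gbps),
        "total_links_tested": len(link_bandwidths_gbps),
        "node_count": node_count,
    }

# Example: 4 nodes with all pairwise links tested (4 * 3 / 2 = 6 links).
summary = summarize_links([98.2, 97.5, 99.1, 96.8, 98.9, 97.0], node_count=4)
print(summary)
```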
Llama3 Fine-tuning Single Node
This benchmark requires a Hugging Face token and uses the NVIDIA DGX benchmark methodology with the NeMo container. Set the environment variable HF_TOKEN
to your Hugging Face token before running the benchmark.
Overview
Field | Value |
---|---|
Benchmark ID | llama3_ft_single |
Type | Single-node |
Min Nodes | 1 |
Description | Llama3 model fine-tuning performance benchmark |
Configuration
Field | Type | Required | Description | Default | Constraints |
---|---|---|---|---|---|
model_size | string | No | Model size | "8b" | "8b", "70b", "405b" |
dtype | string | No | Data type | "fp16" | "fp8", "fp16", "bf16" |
fine_tune_type | string | No | Fine-tuning method | "lora" | "lora", "sft" |
global_batch_size | integer | No | Global batch size | 8 | [ 1 .. 128 ] |
max_steps | integer | No | Maximum steps | 50 | [ 1 .. 100 ] |
Configuration Example
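A possible configuration payload is sketched below. The wrapper structure is an assumption; the field names and allowed values come from the table above.

```json
{
  "model_size": "8b",
  "dtype": "bf16",
  "fine_tune_type": "lora",
  "global_batch_size": 8,
  "max_steps": 50
}
```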
Configuration Constraints
- For 405b model: maximum batch size is 32
- Batch size must be between 1 and 128
- Max steps must be between 1 and 100
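The constraints above can be checked before submitting a configuration. The validator below is an illustrative sketch, not part of SiliconMark; the rules it enforces are exactly those documented.

```python
# Hypothetical client-side validator for llama3_ft_single configuration.
# The function is illustrative; the constraints are those documented above.
def validate_config(model_size: str = "8b",
                    global_batch_size: int = 8,
                    max_steps: int = 50) -> bool:
    if not 1 <= global_batch_size <= 128:
        raise ValueError("global_batch_size must be between 1 and 128")
    if model_size == "405b" and global_batch_size > 32:
        raise ValueError("for the 405b model, maximum batch size is 32")
    if not 1 <= max_steps <= 100:
        raise ValueError("max_steps must be between 1 and 100")
    return True

# Defaults satisfy every constraint.
assert validate_config()
```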
Result Structure
Field Metadata
Field | Display Name | Unit |
---|---|---|
tokens_per_step | Tokens per Step | tokens |
tokens_per_second | Tokens per Second | tokens/s |
train_step_time_mean | Training Step Time (Mean) | s |
train_step_time_std | Training Step Time (Std Dev) | s |
step_time_cv_percent | Step Time CV% | % |
time_to_1t_tokens_days | Time to 1T Tokens | days |
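The time_to_1t_tokens_days field follows directly from the measured throughput. The function below sketches that arithmetic; the benchmark's actual implementation is an assumption.

```python
# Days to process one trillion tokens at a sustained measured throughput.
# This reproduces the arithmetic behind time_to_1t_tokens_days; how the
# benchmark computes it internally is an assumption.
def time_to_1t_tokens_days(tokens_per_second: float) -> float:
    one_trillion_tokens = 1e12
    seconds_per_day = 86_400
    return one_trillion_tokens / tokens_per_second / seconds_per_day

# Example: a sustained throughput of ~12,000 tokens/s.
print(round(time_to_1t_tokens_days(12_000), 1))
```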