Available Benchmarks
SiliconMark supports the following benchmarks for GPU performance testing:
- QuickMark benchmark - A comprehensive single-node GPU performance test that always runs first
- Cluster Network benchmark - Multi-node network connectivity and bandwidth testing
- LLM Fine-Tuning benchmark - Single-node LLM fine-tuning performance benchmarks
QuickMark Benchmark
Overview
| Field | Value |
|---|---|
| Benchmark ID | quick_mark |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Comprehensive GPU performance test |
Configuration
No configuration required - uses defaults.
Result Structure
Field Metadata
| Field | Display Name | Unit |
|---|---|---|
| bf16_tflops | BF16 Performance | TFLOPS |
| fp16_tflops | FP16 Performance | TFLOPS |
| fp32_tflops | FP32 Performance | TFLOPS |
| fp32_cuda_core_tflops | FP32 CUDA Core Performance | TFLOPS |
| mixed_precision_tflops | Mixed Precision Performance | TFLOPS |
| l2_bandwidth_gbs | L2 Cache Bandwidth | GB/s |
| memory_bandwidth_gbs | Memory Bandwidth | GB/s |
| temperature_centigrade | GPU Temperature | °C |
| power_consumption_watts | Power Consumption | W |
| kernel_launch_overhead_us | Kernel Launch Overhead | μs |
| device_to_host_bandwidth_gbs | Device to Host Bandwidth | GB/s |
| host_to_device_bandwidth_gbs | Host to Device Bandwidth | GB/s |
| allreduce_bandwidth_gbs | AllReduce Bandwidth | GB/s |
| broadcast_bandwidth_gbs | Broadcast Bandwidth | GB/s |
| fp16_tflops_per_peak_watt | FP16 TFLOPS per Peak Watt | TFLOPS/W |
| fp32_tflops_per_peak_watt | FP32 TFLOPS per Peak Watt | TFLOPS/W |
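For illustration, here is a minimal sketch of how the field metadata above might be used to label raw QuickMark results. The field names, display names, and units come from the table; the dictionary and helper function are hypothetical and not part of SiliconMark.

```python
# Hypothetical helper that labels raw QuickMark result fields using the
# metadata from the table above (a representative subset is shown).
FIELD_METADATA = {
    "bf16_tflops": ("BF16 Performance", "TFLOPS"),
    "fp16_tflops": ("FP16 Performance", "TFLOPS"),
    "fp32_tflops": ("FP32 Performance", "TFLOPS"),
    "memory_bandwidth_gbs": ("Memory Bandwidth", "GB/s"),
    "temperature_centigrade": ("GPU Temperature", "°C"),
    "power_consumption_watts": ("Power Consumption", "W"),
    "kernel_launch_overhead_us": ("Kernel Launch Overhead", "μs"),
    "allreduce_bandwidth_gbs": ("AllReduce Bandwidth", "GB/s"),
}

def format_results(results: dict) -> None:
    """Print each result field with its display name and unit."""
    for field, value in results.items():
        display_name, unit = FIELD_METADATA.get(field, (field, ""))
        print(f"{display_name}: {value} {unit}".strip())
```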
Cluster Network Benchmark
Overview
| Field | Value |
|---|---|
| Benchmark ID | cluster_network |
| Type | Multi-node |
| Min Nodes | 2 |
| Description | Tests network connectivity and bandwidth between cluster nodes |
Result Structure (Cluster-level)
Field Metadata
| Field | Display Name | Unit |
|---|---|---|
| throughput_gbps | Throughput | Gbps |
| throughput_mbps | Throughput | Mbps |
| latency_ms | Latency | ms |
| avg_bandwidth_gbps | Average Bandwidth | Gbps |
| min_bandwidth_gbps | Minimum Bandwidth | Gbps |
| total_links_tested | Total Links Tested | |
| node_count | Total Node Count | |
Llama3 Fine-tuning Single Node
This benchmark measures LLM fine-tuning performance using NVIDIA’s NeMo framework with automatic memory-aware parallelism configuration.
Overview
| Field | Value |
|---|---|
| Benchmark ID | llama3_ft_single |
| Type | Single-node |
| Min Nodes | 1 |
| Description | Llama3 model fine-tuning performance benchmark |
Configuration
| Field | Type | Required | Description | Default | Constraints |
|---|---|---|---|---|---|
| model_size | string | No | Model size | "8b" | "8b", "70b", "405b" |
| dtype | string | No | Data type | "fp16" | "fp8", "fp16", "bf16" |
| fine_tune_type | string | No | Fine-tuning method | "lora" | "lora", "sft" |
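For example, a request to fine-tune the 70B model in FP8 with LoRA might be configured as follows. Only the field names, defaults, and allowed values come from the table; the dictionary shape is an assumption.

```python
# Hypothetical llama3_ft_single configuration; omitted fields fall back to
# their defaults ("8b", "fp16", "lora").
config = {
    "model_size": "70b",       # "8b", "70b", or "405b"
    "dtype": "fp8",            # "fp8", "fp16", or "bf16"
    "fine_tune_type": "lora",  # "lora" or "sft"
}
```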
Fixed Parameters
- Sequence Length: 4096 tokens
- Micro Batch Size: 1 (optimized for packed sequences)
- Max Steps: 50 (default, configurable)
- Training Data: Synthetic (SquadDataModule)
Requirements
Software Requirements
- NeMo Container: nvcr.io/nvidia/nemo:24.12 will be downloaded automatically.
- HuggingFace Token: Required (set as the HF_TOKEN environment variable). Get your token from https://huggingface.co/settings/tokens
- Disk Space Requirements:
  - 8B model: ~75GB (55GB base + 20GB model)
  - 70B model: ~205GB (55GB base + 150GB model)
  - 405B model: ~905GB (55GB base + 850GB model)
Hardware Requirements
The benchmark automatically calculates memory requirements based on model configuration:
| Model | Type | FP8 Memory | BF16 Memory | LoRA Reduction |
|---|---|---|---|---|
| 8B | Full | 20GB | 35GB | ~30% |
| 70B | Full | 85GB | 160GB | ~30% |
| 405B | Full | 450GB | 850GB | ~30% |
- 8B LoRA FP8: ~14GB total (fits on single 40GB GPU)
- 70B LoRA FP8: ~60GB total (needs TP=2 on 40GB GPUs or single 80GB GPU)
- 405B LoRA FP8: ~315GB total (needs TP=8 on 80GB GPUs)
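The LoRA totals above follow directly from the table: take the full fine-tuning memory and apply the ~30% LoRA reduction. A minimal sketch, assuming a flat 0.7 factor (FP16 memory is not listed in the table; the tensor-parallelism example below treats it like BF16):

```python
# Memory estimate based on the table above; the exact values SiliconMark uses
# internally may differ. The 0.7 factor reflects the "~30%" LoRA reduction.
FULL_FT_MEMORY_GB = {
    "8b":   {"fp8": 20,  "bf16": 35},
    "70b":  {"fp8": 85,  "bf16": 160},
    "405b": {"fp8": 450, "bf16": 850},
}

def estimate_memory_gb(model_size: str, dtype: str, fine_tune_type: str) -> float:
    """Estimate total GPU memory (GB) needed for a configuration."""
    base = FULL_FT_MEMORY_GB[model_size][dtype]
    return base * 0.7 if fine_tune_type == "lora" else float(base)

# Matches the bullets above:
#   estimate_memory_gb("8b", "fp8", "lora")   -> 14.0  (~14GB)
#   estimate_memory_gb("70b", "fp8", "lora")  -> 59.5  (~60GB)
#   estimate_memory_gb("405b", "fp8", "lora") -> 315.0 (~315GB)
```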
Parallelism Strategy
The benchmark automatically calculates optimal parallelism using a memory-aware strategy.
Auto-Parallelism Algorithm
- Calculate Total Memory: Based on model size, dtype, and fine-tuning type
- Determine Tensor Parallelism (TP): Smallest power of 2 such that the per-GPU model shard fits in GPU memory
- Calculate Data Parallelism (DP): DP = gpu_count / TP, using the GPUs remaining after TP
- Set Global Batch Size (GBS): GBS = min(DP * 2, model_cap)
Parallelism Components
- Tensor Parallelism (TP)
  - Splits model layers across GPUs when the model exceeds single-GPU memory
  - Automatically calculated as the smallest power of 2 that fits
  - Example: 70B model with FP16 (160GB) on 80GB GPUs requires TP=2
- Data Parallelism (DP)
  - Uses remaining GPUs for parallel batch processing: DP = gpu_count / TP
  - Each DP replica processes different samples simultaneously
  - Example: 8 GPUs with TP=2 gives DP=4
- Global Batch Size (GBS)
  - Total samples per training iteration
  - Auto-scaled based on DP with model-specific caps
  - Formula: GBS = min(DP * 2, model_cap)
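Putting the three components together, here is a minimal sketch of the memory-aware calculation. The function signature and the cap value passed in are assumptions; only the TP/DP/GBS rules come from this section.

```python
# Sketch of the auto-parallelism rules described above. `model_cap` is the
# model-specific GBS cap, whose exact values are not documented here.
def auto_parallelism(total_memory_gb: float, gpu_memory_gb: float,
                     gpu_count: int, model_cap: int) -> tuple[int, int, int]:
    """Return (TP, DP, GBS) for a model needing `total_memory_gb` of memory."""
    tp = 1
    # TP: smallest power of 2 whose per-GPU shard fits in GPU memory.
    while total_memory_gb / tp > gpu_memory_gb and tp < gpu_count:
        tp *= 2
    dp = gpu_count // tp          # DP: remaining GPUs process batches in parallel
    gbs = min(dp * 2, model_cap)  # GBS: auto-scaled with a model-specific cap
    return tp, dp, gbs

# Matches the second row of the table below (70B FP8 + LoRA, ~60GB total):
#   auto_parallelism(60, 80, 8, model_cap=16) -> (1, 8, 16)
```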
Example Configurations
| GPUs | Model | Data Type | TP | DP | GBS |
|---|---|---|---|---|---|
| 8×A100-80GB | 8B | FP8 + LoRA | 1 | 8 | 16 |
| 8×H100-80GB | 70B | FP8 + LoRA | 1 | 8 | 16 |
Result Structure
Metrics Calculation
- Tokens per Step = global_batch_size × 4096
- Tokens per Second = tokens_per_step ÷ train_step_time_mean
- Time to 1T Tokens = 10^12 ÷ (tokens_per_second × 86400) days
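A short sketch of these formulas, using the 4096-token sequence length listed under Fixed Parameters; the function itself is illustrative.

```python
# Illustrative implementation of the metric formulas above.
SEQ_LEN = 4096           # fixed sequence length (tokens)
SECONDS_PER_DAY = 86_400

def derive_metrics(global_batch_size: int, train_step_time_mean: float) -> dict:
    """Derive throughput metrics from the measured mean step time (seconds)."""
    tokens_per_step = global_batch_size * SEQ_LEN
    tokens_per_second = tokens_per_step / train_step_time_mean
    time_to_1t_tokens_days = 1e12 / (tokens_per_second * SECONDS_PER_DAY)
    return {
        "tokens_per_step": tokens_per_step,
        "tokens_per_second": tokens_per_second,
        "time_to_1t_tokens_days": time_to_1t_tokens_days,
    }
```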
Field Metadata
| Field | Display Name | Unit |
|---|---|---|
| tokens_per_step | Tokens per Step | tokens |
| tokens_per_second | Tokens per Second | tokens/s |
| train_step_time_mean | Training Step Time (Mean) | s |
| train_step_time_std | Training Step Time (Std Dev) | s |
| step_time_cv_percent | Step Time CV% | % |
| time_to_1t_tokens_days | Time to 1T Tokens | days |