# SiliconMark™ Benchmarks

> Comprehensive guide to SiliconMark™ benchmarks for GPU performance testing.

## Available Benchmarks

SiliconMark provides five benchmarks for GPU performance testing:

1. **QuickMark** - Comprehensive single-node GPU compute, memory, and interconnect performance test
2. **Cluster Network** - Multi-node network connectivity and bandwidth testing
3. **Inference Benchmark** - Multi-engine LLM inference performance (NVIDIA and AMD, using vLLM)
4. **Llama 3 Inference** - Single-node LLM inference performance using NVIDIA NIM
5. **Llama 3 Fine-Tuning** - Single-node LLM fine-tuning performance using NVIDIA NeMo

Each benchmark section includes configuration options, execution actions, result structures, and field metadata for interpreting the performance metrics.

***

## QuickMark Benchmark

### Overview

| Field            | Value                                                                |
| ---------------- | -------------------------------------------------------------------- |
| **Benchmark ID** | `quick_mark`                                                         |
| **Type**         | Single-node                                                          |
| **Min Nodes**    | 1                                                                    |
| **Description**  | Comprehensive GPU compute, memory, and interconnect performance test |

### Configuration

No configuration is required; QuickMark runs with its default settings.

### Result Structure

Results include one entry per GPU in `test_results` and a combined `aggregate_results`. For single-GPU systems, `aggregate_results` mirrors the single GPU result.

```json theme={null}
{
  "test_results": [
    {
      "gpu_id": "GPU-000...",
      "gpu_model": "NVIDIA H100 80GB HBM3",
      "fp32_tflops": 367.5,
      "fp32_cuda_core_tflops": 53.6,
      "fp16_tflops": 684.6,
      "bf16_tflops": 729.6,
      "fp8_tflops": 1456.2,
      "mixed_precision_tflops": 648.9,
      "memory_bandwidth_gbs": 3025.0,
      "l2_bandwidth_gbs": 415.5,
      "host_to_device_bandwidth_gbs": 27.7,
      "device_to_host_bandwidth_gbs": 28.5,
      "kernel_launch_overhead_us": 7.6,
      "power_consumption_watts": 654.3,
      "temperature_centigrade": 59,
      "fp32_tflops_per_peak_watt": 0.564,
      "fp16_tflops_per_peak_watt": 1.082,
      "fp8_tflops_per_peak_watt": 2.163,
      "energy_consumption_wh": 45.2,
      "total_vram_mib": 81559.0,
      "gpu_clocks": {
        "compute": { "base_mhz": 1110, "max_mhz": 1980 },
        "memory":  { "base_mhz": 2619, "max_mhz": 2619 }
      },
      "sample_frequency_s": 5.0,
      "measurements_temp": [58, 59, 60, 59],
      "measurements_power_draw": [640.1, 654.3, 651.0, 648.2],
      "core_utilization_percent": [98.0, 99.0, 98.5, 99.0],
      "memory_utilization_percent": [95.0, 96.0, 95.5, 96.0],
      "core_clock_mhz": [1965, 1980, 1975, 1980],
      "memory_clock_mhz": [2619, 2619, 2619, 2619]
    }
  ],
  "aggregate_results": {
    "gpu_id": "aggregate",
    "fp32_tflops": 2863.0,
    "fp32_cuda_core_tflops": 424.2,
    "fp16_tflops": 5492.8,
    "bf16_tflops": 5755.8,
    "mixed_precision_tflops": 4222.2,
    "memory_bandwidth_gbs": 22717.3,
    "allreduce_bandwidth_gbs": 275.2,
    "broadcast_bandwidth_gbs": 392.0,
    "host_to_device_bandwidth_gbs": 110.6,
    "device_to_host_bandwidth_gbs": 112.9,
    "power_consumption_watts": 5078.5,
    "temperature_centigrade": 68,
    "fp32_tflops_per_peak_watt": 0.564,
    "fp16_tflops_per_peak_watt": 1.082,
    "gpu_bandwidth_matrix": {
      "gpu0_to_gpu1": {
        "connection_type": "nvlink_v4_18x",
        "simplex_gbs": 388.1,
        "duplex_gbs": 389.2
      },
      "gpu0_to_gpu2": {
        "connection_type": "nvlink_v4_18x",
        "simplex_gbs": 389.2,
        "duplex_gbs": 389.9
      }
    }
  },
  "timestamp": "YYYY-MM-DDT20:03:06Z"
}
```
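
The timeseries arrays can be lined up against elapsed time using `sample_frequency_s`. The following is a minimal sketch, assuming sample `i` is taken at `i × sample_frequency_s` seconds (the exact offset of the first sample is not stated here); `summarize_series` is a hypothetical helper name, not part of SiliconMark.

```python
from statistics import mean


def summarize_series(samples, sample_frequency_s):
    """Pair each sample with an elapsed time and report mean and peak.

    Assumption: sample i is taken at i * sample_frequency_s seconds.
    """
    timestamps = [i * sample_frequency_s for i in range(len(samples))]
    return {
        "points": list(zip(timestamps, samples)),
        "mean": mean(samples),
        "peak": max(samples),
    }


# Values copied from the example result above.
power = summarize_series([640.1, 654.3, 651.0, 648.2], sample_frequency_s=5.0)
print(power["peak"])  # 654.3
```

In the example, the peak of `measurements_power_draw` equals the `power_consumption_watts` value reported for that GPU.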

### Field Metadata

| Field                          | Display Name                  | Unit     | Notes                                         |
| ------------------------------ | ----------------------------- | -------- | --------------------------------------------- |
| `fp32_tflops`                  | FP32 Performance              | TFLOPS   | Tensor Core (TF32)                            |
| `fp32_cuda_core_tflops`        | FP32 CUDA Core Performance    | TFLOPS   | CUDA cores only                               |
| `fp16_tflops`                  | FP16 Performance              | TFLOPS   |                                               |
| `bf16_tflops`                  | BF16 Performance              | TFLOPS   |                                               |
| `fp8_tflops`                   | FP8 Performance               | TFLOPS   | Where supported                               |
| `mixed_precision_tflops`       | Mixed Precision Performance   | TFLOPS   | FP16 compute, FP32 accumulate                 |
| `memory_bandwidth_gbs`         | Memory Bandwidth              | GB/s     | HBM bandwidth                                 |
| `l2_bandwidth_gbs`             | L2 Cache Bandwidth            | GB/s     |                                               |
| `host_to_device_bandwidth_gbs` | Host to Device Bandwidth      | GB/s     | PCIe                                          |
| `device_to_host_bandwidth_gbs` | Device to Host Bandwidth      | GB/s     | PCIe                                          |
| `kernel_launch_overhead_us`    | Kernel Launch Overhead        | μs       |                                               |
| `allreduce_bandwidth_gbs`      | AllReduce Bandwidth           | GB/s     | Multi-GPU only                                |
| `broadcast_bandwidth_gbs`      | Broadcast Bandwidth           | GB/s     | Multi-GPU only                                |
| `gpu_bandwidth_matrix`         | GPU Bandwidth Matrix          | GB/s     | Per-pair simplex/duplex with connection type  |
| `fp32_tflops_per_peak_watt`    | FP32 TFLOPS per Peak Watt     | TFLOPS/W |                                               |
| `fp16_tflops_per_peak_watt`    | FP16 TFLOPS per Peak Watt     | TFLOPS/W |                                               |
| `fp8_tflops_per_peak_watt`     | FP8 TFLOPS per Peak Watt      | TFLOPS/W | Where supported                               |
| `energy_consumption_wh`        | Energy Consumption            | Wh       | Total for benchmark run                       |
| `power_consumption_watts`      | Power Consumption             | W        | Peak power draw                               |
| `temperature_centigrade`       | GPU Temperature               | °C       | Peak temperature                              |
| `total_vram_mib`               | Total VRAM                    | MiB      |                                               |
| `gpu_clocks`                   | GPU Clock Speeds              | MHz      | Compute and memory base/max/application/boost |
| `sample_frequency_s`           | Monitoring Sample Frequency   | s        |                                               |
| `measurements_temp`            | Temperature Timeseries        | °C       | One sample per interval                       |
| `measurements_power_draw`      | Power Draw Timeseries         | W        | One sample per interval                       |
| `core_utilization_percent`     | Core Utilization Timeseries   | %        | One sample per interval                       |
| `memory_utilization_percent`   | Memory Utilization Timeseries | %        | One sample per interval                       |
| `core_clock_mhz`               | Core Clock Timeseries         | MHz      | One sample per interval                       |
| `memory_clock_mhz`             | Memory Clock Timeseries       | MHz      | One sample per interval                       |
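
The per-watt fields appear to be throughput divided by peak power draw. That is an inference from the field names and the example values rather than a documented formula. A minimal check against the aggregate example, with `tflops_per_peak_watt` as a hypothetical helper:

```python
def tflops_per_peak_watt(tflops, peak_watts):
    """Throughput per watt of peak power draw (assumed definition)."""
    if peak_watts <= 0:
        raise ValueError("peak_watts must be positive")
    return tflops / peak_watts


# Aggregate values copied from the example result above.
aggregate = {
    "fp32_tflops": 2863.0,
    "fp16_tflops": 5492.8,
    "power_consumption_watts": 5078.5,
}

fp32_eff = tflops_per_peak_watt(
    aggregate["fp32_tflops"], aggregate["power_consumption_watts"]
)
fp16_eff = tflops_per_peak_watt(
    aggregate["fp16_tflops"], aggregate["power_consumption_watts"]
)
print(round(fp32_eff, 3), round(fp16_eff, 3))  # 0.564 1.082
```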

***

## Cluster Network Benchmark

### Overview

| Field            | Value                                                          |
| ---------------- | -------------------------------------------------------------- |
| **Benchmark ID** | `cluster_network`                                              |
| **Type**         | Multi-node                                                     |
| **Min Nodes**    | 2                                                              |
| **Description**  | Tests network throughput and latency between all cluster nodes |

### Result Structure

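`network_results` holds one measurement per directed node pair (`host_ip` to `dest_ip`), and `measurement_count` reports the number of links tested.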

```json theme={null}
{
  "network_results": [
    {
      "host_ip": "192.168.1.10",
      "dest_ip": "192.168.1.11",
      "throughput_mbps": 45200.0,
      "throughput_gbps": 45.2,
      "latency_ms": 0.75
    }
  ],
  "measurement_count": 12
}
```
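
Because each direction is measured separately, a cluster of N nodes in which every pair is tested would produce N × (N − 1) measurements, so a `measurement_count` of 12 would correspond to four nodes. A small sketch that enumerates the expected pairs; the IP addresses beyond the two in the example are made up for illustration, and `directed_pairs` is a hypothetical helper:

```python
from itertools import permutations


def directed_pairs(node_ips):
    """Return every ordered (host_ip, dest_ip) pair of distinct nodes."""
    return list(permutations(node_ips, 2))


# Hypothetical four-node cluster.
nodes = ["192.168.1.10", "192.168.1.11", "192.168.1.12", "192.168.1.13"]
pairs = directed_pairs(nodes)
print(len(pairs))  # 12
```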

### Field Metadata

| Field               | Display Name       | Unit |
| ------------------- | ------------------ | ---- |
| `throughput_gbps`   | Throughput         | Gbps |
| `throughput_mbps`   | Throughput         | Mbps |
| `latency_ms`        | Latency (RTT)      | ms   |
| `measurement_count` | Total Links Tested |      |

***

## Inference Benchmark — vLLM

This benchmark measures LLM inference serving performance using vLLM. It supports both NVIDIA (CUDA) and AMD (ROCm) GPUs and does not require an NGC API key.

### Overview

| Field            | Value                                                       |
| ---------------- | ----------------------------------------------------------- |
| **Benchmark ID** | `inference_benchmark`                                       |
| **Type**         | Single-node                                                 |
| **Min Nodes**    | 1                                                           |
| **GPU Support**  | NVIDIA and AMD                                              |
| **Description**  | Multi-engine LLM inference performance benchmark using vLLM |

### Configuration

| Field                | Type   | Required | Description                              | Default                 |
| -------------------- | ------ | -------- | ---------------------------------------- | ----------------------- |
| `inference_engine`   | string | No       | Inference engine                         | `"vllm"`                |
| `model`              | string | No       | Model name/path                          | `"openai/gpt-oss-120b"` |
| `tp`                 | int    | No       | Tensor parallel size                     | `1`                     |
| `concurrency`        | int    | Yes      | Concurrent requests                      | —                       |
| `isl`                | int    | Yes      | Input sequence length (1–131072)         | —                       |
| `osl`                | int    | Yes      | Output sequence length (1–131072)        | —                       |
| `random_range_ratio` | float  | No       | Prompt length variation                  | `0.0`                   |
| `num_prompts`        | int    | No       | Number of prompts (0 = concurrency × 10) | `0`                     |
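
The defaults and ranges in the table can be applied before a job is submitted. The following sketches that logic only and is not SiliconMark's own validation code; `resolve_inference_config` is a hypothetical helper name.

```python
DEFAULTS = {
    "inference_engine": "vllm",
    "model": "openai/gpt-oss-120b",
    "tp": 1,
    "random_range_ratio": 0.0,
    "num_prompts": 0,
}
REQUIRED = ("concurrency", "isl", "osl")
MAX_SEQ_LEN = 131072


def resolve_inference_config(user_config):
    """Fill in defaults, check ranges, and expand num_prompts == 0."""
    missing = [key for key in REQUIRED if key not in user_config]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    config = {**DEFAULTS, **user_config}
    for key in ("isl", "osl"):
        if not 1 <= config[key] <= MAX_SEQ_LEN:
            raise ValueError(f"{key} must be between 1 and {MAX_SEQ_LEN}")

    # Per the table, 0 means "concurrency x 10".
    if config["num_prompts"] == 0:
        config["num_prompts"] = config["concurrency"] * 10
    return config


cfg = resolve_inference_config({"concurrency": 32, "isl": 1024, "osl": 128})
print(cfg["num_prompts"])  # 320
```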

### Result Structure

```json theme={null}
{
  "duration": 120.5,
  "total_input_tokens": 512000,
  "total_output_tokens": 128000,
  "request_throughput": 42.3,
  "output_token_throughput": 5890.4,
  "total_token_throughput": 8750.2,
  "mean_ttft_ms": 145.2,
  "median_ttft_ms": 138.7,
  "p99_ttft_ms": 312.4,
  "mean_tpot_ms": 18.4,
  "median_tpot_ms": 17.9,
  "p99_tpot_ms": 28.6,
  "mean_itl_ms": 18.4,
  "median_itl_ms": 17.9,
  "p99_itl_ms": 28.6,
  "mean_e2el_ms": 2340.5,
  "median_e2el_ms": 2180.3,
  "p99_e2el_ms": 4120.8
}
```

### Field Metadata

| Field                     | Display Name                 | Unit   |
| ------------------------- | ---------------------------- | ------ |
| `request_throughput`      | Request Throughput           | req/s  |
| `output_token_throughput` | Output Token Throughput      | tok/s  |
| `total_token_throughput`  | Total Token Throughput       | tok/s  |
| `total_input_tokens`      | Total Input Tokens           | tokens |
| `total_output_tokens`     | Total Output Tokens          | tokens |
| `mean_ttft_ms`            | TTFT (Mean)                  | ms     |
| `median_ttft_ms`          | TTFT (Median)                | ms     |
| `p99_ttft_ms`             | TTFT (P99)                   | ms     |
| `mean_tpot_ms`            | TPOT (Mean)                  | ms     |
| `median_tpot_ms`          | TPOT (Median)                | ms     |
| `p99_tpot_ms`             | TPOT (P99)                   | ms     |
| `mean_itl_ms`             | Inter-Token Latency (Mean)   | ms     |
| `median_itl_ms`           | Inter-Token Latency (Median) | ms     |
| `p99_itl_ms`              | Inter-Token Latency (P99)    | ms     |
| `mean_e2el_ms`            | End-to-End Latency (Mean)    | ms     |
| `median_e2el_ms`          | End-to-End Latency (Median)  | ms     |
| `p99_e2el_ms`             | End-to-End Latency (P99)     | ms     |
| `duration`                | Benchmark Duration           | s      |
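
TTFT, TPOT, and end-to-end latency are related. TPOT is conventionally computed per request as `(latency − TTFT) ÷ (output_tokens − 1)`; assuming SiliconMark reports the same quantities, a request's end-to-end latency can be approximated from the other two. A rough sketch, with a hypothetical helper name and a hypothetical 120-token response:

```python
def estimate_e2e_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """Approximate end-to-end latency for one request.

    Assumes TPOT = (latency - TTFT) / (output_tokens - 1).
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_ms + tpot_ms * (output_tokens - 1)


# Mean TTFT and TPOT from the example result; 120 tokens is illustrative.
print(round(estimate_e2e_latency_ms(145.2, 18.4, 120), 1))  # 2334.8
```

Means and percentiles are aggregated across requests, so this relationship holds only approximately for the summary values.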

***

## Llama 3 Inference — NIM

This benchmark measures LLM inference serving performance using NVIDIA NIM containers, driven by GenAI-Perf.

### Overview

| Field            | Value                                                    |
| ---------------- | -------------------------------------------------------- |
| **Benchmark ID** | `llama3_inf_single`                                      |
| **Type**         | Single-node                                              |
| **Min Nodes**    | 1                                                        |
| **Description**  | Llama 3 inference performance benchmark using NVIDIA NIM |

### Requirements

* **NGC API Key**: Required (set as `NGC_API_KEY` environment variable)
* **Podman**: Required to run NIM and GenAI-Perf containers

### Configuration

| Field         | Type | Required | Description                       |
| ------------- | ---- | -------- | --------------------------------- |
| `concurrency` | int  | Yes      | Concurrent requests (1–10000)     |
| `isl`         | int  | Yes      | Input sequence length (1–131072)  |
| `osl`         | int  | Yes      | Output sequence length (1–131072) |

Multiple configurations can be submitted in a single job run.
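
A sweep of configurations can be generated programmatically. The sketch below only builds and range-checks the list of configuration objects; how that list is passed to a job is not covered here, and `build_sweep` is a hypothetical helper.

```python
from itertools import product

MAX_CONCURRENCY = 10000
MAX_SEQ_LEN = 131072


def build_sweep(concurrencies, isls, osls):
    """Return one config dict per (concurrency, isl, osl) combination."""
    configs = []
    for concurrency, isl, osl in product(concurrencies, isls, osls):
        if not 1 <= concurrency <= MAX_CONCURRENCY:
            raise ValueError(f"concurrency out of range: {concurrency}")
        if not (1 <= isl <= MAX_SEQ_LEN and 1 <= osl <= MAX_SEQ_LEN):
            raise ValueError(f"sequence length out of range: {isl}, {osl}")
        configs.append({"concurrency": concurrency, "isl": isl, "osl": osl})
    return configs


sweep = build_sweep([1, 8, 64], [256], [128, 512])
print(len(sweep))  # 6
```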

### Result Structure

Each configuration produces a `BenchmarkMetrics` object. Each metric field is an object containing a `unit` and summary statistics for that metric's distribution.

```json theme={null}
{
  "request_throughput":     { "unit": "req/s", "avg": 42.3, "p50": 41.8, "p90": 45.1, "p99": 47.2, "min": 38.0, "max": 48.5 },
  "request_latency":        { "unit": "ms",    "avg": 2340.5, "p50": 2180.3, "p90": 3800.1, "p99": 4120.8 },
  "time_to_first_token":    { "unit": "ms",    "avg": 145.2, "p50": 138.7, "p90": 290.4, "p99": 312.4 },
  "time_to_second_token":   { "unit": "ms",    "avg": 163.6, "p50": 156.8 },
  "inter_token_latency":    { "unit": "ms",    "avg": 18.4,  "p50": 17.9,  "p90": 25.1,  "p99": 28.6 },
  "output_token_throughput":           { "unit": "tok/s", "avg": 5890.4 },
  "output_token_throughput_per_request": { "unit": "tok/s", "avg": 139.1 },
  "output_sequence_length": { "unit": "tokens", "avg": 512.0 },
  "input_sequence_length":  { "unit": "tokens", "avg": 256.0 }
}
```

Each metric object may include: `avg`, `p25`, `p50`, `p75`, `p90`, `p95`, `p99`, `min`, `max`, `std`.
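
Since not every statistic is present on every metric, code that reads these objects should tolerate missing keys. A minimal sketch with a hypothetical `get_stat` helper:

```python
def get_stat(metrics, field, stat="avg", default=None):
    """Return one statistic from a metric object, or `default` if absent."""
    return metrics.get(field, {}).get(stat, default)


# Two metric objects copied from the example result above.
metrics = {
    "time_to_first_token": {"unit": "ms", "avg": 145.2, "p50": 138.7, "p90": 290.4, "p99": 312.4},
    "time_to_second_token": {"unit": "ms", "avg": 163.6, "p50": 156.8},
}

print(get_stat(metrics, "time_to_first_token", "p99"))   # 312.4
print(get_stat(metrics, "time_to_second_token", "p99"))  # None
```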

### Field Metadata

| Field                                 | Display Name                        | Unit   |
| ------------------------------------- | ----------------------------------- | ------ |
| `request_throughput`                  | Request Throughput                  | req/s  |
| `request_latency`                     | Request Latency                     | ms     |
| `time_to_first_token`                 | Time to First Token (TTFT)          | ms     |
| `time_to_second_token`                | Time to Second Token                | ms     |
| `inter_token_latency`                 | Inter-Token Latency (ITL)           | ms     |
| `output_token_throughput`             | Output Token Throughput             | tok/s  |
| `output_token_throughput_per_request` | Output Token Throughput per Request | tok/s  |
| `output_sequence_length`              | Output Sequence Length              | tokens |
| `input_sequence_length`               | Input Sequence Length               | tokens |

***

## Llama 3 Fine-Tuning — NeMo

This benchmark measures LLM fine-tuning performance using NVIDIA's NeMo framework with automatic memory-aware parallelism configuration.

### Overview

| Field            | Value                                                       |
| ---------------- | ----------------------------------------------------------- |
| **Benchmark ID** | `llama3_ft_single`                                          |
| **Type**         | Single-node                                                 |
| **Min Nodes**    | 1                                                           |
| **Description**  | Llama 3 fine-tuning performance benchmark using NVIDIA NeMo |

### Configuration

| Field            | Type   | Required | Description        | Default  | Constraints               |
| ---------------- | ------ | -------- | ------------------ | -------- | ------------------------- |
| `model_size`     | string | No       | Model size         | `"8b"`   | `"8b"`, `"70b"`, `"405b"` |
| `dtype`          | string | No       | Data type          | `"fp8"`  | `"fp8"`, `"bf16"`         |
| `fine_tune_type` | string | No       | Fine-tuning method | `"lora"` | `"lora"`, `"full"`        |
| `max_steps`      | int    | No       | Training steps     | `50`     |                           |

#### Fixed Parameters

* **Sequence Length**: 4096 tokens
* **Micro Batch Size**: 1 (optimized for packed sequences)
* **Training Data**: Synthetic (SquadDataModule)

### Requirements

#### Software Requirements

* **NeMo Container**: `nvcr.io/nvidia/nemo:25.11.01` — downloaded automatically if not present
* **HuggingFace Token**: Required (set as `HF_TOKEN` environment variable). Get your token from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
* **Docker**: Required to run the NeMo container
* **Disk Space**:
  * 8B model: \~75GB (55GB base + 20GB model)
  * 70B model: \~205GB (55GB base + 150GB model)
  * 405B model: \~905GB (55GB base + 850GB model)

```bash theme={null}
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"  # Required
export STAGE_PATH="$HOME/workspace/benchmark_stage"  # Optional, defaults to ~/workspace/benchmark_stage
```
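
A pre-flight check can compare free disk space against the approximate figures above. This is a sketch under stated assumptions: sizes are treated as decimal gigabytes, the check is made against the nearest existing parent of the stage path, and the helper names are hypothetical.

```python
import os
import shutil

BASE_GB = 55
MODEL_GB = {"8b": 20, "70b": 150, "405b": 850}


def required_disk_gb(model_size):
    """Approximate disk requirement: base footprint plus model weights."""
    return BASE_GB + MODEL_GB[model_size]


def nearest_existing(path):
    """Walk up from `path` until a directory that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return path


def has_enough_disk(model_size, path):
    """Compare free space at `path` (decimal GB) with the requirement."""
    free_gb = shutil.disk_usage(nearest_existing(path)).free / 1e9
    return free_gb >= required_disk_gb(model_size)


stage_path = os.environ.get(
    "STAGE_PATH", os.path.expanduser("~/workspace/benchmark_stage")
)
print(required_disk_gb("70b"), has_enough_disk("70b", stage_path))
```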

#### Hardware Requirements

The benchmark automatically calculates memory requirements based on model configuration:

| Model | FP8 Memory  | BF16 Memory | With LoRA (\~30% less), FP8 / BF16 |
| ----- | ----------- | ----------- | ---------------------------------- |
| 8B    | 20GB total  | 35GB total  | \~14GB / \~25GB                    |
| 70B   | 85GB total  | 160GB total | \~60GB / \~112GB                   |
| 405B  | 450GB total | 850GB total | \~315GB / N/A                      |

**Minimum GPU requirements:**

* 8B LoRA: 1× GPU with ≥16GB VRAM
* 8B full: 1× GPU with ≥24GB VRAM
* 70B: ≥2 GPUs
* 405B: ≥8 GPUs (FP8 + LoRA only)
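
The memory table can be expressed as a lookup with the roughly 30% LoRA reduction applied as a factor of 0.7. This is an illustrative sketch built only from the figures above, not the benchmark's actual estimator; `estimate_memory_gb` is a hypothetical name.

```python
# Full fine-tuning totals (GB) from the table above.
FULL_MEMORY_GB = {
    ("8b", "fp8"): 20, ("8b", "bf16"): 35,
    ("70b", "fp8"): 85, ("70b", "bf16"): 160,
    ("405b", "fp8"): 450, ("405b", "bf16"): 850,
}
LORA_FACTOR = 0.7  # "~30%" reduction; approximate by construction


def estimate_memory_gb(model_size, dtype="fp8", fine_tune_type="lora"):
    """Return the approximate total memory for a configuration, in GB."""
    if model_size == "405b" and (dtype, fine_tune_type) != ("fp8", "lora"):
        raise ValueError("405B is listed as FP8 + LoRA only")
    total = FULL_MEMORY_GB[(model_size, dtype)]
    return total * LORA_FACTOR if fine_tune_type == "lora" else float(total)


print(round(estimate_memory_gb("70b", "bf16", "lora"), 1))  # 112.0
```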

### Parallelism Strategy

The benchmark automatically calculates optimal parallelism using a memory-aware strategy:

1. **Tensor Parallelism (TP)** = smallest power of 2 such that `total_memory / TP ≤ gpu_memory`
2. **Data Parallelism (DP)** = `total_gpus / TP`
3. **Global Batch Size (GBS)** = `min(DP × 2, model_cap)` — caps: 8B→64, 70B→32, 405B→16
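
The three rules above can be sketched as follows. The function takes the memory estimate and per-GPU memory as plain inputs and does not model any headroom the benchmark may reserve, so borderline cases may come out differently from the benchmark's own choice. `plan_parallelism` is a hypothetical name.

```python
MODEL_GBS_CAP = {"8b": 64, "70b": 32, "405b": 16}


def plan_parallelism(model_size, total_memory_gb, gpu_memory_gb, total_gpus):
    """Apply the TP, DP, and GBS rules as stated in the list above."""
    # Smallest power of 2 such that total_memory / TP <= gpu_memory.
    tp = 1
    while total_memory_gb / tp > gpu_memory_gb:
        tp *= 2
    if tp > total_gpus or total_gpus % tp != 0:
        raise ValueError(f"TP={tp} does not fit on {total_gpus} GPUs")

    dp = total_gpus // tp
    gbs = min(dp * 2, MODEL_GBS_CAP[model_size])
    return {"tp": tp, "dp": dp, "gbs": gbs}


# 70B, FP8 + LoRA (~60GB) on 8x 80GB GPUs.
print(plan_parallelism("70b", total_memory_gb=60, gpu_memory_gb=80, total_gpus=8))
# {'tp': 1, 'dp': 8, 'gbs': 16}
```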

#### Example Configurations

| GPUs    | Model | dtype | Fine-tune | TP | DP | GBS |
| ------- | ----- | ----- | --------- | -- | -- | --- |
| 8× 80GB | 8B    | fp8   | lora      | 1  | 8  | 16  |
| 8× 80GB | 70B   | fp8   | lora      | 1  | 8  | 16  |
| 8× 80GB | 405B  | fp8   | lora      | 8  | 1  | 2   |

### Result Structure

```json theme={null}
{
  "tokens_per_step": 65536,
  "tokens_per_second": 48617.2,
  "train_step_time_mean": 1.348,
  "train_step_time_std": 0.003,
  "step_time_cv_percent": 0.223,
  "time_to_1t_tokens_days": 238.1,
  "peak_memory_gb": 68.4,
  "memory_efficiency_percent": 85.5
}
```

#### Metrics Calculation

* **Tokens per Step** = `global_batch_size × sequence_length`
* **Tokens per Second** = `tokens_per_step ÷ train_step_time_mean`
* **Time to 1T Tokens** = `10¹² ÷ (tokens_per_second × 86400)` days
* **Step Time CV** = `(train_step_time_std ÷ train_step_time_mean) × 100`
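
These formulas reproduce the example result when evaluated in code. The sketch below assumes a global batch size of 16, which is inferred from `tokens_per_step ÷ 4096` in the example rather than stated; `training_metrics` is a hypothetical helper.

```python
def training_metrics(global_batch_size, sequence_length,
                     step_time_mean_s, step_time_std_s):
    """Compute the derived metrics from the formulas listed above."""
    tokens_per_step = global_batch_size * sequence_length
    tokens_per_second = tokens_per_step / step_time_mean_s
    return {
        "tokens_per_step": tokens_per_step,
        "tokens_per_second": tokens_per_second,
        "time_to_1t_tokens_days": 1e12 / (tokens_per_second * 86400),
        "step_time_cv_percent": step_time_std_s / step_time_mean_s * 100,
    }


m = training_metrics(16, 4096, step_time_mean_s=1.348, step_time_std_s=0.003)
print(m["tokens_per_step"], round(m["tokens_per_second"], 1))  # 65536 48617.2
```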

### Field Metadata

| Field                       | Display Name                       | Unit   |
| --------------------------- | ---------------------------------- | ------ |
| `tokens_per_step`           | Tokens per Step                    | tokens |
| `tokens_per_second`         | Tokens per Second                  | tok/s  |
| `train_step_time_mean`      | Training Step Time (Mean)          | s      |
| `train_step_time_std`       | Training Step Time (Std Dev)       | s      |
| `step_time_cv_percent`      | Step Time Coefficient of Variation | %      |
| `time_to_1t_tokens_days`    | Time to 1T Tokens                  | days   |
| `peak_memory_gb`            | Peak GPU Memory Usage              | GB     |
| `memory_efficiency_percent` | GPU Memory Efficiency              | %      |
