GPU Utilization Reporting

Comprehensive guide for reporting and visualizing GPU utilization for jobs on the HPC system

This guide shows how to configure, collect, and visualize GPU utilization metrics for jobs running on your HPC cluster. Once enabled, GPU utilization data will be visible in job detail modals (running and completed jobs), the metrics dashboard GPU KPI card, the admin GPU Analysis tab, and node hover cards.

Feature Overview

The GPU Utilization plugin (NEXT_PUBLIC_ENABLE_GPU_UTILIZATION) adds the following capabilities:

  • Per-job GPU stats — utilization percentage and memory usage displayed in job modals
  • GPU efficiency badge — shown alongside CPU/Memory efficiency for completed jobs
  • Metrics dashboard KPI — average cluster GPU utilization card on the Job Metrics page
  • Admin GPU Analysis tab — sortable table of all GPU jobs with underutilization filtering
  • Historical persistence — completed job GPU metrics stored in the database for later review

Live vs Historical Metrics

By default, GPU utilization reporting provides live metrics only while jobs are actively running. To enable historical metrics for completed jobs (with efficiency grades and detailed analysis), you must also enable the Job Metrics Plugin.

Feature | Live Metrics | Historical Metrics
Real-time GPU utilization while jobs run | Yes | Yes
GPU stats in job detail modals | Yes | Yes
Node hover card display | Yes | Yes
Data persists after job completes | No | Yes
GPU efficiency grades | No | Yes
Admin GPU Analysis tab | No | Yes
Metrics page GPU KPI card | No | Yes
Historical data queries | No | Yes
Requires | DCGM Exporter + Prometheus | Job Metrics Plugin + Database

For a complete setup with historical metrics, follow the Historical GPU Metrics section in the Job Metrics configuration guide after completing the steps below.

Enable the GPU Utilization Plugin

Add the following to your .env file:

# Enable GPU utilization metrics
NEXT_PUBLIC_ENABLE_GPU_UTILIZATION="true"

# Required: Prometheus with DCGM exporter
PROMETHEUS_URL="http://your-prometheus-host:9090"

# Optional: scope GPU queries when one Prometheus scrapes multiple clusters
# Requires DCGM metrics/recording rules to include a matching cluster label
GPU_METRICS_CLUSTER="dev"

# Advanced: raw PromQL label matcher(s), overrides GPU_METRICS_CLUSTER
GPU_METRICS_LABEL_FILTER=""

# Advanced: capture-only label matcher(s) for POST /api/gpu
GPU_METRICS_CAPTURE_LABEL_FILTER=""

# Optional: Enable historical storage (requires Job Metrics plugin)
NEXT_PUBLIC_ENABLE_JOB_METRICS_PLUGIN="true"
SLURM_JOB_METRICS_DATABASE_URL="postgresql://user:password@localhost:5432/slurm_metrics"

Plugin Dependencies:

The GPU Utilization plugin requires Prometheus with DCGM exporter at minimum. For the Admin GPU Analysis tab and metrics KPI card, you also need the Job Metrics plugin enabled with a PostgreSQL database.
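
The precedence between the two label-filter variables can be pictured with a short shell sketch. It assumes that a non-empty GPU_METRICS_LABEL_FILTER is used verbatim and that GPU_METRICS_CLUSTER otherwise becomes a cluster="..." matcher. The function names are hypothetical, and the dashboard's actual query builder may differ.

```shell
# Sketch only: derive a PromQL label matcher from the dashboard's env variables.
# Assumption: GPU_METRICS_LABEL_FILTER (raw matcher) wins over GPU_METRICS_CLUSTER.
gpu_label_matcher() (
  if [ -n "${GPU_METRICS_LABEL_FILTER:-}" ]; then
    printf '%s' "$GPU_METRICS_LABEL_FILTER"
  elif [ -n "${GPU_METRICS_CLUSTER:-}" ]; then
    printf 'cluster="%s"' "$GPU_METRICS_CLUSTER"
  fi
)

# Attach the matcher (if any) to a metric name to form a selector.
gpu_selector() (
  metric="$1"
  matcher="$(gpu_label_matcher)"
  if [ -n "$matcher" ]; then
    printf '%s{%s}' "$metric" "$matcher"
  else
    printf '%s' "$metric"
  fi
)
```

With the .env values shown above, gpu_selector DCGM_FI_DEV_GPU_UTIL would print DCGM_FI_DEV_GPU_UTIL{cluster="dev"}.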

Prerequisites

  • Prometheus installed and enabled
  • NVIDIA DCGM Exporter
  • NVIDIA GPUs
  • Slurm workload manager

Configuration Steps

1. Enable HPC Job Details in DCGM-Exporter

First, update the DCGM Exporter service to include job mapping details:

# Edit the systemd service file
sudo vim /etc/systemd/system/dcgm-exporter.service

Use the following configuration:

[Unit]
Description=Nvidia Data Center GPU Manager Exporter
Wants=network-online.target
After=network-online.target

[Service]
Environment="DCGM_HPC_JOB_MAPPING_DIR=/var/run/dcgm_job_maps"
User=node_exporter
Group=node_exporter
Type=simple
ExecStartPre=/bin/bash -c 'mkdir -p "$DCGM_HPC_JOB_MAPPING_DIR" && chmod 775 "$DCGM_HPC_JOB_MAPPING_DIR"; for i in $(seq 0 $(( $(nvidia-smi -L | wc -l) - 1 ))); do FILE="$DCGM_HPC_JOB_MAPPING_DIR/$i"; touch "$FILE" && chmod 666 "$FILE"; [ -s "$FILE" ] || echo 0 > "$FILE"; done'
ExecStart=/usr/local/bin/dcgm_exporter -d f

[Install]
WantedBy=multi-user.target

Reload and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart dcgm-exporter

Verify Your Environment:

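Before proceeding, confirm that the job mapping directory exists, that the dcgm_exporter binary referenced in ExecStart is executable, and that the service user (node_exporter in the unit above) can write to the mapping directory.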
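
A small pre-flight check along these lines can catch the most common problems. This is a minimal sketch: check_dcgm_env is a hypothetical helper, and the paths you pass to it should match your own service file.

```shell
# Hypothetical helper: report problems with the mapping directory or the
# exporter binary. The function's exit status is non-zero if any check fails.
check_dcgm_env() (
  map_dir="$1"    # e.g. /var/run/dcgm_job_maps
  exporter="$2"   # e.g. /usr/local/bin/dcgm_exporter
  status=0

  if [ ! -d "$map_dir" ]; then
    echo "missing directory: $map_dir"
    status=1
  elif [ ! -w "$map_dir" ]; then
    echo "directory not writable by $(id -un): $map_dir"
    status=1
  fi

  if [ ! -x "$exporter" ]; then
    echo "missing or not executable: $exporter"
    status=1
  fi

  exit "$status"
)
```

For the configuration above, the call would be check_dcgm_env /var/run/dcgm_job_maps /usr/local/bin/dcgm_exporter. Running it as the service user also exercises the write-permission check.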

2. Create Required Directories and Scripts

Create the job mapping directory on each node:

sudo mkdir -p /var/run/dcgm_job_maps
sudo chmod 775 /var/run/dcgm_job_maps

Create Slurm epilog and prolog scripts to track GPU allocation:

Epilog Script (/usr/local/bin/_dev_epilog):

#!/bin/bash
# Epilog: reset the mapping file of every GPU the job used to "0" (no job).
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    # SLURM_JOB_GPUS is a comma-separated list of GPU indices, e.g. "0,1"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        echo "0" > "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi

Prolog Script (/usr/local/bin/_dev_prolog):

#!/bin/bash
# Prolog: write the job ID into the mapping file of every allocated GPU.
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    # SLURM_JOB_GPUS is a comma-separated list of GPU indices, e.g. "0,1"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        echo "$SLURM_JOB_ID" > "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi

Make both scripts executable:

sudo chmod +x /usr/local/bin/_dev_epilog
sudo chmod +x /usr/local/bin/_dev_prolog
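
To try the mapping logic without submitting a job, the same steps can be exercised against a scratch directory. The sketch below mirrors the prolog and epilog but takes the directory, job ID, and GPU list as arguments. The function names are hypothetical and are not part of Slurm or DCGM.

```shell
# Mirrors the prolog: write the job ID into each allocated GPU's mapping file.
map_job_to_gpus() (
  dir="$1"; job_id="$2"; gpus="$3"   # gpus is comma-separated, e.g. "0,1"
  for gpu_id in $(printf '%s' "$gpus" | tr ',' ' '); do
    printf '%s\n' "$job_id" > "$dir/$gpu_id"
  done
)

# Mirrors the epilog: reset each GPU's mapping file to 0 (no job).
clear_gpu_mapping() (
  dir="$1"; gpus="$2"
  for gpu_id in $(printf '%s' "$gpus" | tr ',' ' '); do
    printf '0\n' > "$dir/$gpu_id"
  done
)
```

For example, after map_job_to_gpus "$(mktemp -d)" 482 "0,1" the files named 0 and 1 in that directory each contain 482, and clear_gpu_mapping on the same directory sets them back to 0.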

3. Configure Slurm to Use the Scripts

Add the following to your slurm.conf:

TaskEpilog=/usr/local/bin/_dev_epilog
TaskProlog=/usr/local/bin/_dev_prolog

Restart the Slurm services (slurmctld on the controller, slurmd on each compute node):

sudo systemctl restart slurmctld
sudo systemctl restart slurmd

Verification

Once configured correctly, you should see the hpc_job label in your Prometheus metrics:

DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.154.05", Hostname="g002", UUID="GPU-4d5e36c6-c835-c5a2-65dc-833267ebf851", device="nvidia0", gpu="0", hpc_job="482", instance="192.168.1.2:9400", job="dcgm_exporter", modelName="NVIDIA GeForce GTX 2080ti", pci_bus_id="00000000:01:00.0"}
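
When checking a node directly, it can help to reduce the raw exporter output to the GPUs that currently carry a job ID. The following is a sketch that reads exposition-format lines on standard input; list_active_gpu_jobs is a hypothetical name, and it relies only on the label names visible in the sample above.

```shell
# Print "Hostname gpu hpc_job" for every DCGM_FI_DEV_GPU_UTIL series whose
# hpc_job label is set to something other than "0".
list_active_gpu_jobs() {
  awk '
    # Return the value of the given label in a metric line, or "" if absent.
    function label(line, name,    pat, val) {
      pat = name "=\"[^\"]*\""
      if (match(line, pat) == 0) return ""
      val = substr(line, RSTART, RLENGTH)
      sub(/^[^"]*"/, "", val)   # strip everything up to the opening quote
      sub(/"$/, "", val)        # strip the closing quote
      return val
    }
    /^DCGM_FI_DEV_GPU_UTIL[{]/ {
      job = label($0, "hpc_job")
      if (job != "" && job != "0")
        print label($0, "Hostname"), label($0, "gpu"), job
    }
  '
}
```

Assuming the exporter serves /metrics on port 9400 as in the sample instance label, a typical call would be curl -s http://g002:9400/metrics | list_active_gpu_jobs.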

Multi-Cluster Prometheus

If one Prometheus server scrapes multiple Slurm clusters, add a cluster label in the Prometheus scrape config and set GPU_METRICS_CLUSTER in the dashboard .env to the same value. This is usually better than adding cluster data in the Slurm prolog or epilog because it labels every DCGM series consistently at scrape time.

Example for a DCGM scrape job where development GPU nodes are named sdg###:

scrape_configs:
  - job_name: 'DCGM Exporter'
    consul_sd_configs:
      - server: 'localhost:8500'
        services:
          - 'dcgm_exporter'
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,_hostname=([^,]+),.*
        replacement: ${1}
        target_label: instance

      - source_labels: [instance]
        regex: 'sdg[0-9]+(\..*)?'
        target_label: cluster
        replacement: dev

Repeat the cluster relabel rule for each cluster or use Consul tags if they already contain a cluster identifier. After reloading Prometheus, verify the label:

count by (cluster, hpc_job, Hostname) (DCGM_FI_DEV_GPU_UTIL{hpc_job!="0", hpc_job!=""})
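
Before reloading Prometheus, the hostname pattern can be tried against a few instance names in the shell. This is a rough sketch: cluster_for_instance is a hypothetical helper, and it adds explicit ^( )$ anchors because Prometheus anchors relabel regexes at both ends while grep does not.

```shell
# Print "dev" if the instance name would match the relabel rule above,
# and print nothing otherwise (Prometheus leaves the label unset on no match).
cluster_for_instance() (
  instance="$1"
  if printf '%s\n' "$instance" | grep -Eq '^(sdg[0-9]+(\..*)?)$'; then
    echo "dev"
  fi
)
```

Under this pattern, sdg001 and sdg12.example.org map to dev, while g002 and xsdg1 do not match.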

Troubleshooting

Common Issues

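  1. Missing job labels: Confirm that the prolog and epilog scripts are executable, that slurm.conf references them, and that the files in /var/run/dcgm_job_maps contain a job ID while a GPU job is running

  2. Zero utilization reported: Check that Slurm actually allocated the GPU to the job and that the job is running work on it

  3. Service fails to start: Verify the paths in the service file and that the service user (node_exporter in the example) has permission to run the exporter and write to the mapping directory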

Debug Commands

Check if DCGM is running:

sudo systemctl status dcgm-exporter

View DCGM logs:

sudo journalctl -u dcgm-exporter

Verify job mapping files:

ls -la /var/run/dcgm_job_maps/
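
Since ls shows only names and sizes, a short loop that prints each file's content gives a clearer picture of which GPU is mapped to which job. A minimal sketch, with show_job_map as a hypothetical helper name:

```shell
# Print one line per mapping file: the GPU index and the job ID it holds.
show_job_map() (
  dir="${1:-/var/run/dcgm_job_maps}"
  for file in "$dir"/*; do
    [ -f "$file" ] || continue
    job_id="$(cat "$file")"
    if [ -z "$job_id" ] || [ "$job_id" = "0" ]; then
      printf 'GPU %s: idle\n' "${file##*/}"
    else
      printf 'GPU %s: job %s\n' "${file##*/}" "$job_id"
    fi
  done
)
```

Called without arguments, it reads the default mapping directory used throughout this guide.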

Prometheus Recording Rules

Prometheus recording rules precompute frequently needed or computationally expensive expressions and save their results as new time series. This is essential for GPU job metrics because:

  • Performance: Complex aggregations across many GPUs and jobs can be slow to compute on-the-fly
  • Historical Analysis: Recording rules create persistent time series that can be queried for historical trends
  • Dashboard Efficiency: Pre-aggregated metrics load faster in dashboards and reduce Prometheus query load

The dashboard uses a 3-tier query strategy when fetching GPU metrics for a job:

  1. Recording rules (fastest, preferred) — pre-computed aggregations
  2. Direct DCGM queries — fallback if recording rules aren't available
  3. Database lookup — for completed jobs when Prometheus data has expired
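
The fallback order can be sketched in shell as a loop that stops at the first tier returning a value. Everything here is illustrative: the query_* functions are hypothetical, the hpc_job label on the recording rule is an assumption, PROMETHEUS_URL and jq are assumed to be available, and the database tier is left as a placeholder. The dashboard's real implementation is not shown in this guide.

```shell
# Run an instant query and print the first sample value (nothing if no data).
prom_query() {
  curl -fsS --get "$PROMETHEUS_URL/api/v1/query" --data-urlencode "query=$1" |
    jq -r '.data.result[0].value[1] // empty'
}

# Tier 1: pre-computed recording rule (assumed to carry an hpc_job label).
query_recording_rules() { prom_query "job:gpu_utilization:current_avg{hpc_job=\"$1\"}"; }
# Tier 2: aggregate the raw DCGM series directly.
query_dcgm_direct() { prom_query "avg(DCGM_FI_DEV_GPU_UTIL{hpc_job=\"$1\"})"; }
# Tier 3: placeholder for a lookup in the Job Metrics database.
query_database() { return 1; }

# Print "<tier> <value>" for the first tier that yields a non-empty value.
fetch_gpu_utilization() (
  job_id="$1"
  for tier in recording_rules dcgm_direct database; do
    if value="$("query_$tier" "$job_id" 2>/dev/null)" && [ -n "$value" ]; then
      printf '%s %s\n' "$tier" "$value"
      exit 0
    fi
  done
  exit 1
)
```

Trying the cheapest source first keeps the common case fast, and the later tiers are only reached when the earlier ones return no data.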

Setting Up Recording Rules

A complete recording rules file is included in the repository at infra/gpu_utilization.yml. The rules preserve the optional cluster label on job-level series and produce per-cluster rollups when that label exists. Copy it to your Prometheus rules directory:

sudo cp infra/gpu_utilization.yml /etc/prometheus/rules/gpu_utilization.yml

The file defines the following recording rules, evaluated every 1 minute:

Per-job metrics:

Rule | Description
job:gpu_utilization:current_avg | Average GPU utilization per job
job:gpu_utilization:current_p95 | 95th percentile GPU utilization
job:gpu_memory:current_avg_pct | Average GPU memory utilization (%)
job:gpu_memory:current_max_pct | Max GPU memory utilization (%)
job:gpu_underutilized:bool | Jobs with avg utilization below 30%
job:gpu_count:current | Number of GPUs per job
job:gpu_memory:current_used_bytes | Total GPU memory used per job
job:gpu_memory:current_total_bytes | Total GPU memory per job

Time-windowed averages (for reports):

Rule | Description
job:gpu_utilization:1d_avg | 1-day rolling average utilization
job:gpu_utilization:7d_avg | 7-day rolling average utilization
job:gpu_utilization:30d_avg | 30-day rolling average utilization

Cluster-level rollups:

Rule | Description
cluster:gpu_utilization:current_avg | Cluster-wide average GPU utilization
cluster:underutilized_jobs:count | Total underutilized jobs
cluster:gpu_count:total | Total active GPUs across all jobs
cluster:gpu_utilization:current_p95 | Cluster-wide P95 utilization
cluster:gpu_memory:current_avg_pct | Cluster-wide average memory utilization

Enable the Recording Rules

Add the rules file to your Prometheus configuration (/etc/prometheus/prometheus.yml):

rule_files:
  - "rules/gpu_utilization.yml"

Reload Prometheus to apply the changes:

sudo systemctl reload prometheus

Recording Rule Naming Convention:

The rule names follow the Prometheus level:metric:operations convention for recording rules. The job: prefix marks per-job aggregations and the cluster: prefix marks cluster-wide rollups, which makes both easy to identify and query.

Verify Recording Rules

Check that the rules are loaded correctly:

curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="dcgm_job_metrics")'

Query one of the new metrics:

curl -s 'http://localhost:9090/api/v1/query?query=job:gpu_utilization:current_avg' | jq

GPU API Endpoints

The dashboard exposes GPU metrics through a consolidated API at /api/gpu:

GET /api/gpu?job_id=<id>

Returns GPU utilization data for a specific job. Uses the 3-tier fallback strategy (recording rules → direct DCGM → database).

Response:

{
  "status": 200,
  "data": {
    "jobId": "12345",
    "avgUtilization": 72.5,
    "memoryPct": 45.2,
    "gpuCount": 4,
    "isUnderutilized": false,
    "source": "prometheus"
  }
}
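
For scripting against this endpoint, the response can be reduced to a one-line summary with jq. The sketch below relies only on the fields shown in the sample response; the function names are hypothetical.

```shell
# Read a /api/gpu?job_id=<id> response on stdin and print a one-line summary.
summarize_gpu_response() {
  jq -r '.data | "job \(.jobId): \(.avgUtilization)% GPU, \(.memoryPct)% memory, \(.gpuCount) GPUs (source: \(.source))"'
}

# Exit status 0 when the response flags the job as underutilized.
is_underutilized() {
  jq -e '.data.isUnderutilized == true' >/dev/null
}
```

Piping the sample response above into summarize_gpu_response gives a line such as job 12345: 72.5% GPU, 45.2% memory, 4 GPUs (source: prometheus).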

GET /api/gpu

Returns cluster-wide GPU overview (requires Job Metrics plugin and database). Supports optional from and to date query parameters.

POST /api/gpu

Triggers a GPU metrics capture — queries Prometheus for active GPU jobs matching GPU_METRICS_CLUSTER, GPU_METRICS_LABEL_FILTER, or GPU_METRICS_CAPTURE_LABEL_FILTER when configured, then upserts metrics into the database. Rate-limited to 1 capture per 60 seconds.

If GPU_METRICS_CAPTURE_TOKEN is configured, this endpoint requires either an Authorization: Bearer ... header or an x-api-key header with the configured token.
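
A capture can be triggered from a scheduler with a small curl wrapper. This is a sketch under assumptions: trigger_gpu_capture is a hypothetical helper, the dashboard base URL is passed in by the caller, and the token is read from the same GPU_METRICS_CAPTURE_TOKEN variable in the caller's environment.

```shell
# POST to /api/gpu, sending the bearer token only when one is configured.
trigger_gpu_capture() (
  dashboard_url="$1"   # hypothetical example: http://localhost:3000
  if [ -n "${GPU_METRICS_CAPTURE_TOKEN:-}" ]; then
    curl -fsS -X POST \
      -H "Authorization: Bearer $GPU_METRICS_CAPTURE_TOKEN" \
      "$dashboard_url/api/gpu"
  else
    curl -fsS -X POST "$dashboard_url/api/gpu"
  fi
)
```

Because the endpoint accepts one capture per 60 seconds, scheduling this call less often than once a minute avoids hitting the rate limit.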

GET /api/gpu/report

Returns a full GPU utilization report with per-job breakdowns and system-level stats. Used by the admin GPU Analysis tab.

GET /api/gpu/node?name=<hostname>

Returns per-GPU utilization data for a specific node, used by node hover cards.

Where GPU Metrics Appear

Once the plugin is enabled, GPU utilization metrics appear in several places:

  1. Job Detail Modals — running jobs with GPUs show a utilization panel with avg utilization, memory usage, and per-GPU breakdown
  2. Historical Job Modals — completed jobs show a GPU Efficiency badge alongside CPU and Memory efficiency (requires database storage)
  3. Card Job Modals — expanded node card job views show GPU stats for GPU jobs
  4. Metrics Dashboard — an "Avg GPU Utilization" KPI card appears on the Job Metrics page when the plugin is enabled
  5. Admin Dashboard — a "GPU Analysis" tab provides a sortable table of all GPU jobs with filtering for underutilized jobs (requires both GPU Utilization and Job Metrics plugins)
  6. Admin Plugins Panel — shows the GPU Utilization plugin status

Next Steps

For live metrics only (current configuration):

  • GPU utilization will display in job modals while jobs are running
  • Metrics disappear after jobs complete

For historical metrics (recommended for complete analysis):