GPU Utilization Reporting

Comprehensive guide for reporting and visualizing GPU utilization for jobs on the HPC system

This guide shows how to configure, collect, and visualize GPU utilization metrics for jobs running on your HPC cluster. Once enabled, GPU utilization data will be visible in node hover cards on the dashboard and in the historical job data views.

Live vs Historical Metrics

By default, GPU utilization reporting provides live metrics only while jobs are actively running. To enable historical metrics for completed jobs (with efficiency grades and detailed analysis), you must also enable the Job Metrics Plugin.

Feature                                  | Live Metrics               | Historical Metrics
-----------------------------------------|----------------------------|-------------------------------
Real-time GPU utilization while jobs run | Yes                        | Yes
Node hover card display                  | Yes                        | Yes
Data persists after job completes        | No                         | Yes
GPU efficiency grades (A-E)              | No                         | Yes
Historical data queries                  | No                         | Yes
Requires                                 | DCGM Exporter + Prometheus | Job Metrics Plugin + Database

For a complete setup with historical metrics, follow the Historical GPU Metrics section in the Job Metrics guide after completing the steps below.

Prerequisites

  • Prometheus installed and enabled
  • NVIDIA DCGM Exporter
  • NVIDIA GPUs
  • Slurm workload manager

Configuration Steps

1. Enable HPC Job Details in DCGM-Exporter

First, update the DCGM Exporter service to include job mapping details:

# Edit the systemd service file
sudo vim /etc/systemd/system/dcgm-exporter.service

Use the following configuration:

[Unit]
Description=Nvidia Data Center GPU Manager Exporter
Wants=network-online.target
After=network-online.target

[Service]
Environment="DCGM_HPC_JOB_MAPPING_DIR=/var/run/dcgm_job_maps"
User=node_exporter
Group=node_exporter
Type=simple
ExecStartPre=/bin/bash -c 'mkdir -p "$DCGM_HPC_JOB_MAPPING_DIR" && chmod 775 "$DCGM_HPC_JOB_MAPPING_DIR"; for i in $(seq 0 $(( $(nvidia-smi -L | wc -l) - 1 ))); do FILE="$DCGM_HPC_JOB_MAPPING_DIR/$i"; touch "$FILE" && chmod 666 "$FILE"; [ -s "$FILE" ] || echo 0 > "$FILE"; done'
ExecStart=/usr/local/bin/dcgm_exporter -d f

[Install]
WantedBy=multi-user.target

Reload and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart dcgm-exporter

Verify Your Environment:

Ensure your directories and executables exist with proper permissions before proceeding.
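As a quick pre-flight check, the snippet below reports whether the mapping directory is in place. It falls back to the default path when DCGM_HPC_JOB_MAPPING_DIR is unset and only prints status rather than failing, so it is safe to run on any machine:

```shell
# Report whether the job mapping directory exists and is writable
MAPDIR="${DCGM_HPC_JOB_MAPPING_DIR:-/var/run/dcgm_job_maps}"
if [ -d "$MAPDIR" ]; then
    echo "mapping dir present: $MAPDIR"
    if [ -w "$MAPDIR" ]; then
        echo "mapping dir writable"
    else
        echo "mapping dir NOT writable by $(id -un)"
    fi
else
    echo "mapping dir missing: $MAPDIR (it is created in step 2 below)"
fi
```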

2. Create Required Directories and Scripts

Create the job mapping directory on each node:

sudo mkdir -p /var/run/dcgm_job_maps
sudo chmod 775 /var/run/dcgm_job_maps

Create Slurm epilog and prolog scripts to track GPU allocation:

Epilog Script (/usr/local/bin/_dev_epilog):

#!/bin/bash
# Runs when a job ends: reset each allocated GPU's mapping file to 0 ("no job")
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "0" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi

Prolog Script (/usr/local/bin/_dev_prolog):

#!/bin/bash
# Runs when a job starts: write the Slurm job ID into each allocated GPU's mapping file
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "$SLURM_JOB_ID" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi

Set proper permissions:

sudo chmod +x /usr/local/bin/_dev_epilog
sudo chmod +x /usr/local/bin/_dev_prolog
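The effect of the two scripts can be demonstrated without Slurm by pointing the same logic at a temporary directory. The job ID and GPU list below are made-up values for illustration:

```shell
# Simulate prolog (write job ID) then epilog (reset to 0) in a throwaway dir
JOB_MAPPING_DIR="$(mktemp -d)"
SLURM_JOB_ID="482"       # made-up job ID
SLURM_JOB_GPUS="0,2"     # made-up GPU allocation

# prolog behaviour: record the job ID for each allocated GPU
for GPU_ID in $(echo "$SLURM_JOB_GPUS" | tr ',' ' '); do
    echo "$SLURM_JOB_ID" > "$JOB_MAPPING_DIR/$GPU_ID"
done
echo "after prolog: GPU 0 -> $(cat "$JOB_MAPPING_DIR/0")"

# epilog behaviour: reset each mapping to 0 ("no job")
for GPU_ID in $(echo "$SLURM_JOB_GPUS" | tr ',' ' '); do
    echo "0" > "$JOB_MAPPING_DIR/$GPU_ID"
done
echo "after epilog: GPU 0 -> $(cat "$JOB_MAPPING_DIR/0")"
rm -rf "$JOB_MAPPING_DIR"
```

This is exactly the state DCGM Exporter reads: while the job runs, the file for each of its GPUs holds the job ID; afterwards it holds 0.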

3. Configure Slurm to Use the Scripts

Add the following to your slurm.conf:

TaskEpilog=/usr/local/bin/_dev_epilog
TaskProlog=/usr/local/bin/_dev_prolog

Restart Slurm services:

sudo systemctl restart slurmctld
sudo systemctl restart slurmd
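You can confirm that Slurm picked up the new settings by querying its running configuration with scontrol (guarded here so the snippet degrades gracefully on a machine without Slurm):

```shell
# Show the TaskProlog/TaskEpilog paths Slurm is actually using
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | grep -Ei 'task(prolog|epilog)'
else
    echo "scontrol not found; cannot verify TaskProlog/TaskEpilog here"
fi
```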

Verification

Once configured correctly, you should see the hpc_job label in your Prometheus metrics:

DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.154.05", Hostname="g002", UUID="GPU-4d5e36c6-c835-c5a2-65dc-833267ebf851", device="nvidia0", gpu="0", hpc_job="482", instance="192.168.1.2:9400", job="dcgm_exporter", modelName="NVIDIA GeForce RTX 2080 Ti", pci_bus_id="00000000:01:00.0"}
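To pull just the job ID out of such a line from a script, a grep/sed pipeline is enough. The here-doc below supplies a sample line so the snippet runs anywhere; on a GPU node, replace it with the exporter's live output (curl -s localhost:9400/metrics, 9400 being the port shown in the instance label above):

```shell
# Extract the hpc_job label from a DCGM_FI_DEV_GPU_UTIL metric line.
# The here-doc is a stand-in for: curl -s localhost:9400/metrics
cat <<'EOF' | grep '^DCGM_FI_DEV_GPU_UTIL' | sed -n 's/.*hpc_job="\([^"]*\)".*/\1/p'
DCGM_FI_DEV_GPU_UTIL{Hostname="g002", gpu="0", hpc_job="482"} 87.0
EOF
```

On a real node this prints one Slurm job ID per GPU line (482 for the sample above); GPUs with no job print 0.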

Troubleshooting

Common Issues

  1. Missing job labels: Ensure your epilog/prolog scripts have proper permissions and are being executed

  2. Zero utilization reported: Check that the GPU is properly allocated to the job in Slurm

  3. Service fails to start: Verify the paths in your service file and that the user has appropriate permissions

Debug Commands

Check if DCGM is running:

sudo systemctl status dcgm-exporter

View DCGM logs:

sudo journalctl -u dcgm-exporter

Verify job mapping files:

ls -la /var/run/dcgm_job_maps/
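To see at a glance which job each GPU is currently mapped to, dump the contents of every mapping file. The snippet prints a note instead of erroring when run on a machine without the directory:

```shell
# Print "<gpu index> -> <job id>" for every mapping file
MAPDIR=/var/run/dcgm_job_maps
if [ -d "$MAPDIR" ]; then
    for f in "$MAPDIR"/*; do
        printf '%s -> %s\n' "$(basename "$f")" "$(cat "$f")"
    done
else
    echo "no mapping directory at $MAPDIR"
fi
```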

Prometheus Recording Rules

Prometheus recording rules allow you to precompute frequently needed or computationally expensive expressions and save their results as new time series. This is essential for GPU job metrics because:

  • Performance: Complex aggregations across many GPUs and jobs can be slow to compute on-the-fly
  • Historical Analysis: Recording rules create persistent time series that can be queried for historical trends
  • Dashboard Efficiency: Pre-aggregated metrics load faster in dashboards and reduce Prometheus query load

Setting Up Recording Rules

An example recording rules file is included in the repository at infra/gpu_utilization.yml. You can use this as a reference or copy it directly to your Prometheus rules directory.

Create or edit your Prometheus recording rules file (typically /etc/prometheus/rules/dcgm_job_rules.yml):

groups:
  - name: dcgm_job_metrics
    interval: 1m
    rules:
      # Average GPU utilization per job
      - record: job:gpu_utilization:current_avg
        expr: |
          avg by(hpc_job) (
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          )

      # 95th percentile GPU utilization per job
      - record: job:gpu_utilization:current_p95
        expr: |
          quantile by(hpc_job) (0.95,
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          )

      # Max memory utilization per job (percentage)
      - record: job:gpu_memory:current_max_pct
        expr: |
          max by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"} /
            (DCGM_FI_DEV_FB_USED{hpc_job!="0"} + DCGM_FI_DEV_FB_FREE{hpc_job!="0"}) * 100
          )

      # Average memory utilization per job (percentage)
      - record: job:gpu_memory:current_avg_pct
        expr: |
          avg by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"} /
            (DCGM_FI_DEV_FB_USED{hpc_job!="0"} + DCGM_FI_DEV_FB_FREE{hpc_job!="0"}) * 100
          )

      # Potentially underutilized jobs (1 if average GPU utilization is under 30%, else 0)
      - record: job:gpu_underutilized:bool
        expr: |
          avg by(hpc_job) (
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          ) < bool 30

      # Time-based aggregations for GPU utilization
      - record: job:gpu_utilization:1d_avg
        expr: |
          avg_over_time(
            job:gpu_utilization:current_avg[1d]
          )

      - record: job:gpu_utilization:7d_avg
        expr: |
          avg_over_time(
            job:gpu_utilization:current_avg[7d]
          )

      - record: job:gpu_utilization:30d_avg
        expr: |
          avg_over_time(
            job:gpu_utilization:current_avg[30d]
          )

      # Count of GPUs per job
      - record: job:gpu_count:current
        expr: |
          count by(hpc_job) (
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          )

      # Memory stats per job (absolute values)
      - record: job:gpu_memory:current_used_bytes
        expr: |
          sum by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"}
          )

      - record: job:gpu_memory:current_total_bytes
        expr: |
          sum by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"} + DCGM_FI_DEV_FB_FREE{hpc_job!="0"}
          )

Enable the Recording Rules

Add the rules file to your Prometheus configuration (/etc/prometheus/prometheus.yml):

rule_files:
  - "rules/dcgm_job_rules.yml"

Reload Prometheus to apply the changes:

sudo systemctl reload prometheus
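To catch syntax errors early, you can also validate the rules file with promtool, which ships with Prometheus. The path is the example path used above, and the snippet is guarded for machines where promtool or the file is absent:

```shell
# Syntax-check the recording rules before Prometheus loads them
RULES=/etc/prometheus/rules/dcgm_job_rules.yml
if command -v promtool >/dev/null 2>&1 && [ -f "$RULES" ]; then
    promtool check rules "$RULES"
else
    echo "skipping: promtool or $RULES not available on this machine"
fi
```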

Recording Rule Naming Convention:

The naming convention job:metric_name:aggregation follows the Prometheus level:metric:operations best practice: the job: prefix names the aggregation level (here, the HPC job identified by the hpc_job label, not the Prometheus scrape job), making these series easy to identify and query.

Verify Recording Rules

Check that the rules are loaded correctly:

curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="dcgm_job_metrics")'

Query one of the new metrics:

curl -s 'http://localhost:9090/api/v1/query?query=job:gpu_utilization:current_avg' | jq

Next Steps

For live metrics only (current configuration):

  • GPU utilization will display in node hover cards while jobs are running
  • Metrics disappear after jobs complete

For historical metrics (recommended for complete analysis):