GPU Utilization Reporting

Comprehensive guide for reporting and visualizing GPU utilization for jobs on the HPC system

This guide shows how to configure, collect, and visualize GPU utilization metrics for jobs running on your HPC cluster. Once enabled, GPU utilization data will be visible in node hover cards on the dashboard and in the historical job data views.

Live vs Historical Metrics

By default, GPU utilization reporting provides live metrics only while jobs are actively running. To enable historical metrics for completed jobs (with efficiency grades and detailed analysis), you must also enable the Job Metrics Plugin.

Feature                                  | Live Metrics               | Historical Metrics
-----------------------------------------|----------------------------|-------------------------------
Real-time GPU utilization while jobs run | Yes                        | Yes
Node hover card display                  | Yes                        | Yes
Data persists after job completes        | No                         | Yes
GPU efficiency grades (A-E)              | No                         | Yes
Historical data queries                  | No                         | Yes
Requires                                 | DCGM Exporter + Prometheus | Job Metrics Plugin + Database

For a complete setup with historical metrics, follow the Historical GPU Metrics section in the Job Metrics guide after completing the steps below.

Prerequisites

  • Prometheus installed and enabled
  • NVIDIA DCGM Exporter
  • NVIDIA GPUs
  • Slurm workload manager

Configuration Steps

1. Enable HPC Job Details in DCGM-Exporter

First, update the DCGM Exporter service to include job mapping details:

# Edit the systemd service file
sudo vim /etc/systemd/system/dcgm-exporter.service

Use the following configuration:

[Unit]
Description=Nvidia Data Center GPU Manager Exporter
Wants=network-online.target
After=network-online.target

[Service]
Environment="DCGM_HPC_JOB_MAPPING_DIR=/var/run/dcgm_job_maps"
User=node_exporter
Group=node_exporter
Type=simple
ExecStartPre=/bin/bash -c 'mkdir -p "$DCGM_HPC_JOB_MAPPING_DIR" && chmod 775 "$DCGM_HPC_JOB_MAPPING_DIR"; for i in $(seq 0 $(( $(nvidia-smi -L | wc -l) - 1 ))); do FILE="$DCGM_HPC_JOB_MAPPING_DIR/$i"; touch "$FILE" && chmod 666 "$FILE"; [ -s "$FILE" ] || echo 0 > "$FILE"; done'
ExecStart=/usr/local/bin/dcgm_exporter -d f

[Install]
WantedBy=multi-user.target

Reload and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart dcgm-exporter

Verify Your Environment:

Ensure your directories and executables exist with proper permissions before proceeding.
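As a quick pre-flight check, the snippet below reports whether the mapping directory is in place. It falls back to the default path when DCGM_HPC_JOB_MAPPING_DIR is unset and only prints status rather than failing, so it is safe to run on any machine:

```shell
# Report whether the job mapping directory exists and is writable
MAPDIR="${DCGM_HPC_JOB_MAPPING_DIR:-/var/run/dcgm_job_maps}"
if [ -d "$MAPDIR" ]; then
    echo "mapping dir present: $MAPDIR"
    if [ -w "$MAPDIR" ]; then
        echo "mapping dir writable"
    else
        echo "mapping dir NOT writable by $(id -un)"
    fi
else
    echo "mapping dir missing: $MAPDIR (it is created in step 2 below)"
fi
```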

2. Create Required Directories and Scripts

Create the job mapping directory on each node:

sudo mkdir -p /var/run/dcgm_job_maps
sudo chmod 775 /var/run/dcgm_job_maps

Create Slurm epilog and prolog scripts to track GPU allocation:

Epilog Script (/usr/local/bin/_dev_epilog):

#!/bin/bash
# Runs when a job ends: reset each allocated GPU's mapping file to 0 ("no job")
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "0" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi

Prolog Script (/usr/local/bin/_dev_prolog):

#!/bin/bash
# Runs when a job starts: write the Slurm job ID into each allocated GPU's mapping file
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "$SLURM_JOB_ID" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi

Set proper permissions:

sudo chmod +x /usr/local/bin/_dev_epilog
sudo chmod +x /usr/local/bin/_dev_prolog
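The effect of the two scripts can be demonstrated without Slurm by pointing the same logic at a temporary directory. The job ID and GPU list below are made-up values for illustration:

```shell
# Simulate prolog (write job ID) then epilog (reset to 0) in a throwaway dir
JOB_MAPPING_DIR="$(mktemp -d)"
SLURM_JOB_ID="482"       # made-up job ID
SLURM_JOB_GPUS="0,2"     # made-up GPU allocation

# prolog behaviour: record the job ID for each allocated GPU
for GPU_ID in $(echo "$SLURM_JOB_GPUS" | tr ',' ' '); do
    echo "$SLURM_JOB_ID" > "$JOB_MAPPING_DIR/$GPU_ID"
done
echo "after prolog: GPU 0 -> $(cat "$JOB_MAPPING_DIR/0")"

# epilog behaviour: reset each mapping to 0 ("no job")
for GPU_ID in $(echo "$SLURM_JOB_GPUS" | tr ',' ' '); do
    echo "0" > "$JOB_MAPPING_DIR/$GPU_ID"
done
echo "after epilog: GPU 0 -> $(cat "$JOB_MAPPING_DIR/0")"
rm -rf "$JOB_MAPPING_DIR"
```

This is exactly the state DCGM Exporter reads: while the job runs, the file for each of its GPUs holds the job ID; afterwards it holds 0.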

3. Configure Slurm to Use the Scripts

Add the following to your slurm.conf:

TaskEpilog=/usr/local/bin/_dev_epilog
TaskProlog=/usr/local/bin/_dev_prolog

Restart Slurm services:

sudo systemctl restart slurmctld
sudo systemctl restart slurmd
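You can confirm that Slurm picked up the new settings by querying its running configuration with scontrol (guarded here so the snippet degrades gracefully on a machine without Slurm):

```shell
# Show the TaskProlog/TaskEpilog paths Slurm is actually using
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | grep -Ei 'task(prolog|epilog)'
else
    echo "scontrol not found; cannot verify TaskProlog/TaskEpilog here"
fi
```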

Verification

Once configured correctly, you should see the hpc_job label in your Prometheus metrics:

DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.154.05", Hostname="g002", UUID="GPU-4d5e36c6-c835-c5a2-65dc-833267ebf851", device="nvidia0", gpu="0", hpc_job="482", instance="192.168.1.2:9400", job="dcgm_exporter", modelName="NVIDIA GeForce RTX 2080 Ti", pci_bus_id="00000000:01:00.0"}
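To pull just the job ID out of such a line from a script, a grep/sed pipeline is enough. The here-doc below supplies a sample line so the snippet runs anywhere; on a GPU node, replace it with the exporter's live output (curl -s localhost:9400/metrics, 9400 being the port shown in the instance label above):

```shell
# Extract the hpc_job label from a DCGM_FI_DEV_GPU_UTIL metric line.
# The here-doc is a stand-in for: curl -s localhost:9400/metrics
cat <<'EOF' | grep '^DCGM_FI_DEV_GPU_UTIL' | sed -n 's/.*hpc_job="\([^"]*\)".*/\1/p'
DCGM_FI_DEV_GPU_UTIL{Hostname="g002", gpu="0", hpc_job="482"} 87.0
EOF
```

On a real node this prints one Slurm job ID per GPU line (482 for the sample above); GPUs with no job print 0.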

Troubleshooting

Common Issues

  1. Missing job labels: Ensure your epilog/prolog scripts have proper permissions and are being executed

  2. Zero utilization reported: Check that the GPU is properly allocated to the job in Slurm

  3. Service fails to start: Verify the paths in your service file and that the user has appropriate permissions

Debug Commands

Check if DCGM is running:

sudo systemctl status dcgm-exporter

View DCGM logs:

sudo journalctl -u dcgm-exporter

Verify job mapping files:

ls -la /var/run/dcgm_job_maps/
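To see at a glance which job each GPU is currently mapped to, dump the contents of every mapping file. The snippet prints a note instead of erroring when run on a machine without the directory:

```shell
# Print "<gpu index> -> <job id>" for every mapping file
MAPDIR=/var/run/dcgm_job_maps
if [ -d "$MAPDIR" ]; then
    for f in "$MAPDIR"/*; do
        printf '%s -> %s\n' "$(basename "$f")" "$(cat "$f")"
    done
else
    echo "no mapping directory at $MAPDIR"
fi
```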

Prometheus Recording Rules

Prometheus recording rules allow you to precompute frequently needed or computationally expensive expressions and save their results as new time series. This is essential for GPU job metrics because:

  • Performance: Complex aggregations across many GPUs and jobs can be slow to compute on-the-fly
  • Historical Analysis: Recording rules create persistent time series that can be queried for historical trends
  • Dashboard Efficiency: Pre-aggregated metrics load faster in dashboards and reduce Prometheus query load

Setting Up Recording Rules

An example recording rules file is included in the repository at infra/gpu_utilization.yml. You can use this as a reference or copy it directly to your Prometheus rules directory.

Create or edit your Prometheus recording rules file (typically /etc/prometheus/rules/dcgm_job_rules.yml):

groups:
  - name: dcgm_job_metrics
    interval: 1m
    rules:
      # Average GPU utilization per job
      - record: job:gpu_utilization:current_avg
        expr: |
          avg by(hpc_job) (
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          )

      # 95th percentile GPU utilization per job
      - record: job:gpu_utilization:current_p95
        expr: |
          quantile by(hpc_job) (0.95,
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          )

      # Max memory utilization per job (percentage)
      - record: job:gpu_memory:current_max_pct
        expr: |
          max by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"} /
            (DCGM_FI_DEV_FB_USED{hpc_job!="0"} + DCGM_FI_DEV_FB_FREE{hpc_job!="0"}) * 100
          )

      # Average memory utilization per job (percentage)
      - record: job:gpu_memory:current_avg_pct
        expr: |
          avg by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"} /
            (DCGM_FI_DEV_FB_USED{hpc_job!="0"} + DCGM_FI_DEV_FB_FREE{hpc_job!="0"}) * 100
          )

      # Potentially underutilized jobs (1 if average GPU utilization is under 30%, else 0)
      - record: job:gpu_underutilized:bool
        expr: |
          avg by(hpc_job) (
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          ) < bool 30

      # Time-based aggregations for GPU utilization
      - record: job:gpu_utilization:1d_avg
        expr: |
          avg_over_time(
            job:gpu_utilization:current_avg[1d]
          )

      - record: job:gpu_utilization:7d_avg
        expr: |
          avg_over_time(
            job:gpu_utilization:current_avg[7d]
          )

      - record: job:gpu_utilization:30d_avg
        expr: |
          avg_over_time(
            job:gpu_utilization:current_avg[30d]
          )

      # Count of GPUs per job
      - record: job:gpu_count:current
        expr: |
          count by(hpc_job) (
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          )

      # Memory stats per job (absolute values)
      - record: job:gpu_memory:current_used_bytes
        expr: |
          sum by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"}
          )

      - record: job:gpu_memory:current_total_bytes
        expr: |
          sum by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"} + DCGM_FI_DEV_FB_FREE{hpc_job!="0"}
          )

Enable the Recording Rules

Add the rules file to your Prometheus configuration (/etc/prometheus/prometheus.yml):

rule_files:
  - "rules/dcgm_job_rules.yml"

Reload Prometheus to apply the changes:

sudo systemctl reload prometheus
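To catch syntax errors early, you can also validate the rules file with promtool, which ships with Prometheus. The path is the example path used above, and the snippet is guarded for machines where promtool or the file is absent:

```shell
# Syntax-check the recording rules before Prometheus loads them
RULES=/etc/prometheus/rules/dcgm_job_rules.yml
if command -v promtool >/dev/null 2>&1 && [ -f "$RULES" ]; then
    promtool check rules "$RULES"
else
    echo "skipping: promtool or $RULES not available on this machine"
fi
```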

Recording Rule Naming Convention:

The naming convention job:metric_name:aggregation follows the Prometheus level:metric:operations best practice: the job: prefix names the aggregation level (here, the HPC job identified by the hpc_job label, not the Prometheus scrape job), making these series easy to identify and query.

Verify Recording Rules

Check that the rules are loaded correctly:

curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="dcgm_job_metrics")'

Query one of the new metrics:

curl -s 'http://localhost:9090/api/v1/query?query=job:gpu_utilization:current_avg' | jq

Next Steps

For live metrics only (current configuration):

  • GPU utilization will display in node hover cards while jobs are running
  • Metrics disappear after jobs complete

For historical metrics (recommended for complete analysis):