GPU Utilization Reporting
Comprehensive guide for reporting and visualizing GPU utilization for jobs on the HPC system
This guide shows how to configure, collect, and visualize GPU utilization metrics for jobs running on your HPC cluster. Once enabled, GPU utilization data will be visible in node hover cards on the dashboard and in the historical job data views.
Live vs Historical Metrics
By default, GPU utilization reporting provides live metrics only while jobs are actively running. To enable historical metrics for completed jobs (with efficiency grades and detailed analysis), you must also enable the Job Metrics Plugin.
| Feature | Live Metrics | Historical Metrics |
|---|---|---|
| Real-time GPU utilization while jobs run | ✓ | ✓ |
| Node hover card display | ✓ | ✓ |
| Data persists after job completes | | ✓ |
| GPU efficiency grades (A-E) | | ✓ |
| Historical data queries | | ✓ |
| Requires | DCGM Exporter + Prometheus | Job Metrics Plugin + Database |
For a complete setup with historical metrics, follow the Historical GPU Metrics section in the Job Metrics guide after completing the steps below.
Prerequisites
- Prometheus installed and enabled
- NVIDIA DCGM Exporter
- NVIDIA GPUs
- Slurm workload manager
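Prometheus must also be scraping the DCGM Exporter for any of the metrics below to appear. If that scrape job is not yet configured, a minimal sketch looks like the following (the port 9400 is the exporter's default; the hostnames `g001`/`g002` are placeholders for your GPU nodes):

```yaml
scrape_configs:
  - job_name: dcgm_exporter
    static_configs:
      - targets:
          - g001:9400
          - g002:9400
```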
Configuration Steps
1. Enable HPC Job Details in DCGM-Exporter
First, update the DCGM Exporter service to include job mapping details:
# Edit the systemd service file
sudo vim /etc/systemd/system/dcgm-exporter.service
Use the following configuration:
[Unit]
Description=Nvidia Data Center GPU Manager Exporter
Wants=network-online.target
After=network-online.target
[Service]
Environment="DCGM_HPC_JOB_MAPPING_DIR=/var/run/dcgm_job_maps"
User=node_exporter
Group=node_exporter
Type=simple
ExecStartPre=/bin/bash -c 'mkdir -p "$DCGM_HPC_JOB_MAPPING_DIR" && chmod 775 "$DCGM_HPC_JOB_MAPPING_DIR"; for i in $(seq 0 $(( $(nvidia-smi -L | wc -l) - 1 ))); do FILE="$DCGM_HPC_JOB_MAPPING_DIR/$i"; touch "$FILE" && chmod 666 "$FILE"; [ -s "$FILE" ] || echo 0 > "$FILE"; done'
ExecStart=/usr/local/bin/dcgm_exporter -d f
[Install]
WantedBy=multi-user.target
Reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart dcgm-exporter
Verify Your Environment:
Ensure your directories and executables exist with proper permissions before proceeding.
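To see what the `ExecStartPre` one-liner produces, its logic can be dry-run in a throwaway directory. The temp dir and fixed GPU count below are stand-ins for `/var/run/dcgm_job_maps` and the real `nvidia-smi -L | wc -l` count:

```shell
# Simulate the ExecStartPre step from the service file above.
MAPDIR="$(mktemp -d)"   # stand-in for /var/run/dcgm_job_maps
GPU_COUNT=4             # stand-in for the nvidia-smi GPU count
for i in $(seq 0 $((GPU_COUNT - 1))); do
  FILE="$MAPDIR/$i"
  touch "$FILE" && chmod 666 "$FILE"
  [ -s "$FILE" ] || echo 0 > "$FILE"   # idle GPUs map to job id 0
done
ls -la "$MAPDIR"
cat "$MAPDIR/0"   # prints: 0
```

On a real node you should see one world-writable file per GPU, each containing `0` until a job is mapped to that GPU.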
2. Create Required Directories and Scripts
Create the job mapping directory on each node:
sudo mkdir -p /var/run/dcgm_job_maps
sudo chmod 775 /var/run/dcgm_job_maps
Create Slurm epilog and prolog scripts to track GPU allocation:
Epilog Script (/usr/local/bin/_dev_epilog):
#!/bin/bash
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "0" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi
Prolog Script (/usr/local/bin/_dev_prolog):
#!/bin/bash
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "$SLURM_JOB_ID" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi
Set proper permissions:
sudo chmod +x /usr/local/bin/_dev_epilog
sudo chmod +x /usr/local/bin/_dev_prolog
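Before wiring the scripts into Slurm, the same logic can be dry-run against a temp directory instead of `/var/run/dcgm_job_maps`. The GPU IDs and job ID below are example values standing in for what Slurm exports:

```shell
# Dry-run the prolog logic with example Slurm variables.
JOB_MAPPING_DIR="$(mktemp -d)"  # stand-in for /var/run/dcgm_job_maps
SLURM_JOB_GPUS="0,2"            # example allocation: GPUs 0 and 2
SLURM_JOB_ID="482"              # example job id
for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
  truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
  echo "$SLURM_JOB_ID" >> "$JOB_MAPPING_DIR/$GPU_ID"
done
cat "$JOB_MAPPING_DIR/0"   # prints: 482
```

Each allocated GPU's mapping file now holds the job ID; the epilog reverses this by writing `0` back when the job ends.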
3. Configure Slurm to Use the Scripts
Add the following to your slurm.conf:
TaskEpilog=/usr/local/bin/_dev_epilog
TaskProlog=/usr/local/bin/_dev_prolog
Restart Slurm services:
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
Verification
Once configured correctly, you should see the hpc_job label in your Prometheus metrics:
DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.154.05", Hostname="g002", UUID="GPU-4d5e36c6-c835-c5a2-65dc-833267ebf851", device="nvidia0", gpu="0", hpc_job="482", instance="192.168.1.2:9400", job="dcgm_exporter", modelName="NVIDIA GeForce RTX 2080 Ti", pci_bus_id="00000000:01:00.0"}
Troubleshooting
Common Issues
- Missing job labels: Ensure your epilog/prolog scripts have proper permissions and are being executed
- Zero utilization reported: Check that the GPU is properly allocated to the job in Slurm
- Service fails to start: Verify the paths in your service file and that the user has appropriate permissions
Debug Commands
Check if DCGM is running:
sudo systemctl status dcgm-exporter
View DCGM logs:
sudo journalctl -u dcgm-exporter
Verify job mapping files:
ls -la /var/run/dcgm_job_maps/
Prometheus Recording Rules
Prometheus recording rules allow you to precompute frequently needed or computationally expensive expressions and save their results as new time series. This is essential for GPU job metrics because:
- Performance: Complex aggregations across many GPUs and jobs can be slow to compute on-the-fly
- Historical Analysis: Recording rules create persistent time series that can be queried for historical trends
- Dashboard Efficiency: Pre-aggregated metrics load faster in dashboards and reduce Prometheus query load
Setting Up Recording Rules
An example recording rules file is included in the repository at infra/gpu_utilization.yml. You can use this as a reference or copy it directly to your Prometheus rules directory.
Create or edit your Prometheus recording rules file (typically /etc/prometheus/rules/dcgm_job_rules.yml):
groups:
  - name: dcgm_job_metrics
    interval: 1m
    rules:
      # Average GPU utilization per job
      - record: job:gpu_utilization:current_avg
        expr: |
          avg by(hpc_job) (
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          )
      # 95th percentile GPU utilization per job
      - record: job:gpu_utilization:current_p95
        expr: |
          quantile by(hpc_job) (0.95,
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          )
      # Max memory utilization per job (percentage)
      - record: job:gpu_memory:current_max_pct
        expr: |
          max by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"} /
            (DCGM_FI_DEV_FB_USED{hpc_job!="0"} + DCGM_FI_DEV_FB_FREE{hpc_job!="0"}) * 100
          )
      # Average memory utilization per job (percentage)
      - record: job:gpu_memory:current_avg_pct
        expr: |
          avg by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"} /
            (DCGM_FI_DEV_FB_USED{hpc_job!="0"} + DCGM_FI_DEV_FB_FREE{hpc_job!="0"}) * 100
          )
      # Potentially underutilized jobs (under 30% GPU utilization)
      - record: job:gpu_underutilized:bool
        expr: |
          avg by(hpc_job) (
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          ) < 30
      # Time-based aggregations for GPU utilization
      - record: job:gpu_utilization:1d_avg
        expr: |
          avg_over_time(
            job:gpu_utilization:current_avg[1d]
          )
      - record: job:gpu_utilization:7d_avg
        expr: |
          avg_over_time(
            job:gpu_utilization:current_avg[7d]
          )
      - record: job:gpu_utilization:30d_avg
        expr: |
          avg_over_time(
            job:gpu_utilization:current_avg[30d]
          )
      # Count of GPUs per job
      - record: job:gpu_count:current
        expr: |
          count by(hpc_job) (
            DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"}
          )
      # Memory stats per job (absolute values)
      - record: job:gpu_memory:current_used_bytes
        expr: |
          sum by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"}
          )
      - record: job:gpu_memory:current_total_bytes
        expr: |
          sum by(hpc_job) (
            DCGM_FI_DEV_FB_USED{hpc_job!="0"} + DCGM_FI_DEV_FB_FREE{hpc_job!="0"}
          )
Enable the Recording Rules
Add the rules file to your Prometheus configuration (/etc/prometheus/prometheus.yml):
rule_files:
- "rules/dcgm_job_rules.yml"
Reload Prometheus to apply the changes:
sudo systemctl reload prometheus
Recording Rule Naming Convention:
The naming convention job:metric_name:aggregation follows Prometheus best practices. The prefix job: indicates these are job-level aggregations, making them easy to identify and query.
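As one illustration of querying these recorded names, a hypothetical alerting rule built on the underutilization series might look like the following (the group name, threshold label, and annotation text are all made up for this example):

```yaml
groups:
  - name: dcgm_job_alerts
    rules:
      - alert: GPUJobUnderutilized
        expr: job:gpu_underutilized:bool
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Job {{ $labels.hpc_job }} is averaging under 30% GPU utilization"
```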
Verify Recording Rules
Check that the rules are loaded correctly:
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="dcgm_job_metrics")'
Query one of the new metrics:
curl -s 'http://localhost:9090/api/v1/query?query=job:gpu_utilization:current_avg' | jq
Next Steps
For live metrics only (current configuration):
- GPU utilization will display in node hover cards while jobs are running
- Metrics disappear after jobs complete
For historical metrics (recommended for complete analysis):
- Enable the Job Metrics Plugin following the Historical GPU Metrics setup guide
- This enables GPU efficiency grades and persistent storage of metrics in the database
- View GPU metrics for completed jobs in the job details page