Configuration
Configuration options for the Job Metrics Dashboard
Configuration
To enable and configure the Job Metrics Dashboard in the frontend application, you need to set specific environment variables in your .env file.
Environment Variables
Enable the Plugin
Set the following variable to true to enable the Job Metrics UI components in the dashboard.
NEXT_PUBLIC_ENABLE_JOB_METRICS_PLUGIN=true
Database Connection
Provide the connection URL for the PostgreSQL database where the Go scraper is storing the metrics. This allows the dashboard to query the historical data.
# Format: postgres://user:password@host:port/database
SLURM_JOB_METRICS_DATABASE_URL="postgres://admin:securepass@localhost:5432/slurm_history"
Database Consistency:
Ensure this URL points to the same database (slurm_history) that the Go Scraper is writing to.
Scraper Configuration
The backend scraper has its own set of configuration variables (e.g., Slurm API tokens, sync intervals). Please refer to the Go Scraper Configuration section for those details.
Historical GPU Metrics
The Job Metrics plugin can capture and store GPU utilization metrics for completed jobs. This enables GPU efficiency grades, historical analysis, and trend reporting.
Prerequisites
Before enabling GPU metrics, ensure you have:
- GPU Utilization Reporting configured for live metrics: GPU Utilization Reporting
- Prometheus configured with DCGM exporter and recording rules
- Metrics Database (PostgreSQL) used by the Job Metrics plugin
Setup Steps
Step 1: Enable GPU Metrics
Add these environment variables to your .env file:
NEXT_PUBLIC_ENABLE_JOB_METRICS_PLUGIN=true
NEXT_PUBLIC_ENABLE_GPU_UTILIZATION=true
SLURM_JOB_METRICS_DATABASE_URL="postgres://user:password@host:5432/database"
PROMETHEUS_URL="http://your-prometheus-host:9090"
# Optional: scope GPU metrics when one Prometheus scrapes multiple clusters
# Requires DCGM metrics/recording rules to include a matching cluster label
GPU_METRICS_CLUSTER="dev"
# Advanced: raw PromQL label matcher(s), overrides GPU_METRICS_CLUSTER
GPU_METRICS_LABEL_FILTER=""
# Advanced: capture-only PromQL label matcher(s), overrides the global GPU filter
GPU_METRICS_CAPTURE_LABEL_FILTER=""
# Optional: require a secret for POST /api/gpu capture requests
GPU_METRICS_CAPTURE_TOKEN="replace-with-a-long-random-token"
New Plugin Flag:
NEXT_PUBLIC_ENABLE_GPU_UTILIZATION is a separate plugin flag from the Prometheus plugin. It specifically enables GPU job-level metrics collection, display in job modals, the Admin GPU Analysis tab, and the Metrics page GPU KPI card.
Capture Token:
GPU_METRICS_CAPTURE_TOKEN is optional. When it is set, POST /api/gpu requires either an Authorization: Bearer ... header or an x-api-key header with the same token. When it is not set, the capture endpoint remains open for local cron compatibility.
Shared Prometheus:
Use GPU_METRICS_CLUSTER when a shared Prometheus scrapes more than one Slurm cluster. The value must match the cluster label on your DCGM metrics and GPU recording rules. For advanced setups, GPU_METRICS_LABEL_FILTER accepts raw PromQL label matchers such as cluster="dev", instance=~"sdg.*".
Step 2: Verify Database Migration
The Slurm History Ingestor is a separate service/repository. Make sure it is installed and configured, and that the GPU metrics migration (003_add_gpu_metrics.sql) was applied during the ingestor setup.
Verify the table exists in your metrics database:
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -c "\d job_gpu_metrics"
If the table does not exist, apply the GPU metrics migration and re-run the check. See the Go Scraper installation guide for the correct migration steps.
Step 3: Set Up Metrics Capture
Create a capture script that will be called periodically by cron. The capture endpoint uses a POST request to /api/gpu:
#!/bin/bash
# GPU Metrics Capture Script
# Captures current GPU utilization for all running jobs
# and stores in the database for historical access after jobs complete
DASHBOARD_URL="http://localhost:3020"
if [[ -n "${GPU_METRICS_CAPTURE_TOKEN:-}" ]]; then
curl -s -X POST "$DASHBOARD_URL/api/gpu" \
-H "Authorization: Bearer $GPU_METRICS_CAPTURE_TOKEN"
else
curl -s -X POST "$DASHBOARD_URL/api/gpu"
fi
Save this as /path/to/gpu-capture.sh and make it executable:
chmod +x /path/to/gpu-capture.sh
Step 4: Configure Cron Job
Add to your crontab to run every 5 minutes:
*/5 * * * * /path/to/gpu-capture.sh >> /var/log/gpu-capture.log 2>&1
Or edit crontab directly:
crontab -e
Rate Limiting:
The endpoint is rate-limited to 1 capture per minute. Repeated calls within the limit return HTTP 429 with no side effects, making it safe to call more frequently if needed.
Step 5: Verify Setup
Check the capture endpoint:
curl -X POST http://localhost:3020/api/gpu
Preview what would be captured without writing to the database:
curl -X POST 'http://localhost:3020/api/gpu?dry_run=true' | jq
If GPU_METRICS_CAPTURE_TOKEN is configured:
curl -X POST http://localhost:3020/api/gpu \
-H "Authorization: Bearer $GPU_METRICS_CAPTURE_TOKEN"
Expected response:
{
"status": 200,
"message": "GPU metrics capture complete",
"captured": 3,
"updated": 2,
"markedComplete": 1,
"selector": "hpc_job!=\"0\", hpc_job!=\"\", cluster=\"dev\"",
"selectorSource": "metrics-env"
}
Check database for captured metrics:
SELECT job_id, avg_utilization, gpu_count, last_seen, is_complete
FROM job_gpu_metrics
ORDER BY last_seen DESC
LIMIT 10;
View a completed GPU job in the dashboard. You should see GPU efficiency grades and utilization stats.
How It Works
While jobs are running:
- Cron calls
POST /api/gpuevery 5 minutes - Endpoint queries Prometheus for active GPU jobs using DCGM metrics, scoped by
GPU_METRICS_CLUSTER,GPU_METRICS_LABEL_FILTER, orGPU_METRICS_CAPTURE_LABEL_FILTERwhen configured - Metrics are upserted into
job_gpu_metricswith one row per job; repeated captures update running averages, peaks, GPU count, sample count, andlast_seen - Jobs are tracked until they complete
When jobs complete:
- After 10 minutes of no metrics, jobs are marked
is_complete = true - Historical job detail views query the database instead of Prometheus
- GPU efficiency grades are calculated from stored averages
Query priority (3-tier fallback):
- Running jobs: Dashboard queries Prometheus recording rules first (fastest)
- Fallback: Direct DCGM queries if recording rules aren't available
- Completed jobs: Dashboard queries database (historical data)
Troubleshooting
No metrics being captured:
- Verify Prometheus is accessible and DCGM exporter is running:
curl http://your-prometheus-host:9090/api/v1/targets
- Check that DCGM metrics exist in Prometheus:
curl -s 'http://your-prometheus-host:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq
- If
GPU_METRICS_CLUSTERis set, verify DCGM metrics have a matching label:
curl -s 'http://your-prometheus-host:9090/api/v1/query?query=count%20by%20(cluster,hpc_job,Hostname)%20(DCGM_FI_DEV_GPU_UTIL%7Bhpc_job!%3D%220%22,hpc_job!%3D%22%22%7D)' | jq
- Ensure cron job is running:
grep gpu-capture /var/log/cron.log
- If
GPU_METRICS_CAPTURE_TOKENis set, verify the cron environment exports it and the capture script sends it as anAuthorization: Bearer ...orx-api-keyheader. Missing or incorrect tokens return HTTP 401.
Database errors:
- Verify the database connection string:
psql "$SLURM_JOB_METRICS_DATABASE_URL" -c "SELECT COUNT(*) FROM job_gpu_metrics;"
- Ensure migration was applied successfully:
psql "$SLURM_JOB_METRICS_DATABASE_URL" -c "\d job_gpu_metrics"
Historical jobs not showing GPU stats:
- Verify the job was running when capture was active
- Check if metrics exist:
psql "$SLURM_JOB_METRICS_DATABASE_URL" -c "SELECT * FROM job_gpu_metrics WHERE job_id = 'YOUR_JOB_ID';"
- Ensure both plugins are enabled in
.env - Check that the job has completed (should be marked
is_complete = true)