Configuration

Configuration options for the Job Metrics Dashboard

Configuration

To enable and configure the Job Metrics Dashboard in the frontend application, you need to set specific environment variables in your .env file.

Environment Variables

Enable the Plugin

Set the following variable to true to enable the Job Metrics UI components in the dashboard.

NEXT_PUBLIC_ENABLE_JOB_METRICS_PLUGIN=true

Database Connection

Provide the connection URL for the PostgreSQL database where the Go scraper is storing the metrics. This allows the dashboard to query the historical data.

# Format: postgres://user:password@host:port/database
SLURM_JOB_METRICS_DATABASE_URL="postgres://admin:securepass@localhost:5432/slurm_history"

Database Consistency:

Ensure this URL points to the same database (slurm_history) that the Go Scraper is writing to.

Scraper Configuration

The backend scraper has its own set of configuration variables (e.g., Slurm API tokens, sync intervals). Please refer to the Go Scraper Configuration section for those details.

Historical GPU Metrics

The Job Metrics plugin can capture and store GPU utilization metrics for completed jobs. This enables GPU efficiency grades, historical analysis, and trend reporting.

Prerequisites

Before enabling GPU metrics, ensure you have:

  1. GPU Utilization Reporting configured for live metrics: GPU Utilization Reporting
  2. Prometheus configured with DCGM exporter and recording rules
  3. Metrics Database (PostgreSQL) used by the Job Metrics plugin

Setup Steps

Step 1: Enable GPU Metrics

Add these environment variables to your .env file:

NEXT_PUBLIC_ENABLE_JOB_METRICS_PLUGIN=true
NEXT_PUBLIC_ENABLE_GPU_UTILIZATION=true
SLURM_JOB_METRICS_DATABASE_URL="postgres://user:password@host:5432/database"
PROMETHEUS_URL="http://your-prometheus-host:9090"

# Optional: scope GPU metrics when one Prometheus scrapes multiple clusters
# Requires DCGM metrics/recording rules to include a matching cluster label
GPU_METRICS_CLUSTER="dev"

# Advanced: raw PromQL label matcher(s), overrides GPU_METRICS_CLUSTER
GPU_METRICS_LABEL_FILTER=""

# Advanced: capture-only PromQL label matcher(s), overrides the global GPU filter
GPU_METRICS_CAPTURE_LABEL_FILTER=""

# Optional: require a secret for POST /api/gpu capture requests
GPU_METRICS_CAPTURE_TOKEN="replace-with-a-long-random-token"

New Plugin Flag:

NEXT_PUBLIC_ENABLE_GPU_UTILIZATION is a separate plugin flag from the Prometheus plugin. It specifically enables GPU job-level metrics collection, display in job modals, the Admin GPU Analysis tab, and the Metrics page GPU KPI card.

Capture Token:

GPU_METRICS_CAPTURE_TOKEN is optional. When it is set, POST /api/gpu requires either an Authorization: Bearer ... header or an x-api-key header with the same token. When it is not set, the capture endpoint remains open for local cron compatibility.

Shared Prometheus:

Use GPU_METRICS_CLUSTER when a shared Prometheus scrapes more than one Slurm cluster. The value must match the cluster label on your DCGM metrics and GPU recording rules. For advanced setups, GPU_METRICS_LABEL_FILTER accepts raw PromQL label matchers such as cluster="dev", instance=~"sdg.*".

Step 2: Verify Database Migration

The Slurm History Ingestor is a separate service/repository. Make sure it is installed and configured, and that the GPU metrics migration (003_add_gpu_metrics.sql) was applied during the ingestor setup.

Verify the table exists in your metrics database:

psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -c "\d job_gpu_metrics"

If the table does not exist, apply the GPU metrics migration and re-run the check. See the Go Scraper installation guide for the correct migration steps.

Step 3: Set Up Metrics Capture

Create a capture script that will be called periodically by cron. The capture endpoint uses a POST request to /api/gpu:

#!/bin/bash
# GPU Metrics Capture Script
# Captures current GPU utilization for all running jobs
# and stores in the database for historical access after jobs complete

DASHBOARD_URL="http://localhost:3020"

if [[ -n "${GPU_METRICS_CAPTURE_TOKEN:-}" ]]; then
  curl -s -X POST "$DASHBOARD_URL/api/gpu" \
    -H "Authorization: Bearer $GPU_METRICS_CAPTURE_TOKEN"
else
  curl -s -X POST "$DASHBOARD_URL/api/gpu"
fi

Save this as /path/to/gpu-capture.sh and make it executable:

chmod +x /path/to/gpu-capture.sh

Step 4: Configure Cron Job

Add to your crontab to run every 5 minutes:

*/5 * * * * /path/to/gpu-capture.sh >> /var/log/gpu-capture.log 2>&1

Or edit crontab directly:

crontab -e

Rate Limiting:

The endpoint is rate-limited to 1 capture per minute. Repeated calls within the limit return HTTP 429 with no side effects, making it safe to call more frequently if needed.

Step 5: Verify Setup

Check the capture endpoint:

curl -X POST http://localhost:3020/api/gpu

Preview what would be captured without writing to the database:

curl -X POST 'http://localhost:3020/api/gpu?dry_run=true' | jq

If GPU_METRICS_CAPTURE_TOKEN is configured:

curl -X POST http://localhost:3020/api/gpu \
  -H "Authorization: Bearer $GPU_METRICS_CAPTURE_TOKEN"

Expected response:

{
  "status": 200,
  "message": "GPU metrics capture complete",
  "captured": 3,
  "updated": 2,
  "markedComplete": 1,
  "selector": "hpc_job!=\"0\", hpc_job!=\"\", cluster=\"dev\"",
  "selectorSource": "metrics-env"
}

Check database for captured metrics:

SELECT job_id, avg_utilization, gpu_count, last_seen, is_complete
FROM job_gpu_metrics
ORDER BY last_seen DESC
LIMIT 10;

View a completed GPU job in the dashboard. You should see GPU efficiency grades and utilization stats.

How It Works

While jobs are running:

  • Cron calls POST /api/gpu every 5 minutes
  • Endpoint queries Prometheus for active GPU jobs using DCGM metrics, scoped by GPU_METRICS_CLUSTER, GPU_METRICS_LABEL_FILTER, or GPU_METRICS_CAPTURE_LABEL_FILTER when configured
  • Metrics are upserted into job_gpu_metrics with one row per job; repeated captures update running averages, peaks, GPU count, sample count, and last_seen
  • Jobs are tracked until they complete

When jobs complete:

  • After 10 minutes of no metrics, jobs are marked is_complete = true
  • Historical job detail views query the database instead of Prometheus
  • GPU efficiency grades are calculated from stored averages

Query priority (3-tier fallback):

  1. Running jobs: Dashboard queries Prometheus recording rules first (fastest)
  2. Fallback: Direct DCGM queries if recording rules aren't available
  3. Completed jobs: Dashboard queries database (historical data)

Troubleshooting

No metrics being captured:

  1. Verify Prometheus is accessible and DCGM exporter is running:
curl http://your-prometheus-host:9090/api/v1/targets
  1. Check that DCGM metrics exist in Prometheus:
curl -s 'http://your-prometheus-host:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq
  1. If GPU_METRICS_CLUSTER is set, verify DCGM metrics have a matching label:
curl -s 'http://your-prometheus-host:9090/api/v1/query?query=count%20by%20(cluster,hpc_job,Hostname)%20(DCGM_FI_DEV_GPU_UTIL%7Bhpc_job!%3D%220%22,hpc_job!%3D%22%22%7D)' | jq
  1. Ensure cron job is running:
grep gpu-capture /var/log/cron.log
  1. If GPU_METRICS_CAPTURE_TOKEN is set, verify the cron environment exports it and the capture script sends it as an Authorization: Bearer ... or x-api-key header. Missing or incorrect tokens return HTTP 401.

Database errors:

  1. Verify the database connection string:
psql "$SLURM_JOB_METRICS_DATABASE_URL" -c "SELECT COUNT(*) FROM job_gpu_metrics;"
  1. Ensure migration was applied successfully:
psql "$SLURM_JOB_METRICS_DATABASE_URL" -c "\d job_gpu_metrics"

Historical jobs not showing GPU stats:

  1. Verify the job was running when capture was active
  2. Check if metrics exist:
psql "$SLURM_JOB_METRICS_DATABASE_URL" -c "SELECT * FROM job_gpu_metrics WHERE job_id = 'YOUR_JOB_ID';"
  1. Ensure both plugins are enabled in .env
  2. Check that the job has completed (should be marked is_complete = true)