Changelog

Release notes and version history for the Slurm Node Dashboard.

All notable changes to the Slurm Node Dashboard are documented here.


v2.1.5

YAML-Driven LLM Configuration

  • infra/llm-assistant.yaml — New YAML configuration file that drives the entire AI assistant: system prompt, cluster identity, tool definitions, job defaults, custom instructions, and restricted topics. No more editing application code to customize the AI.
  • LLM_CONFIG_PATH — New environment variable to specify a custom path for the YAML config (defaults to infra/llm-assistant.yaml).
  • Dynamic tool & prompt builder — The chat route now calls loadLLMConfig() and buildToolsAndPrompt() to construct tools and system prompt dynamically from the YAML file.
  • 60-second config cache — Config changes take effect automatically without a server restart.
  • Custom tool support — Define your own tools in YAML with execution blocks that call Slurm REST, SlurmDB, or external HTTP APIs with {param} placeholder interpolation.
  • Per-tool prompt guidance — Each tool supports a prompt_guidance field that tells the AI how to interpret and present results.
  • Restricted topics — Define topics the AI should redirect to support channels with custom messages.
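
The {param} placeholder interpolation mentioned for custom tools can be pictured with a small helper. This is a minimal sketch, not the dashboard's actual code: the function name interpolate, the ToolParams type, and the endpoint template in the example are all assumptions made for illustration.

```typescript
// Hypothetical helper: fills {param} placeholders in a tool's endpoint template.
type ToolParams = Record<string, string | number | boolean>;

function interpolate(template: string, params: ToolParams): string {
  return template.replace(/\{(\w+)\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(params, key)) {
      // Fail loudly rather than send a request containing a literal "{key}".
      throw new Error(`Missing value for placeholder {${key}}`);
    }
    // Encode so a value such as "a/b" cannot change the path structure.
    return encodeURIComponent(String(params[key]));
  });
}

// Example with an illustrative endpoint template.
const url = interpolate("/slurm/v0.0.40/job/{job_id}", { job_id: 12345 });
console.log(url); // "/slurm/v0.0.40/job/12345"
```

Throwing on a missing parameter is a design choice in this sketch; the real builder may instead leave the placeholder untouched or substitute a default.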

New Workflow Tools

  • troubleshoot_job — Multi-step diagnostic that gathers job details, partition state, and node health in a single tool call for comprehensive failure analysis.
  • sbatch_helper — Fetches available partitions, QoS levels, and cluster info to generate informed sbatch scripts using your cluster's actual configuration.
  • node_health_check — Comprehensive node health report with resource utilization, current jobs, and partition context.
  • Rich UI cards — All workflow tools render structured visual cards in the chat interface.

Admin → LLM Config Panel

  • New "LLM Config" tab in the Admin dashboard with a visual editor for the YAML configuration.
  • Cluster identity editor — Edit name, organization, documentation URL, and support email.
  • Tool management — Enable/disable built-in tools, create custom tools with execution endpoints.
  • System prompt & instructions editor — Edit system prompt and custom instructions in-browser.
  • Restricted topics manager — Add, edit, and remove restricted topic redirects.
  • Raw YAML editor — Switch to raw editing mode for advanced configuration.
  • Test chat preview — Test configuration changes with the built-in chat interface.
  • Config API - GET /api/llm-config and PUT /api/llm-config for reading and updating the config programmatically.

Chat Improvements

  • Step limiting — Tool-call loops are now limited to 5 steps via stepCountIs(5) to prevent runaway tool chains.
  • Forced first-step tool calls — The first LLM step forces tool execution (no text generation), preventing the model from hallucinating data before tools return results.
  • Markdown table stripping — LLM output is automatically stripped of markdown tables since tool cards already display all data.
  • Error handling — Improved error handling in the chat modal with try/catch for message sends and loading-state guards.
  • Chat UI refresh — Updated chat modal styling with backdrop blur, new bot icon, and refined input area.
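
Markdown table stripping can be approximated with a line-based filter. The sketch below is one possible approach under stated assumptions, not the dashboard's implementation: the function name stripMarkdownTables is hypothetical, and a table is assumed to be a pipe-delimited row followed directly by a dashed delimiter row.

```typescript
// Removes GitHub-style markdown tables (a header row, a |---|---| delimiter
// row, and any body rows that follow) from model output.
function stripMarkdownTables(text: string): string {
  const isRow = (line: string): boolean => /^\s*\|.*\|\s*$/.test(line);
  const isDelimiter = (line: string): boolean =>
    /^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$/.test(line);

  const lines = text.split("\n");
  const kept: string[] = [];
  let i = 0;
  while (i < lines.length) {
    // A table starts where a row is immediately followed by a delimiter row.
    if (isRow(lines[i]) && i + 1 < lines.length && isDelimiter(lines[i + 1])) {
      i += 2;
      while (i < lines.length && isRow(lines[i])) i++; // skip body rows
      continue;
    }
    kept.push(lines[i]);
    i++;
  }
  // Collapse the blank-line gap a removed table leaves behind.
  return kept.join("\n").replace(/\n{3,}/g, "\n\n").trim();
}

const reply = [
  "Here are your jobs.",
  "",
  "| Job | State |",
  "| --- | --- |",
  "| 1 | RUNNING |",
  "",
  "All healthy.",
].join("\n");
console.log(stripMarkdownTables(reply)); // "Here are your jobs.\n\nAll healthy."
```

A lone pipe-delimited line with no delimiter row beneath it is left alone, so ordinary prose containing "|" survives.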

GPU Metrics Capture Security

  • GPU_METRICS_CAPTURE_TOKEN - Optional server-side secret for POST /api/gpu; when configured, capture requests must send either Authorization: Bearer <token> or x-api-key: <token>.
  • GPU_METRICS_CLUSTER - Optional server-side cluster filter for GPU Prometheus queries when one Prometheus scrapes multiple Slurm clusters.
  • GPU_METRICS_LABEL_FILTER / GPU_METRICS_CAPTURE_LABEL_FILTER - Advanced PromQL label matcher overrides for global GPU queries or capture-only snapshots.
  • Cluster-aware GPU recording rules - infra/gpu_utilization.yml now preserves the cluster label on per-job rules and emits per-cluster rollups.
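
The token rule for POST /api/gpu can be read as a small header check. The following is a hedged sketch rather than the route's real code: isCaptureAuthorized and HeaderMap are hypothetical names, and the behaviour when no token is configured (allow every request) is an assumption inferred from the token being optional.

```typescript
// Hypothetical check mirroring the documented rule: when a token is configured,
// accept "Authorization: Bearer <token>" or "x-api-key: <token>".
type HeaderMap = Record<string, string | undefined>;

function isCaptureAuthorized(headers: HeaderMap, configuredToken?: string): boolean {
  // Assumption: with no token configured, capture requests are not restricted.
  if (!configuredToken) return true;

  // Header names are case-insensitive, so normalise them before lookup.
  const normalized: HeaderMap = {};
  for (const key of Object.keys(headers)) {
    normalized[key.toLowerCase()] = headers[key];
  }

  const auth = normalized["authorization"] ?? "";
  const bearer = auth.startsWith("Bearer ") ? auth.slice("Bearer ".length) : undefined;
  const apiKey = normalized["x-api-key"];

  return bearer === configuredToken || apiKey === configuredToken;
}

console.log(isCaptureAuthorized({ "x-api-key": "s3cret" }, "s3cret")); // true
```

A production handler would likely compare the secrets in constant time; plain equality keeps the sketch short.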

Dependency Upgrades

  • AI SDK - ai upgraded from v5 to v6, @ai-sdk/openai from v2 to v3, @ai-sdk/react from v2 to v3.
  • js-yaml — New dependency for YAML parsing/serialization of the LLM config.

v2.1.1

GPU Utilization Plugin Overhaul

  • New unified GPU API — All GPU metric endpoints consolidated under /api/gpu:
    • GET /api/gpu — Fetch current GPU metrics from Prometheus
    • POST /api/gpu — Capture and persist a GPU metrics snapshot (replaces /api/gpu-metrics/capture)
    • GET /api/gpu/report — Retrieve historical GPU utilization reports
    • GET /api/gpu/node — Fetch per-node GPU statistics
  • 3-tier query fallback strategy — The GPU metrics engine now tries recording rules first, then falls back to raw PromQL, then to DCGM-only queries, for maximum compatibility across Prometheus setups.
  • NEXT_PUBLIC_ENABLE_GPU_UTILIZATION — New environment variable acts as a feature flag to enable/disable the GPU utilization plugin across the entire dashboard.
  • Recording rules shipped in-repo — Pre-built Prometheus recording rules now live at infra/gpu_utilization.yml (replaces the old dcgm_job_rules.yml), covering per-job, time-windowed, and cluster-level aggregation.
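
The 3-tier fallback can be expressed as an ordered list of query functions that are tried in turn. This is only a sketch of the general pattern: the names QueryTier and queryWithFallback, the stub tiers, and the rule that a tier is skipped when it throws or returns no samples are assumptions, not details taken from the GPU metrics engine.

```typescript
// One tier = a label plus a function that runs the query and returns samples.
interface QueryTier<T> {
  name: string;
  run: () => Promise<T[]>;
}

interface TierResult<T> {
  tier: string;
  data: T[];
}

// Try each tier in order; move on when a tier throws or returns no samples.
async function queryWithFallback<T>(tiers: QueryTier<T>[]): Promise<TierResult<T> | null> {
  for (const tier of tiers) {
    try {
      const data = await tier.run();
      if (data.length > 0) return { tier: tier.name, data };
    } catch (_err) {
      // Treat a failed query like an empty one and fall through to the next tier.
    }
  }
  return null; // every tier failed or came back empty
}

// Stub tiers standing in for real Prometheus calls.
const tiers: QueryTier<number>[] = [
  { name: "recording-rules", run: async () => [] }, // e.g. rules not installed
  { name: "raw-promql", run: async () => { throw new Error("timeout"); } },
  { name: "dcgm-only", run: async () => [87.5] },
];

queryWithFallback(tiers).then((result) => console.log(result?.tier)); // "dcgm-only"
```

Returning the tier name alongside the data makes it easy to log which strategy actually served a request.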

GPU Metrics in the Dashboard UI

  • Job modals — Running jobs now display real-time GPU utilization, memory usage, and temperature; completed jobs show a GPU Efficiency badge.
  • Admin → GPU Analysis tab — New admin panel section with system-level GPU overview, per-job GPU table, search/filter, and auto-refresh.
  • Node card hover — GPU statistics are now visible in node card hover modals when the plugin is enabled.

JWT Token Management

  • JWT Token Info card on the Admin dashboard — Shows token validity, expiry status, and days remaining.
  • SLURM_TOKEN_EXPIRY_WARNING_DAYS — New environment variable (default 30) to configure when expiry warnings appear.
  • Slack webhook alerts — Optional SLACK_WEBHOOK_URL env var enables automated Slack notifications for token expiry via a cron-friendly endpoint: GET /api/slurm/jwt/alert.
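
The validity, days-remaining, and warning logic can be sketched as a pure function of the token's expiry time. The names below (TokenStatus, tokenStatus) are hypothetical, and treating the threshold as inclusive is an assumption; only the 30-day default comes from the notes above.

```typescript
interface TokenStatus {
  valid: boolean;        // false once the expiry time has passed
  daysRemaining: number; // whole days left, never negative
  warn: boolean;         // true when expired or inside the warning window
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// expSeconds is the JWT "exp" claim: seconds since the Unix epoch.
function tokenStatus(expSeconds: number, nowMs: number, warningDays = 30): TokenStatus {
  const remainingMs = expSeconds * 1000 - nowMs;
  const valid = remainingMs > 0;
  const daysRemaining = valid ? Math.floor(remainingMs / MS_PER_DAY) : 0;
  return { valid, daysRemaining, warn: !valid || daysRemaining <= warningDays };
}

const now = Date.UTC(2025, 0, 1);
const exp = (now + 10 * MS_PER_DAY) / 1000; // token expires in 10 days
console.log(tokenStatus(exp, now)); // { valid: true, daysRemaining: 10, warn: true }
```

Passing the current time in as a parameter keeps the function deterministic, which helps when a cron-driven alert endpoint needs to be tested.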

Node Configuration Changes

  • node.cfg relocated — Configuration file moved from the project root to infra/node.cfg.
  • NODEVIEW_CONFIG_PATH — New environment variable to specify a custom path for node.cfg (defaults to infra/node.cfg).
  • excludedNodes support — The node.cfg file now supports an excludedNodes array to hide specific nodes from the rack view.
  • rackLayout wrapper — Rack definitions are now nested under a rackLayout key in the configuration file.
  • Comment support - node.cfg now allows // comments for inline documentation.
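
Reading a node.cfg that carries // comments could look like the sketch below. It assumes the file is otherwise JSON, which these notes do not state. The names stripLineComments, parseNodeConfig, and NodeConfig are hypothetical, and the contents of rackLayout in the sample are invented for illustration; only the rackLayout and excludedNodes key names come from the notes above.

```typescript
// Hypothetical shape: the contents of rackLayout are left opaque.
interface NodeConfig {
  rackLayout: unknown;
  excludedNodes?: string[];
}

// Remove // comments, leaving "//" inside string literals (e.g. URLs) untouched.
function stripLineComments(source: string): string {
  let out = "";
  let inString = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (inString) {
      out += ch;
      if (ch === "\\") {
        out += source[++i] ?? ""; // copy the escaped character verbatim
      } else if (ch === '"') {
        inString = false;
      }
    } else if (ch === '"') {
      inString = true;
      out += ch;
    } else if (ch === "/" && source[i + 1] === "/") {
      while (i < source.length && source[i] !== "\n") i++; // skip to end of line
      if (i < source.length) out += "\n"; // keep the newline so line numbers match
    } else {
      out += ch;
    }
  }
  return out;
}

function parseNodeConfig(source: string): NodeConfig {
  return JSON.parse(stripLineComments(source)) as NodeConfig;
}

const sample = `{
  // racks in the machine room
  "rackLayout": { "rack1": ["node01", "node02"] },
  "excludedNodes": ["node02"] // hidden from the rack view
}`;
const parsed = parseNodeConfig(sample);
console.log(parsed.excludedNodes); // [ "node02" ]
```

Tracking string state character by character is what keeps a value such as "https://example.org" from being truncated at its double slash.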

Environment & Infrastructure

  • .env.production relocated — The environment template moved from the project root to infra/.env.production. Installation commands updated to mv infra/.env.production .env.
  • New environment variables added:
    • NODEVIEW_CONFIG_PATH — Custom path for node configuration
    • SLACK_WEBHOOK_URL — Slack webhook for JWT expiry alerts
    • SLURM_TOKEN_EXPIRY_WARNING_DAYS — Token warning threshold in days
    • NEXT_PUBLIC_ENABLE_GPU_UTILIZATION — Feature flag for GPU plugin

Update Script Improvements

  • The update script now preserves infra/.env.production as an untouched template — it is never modified during updates.
  • Environment comparison instructions updated to use diff .env infra/.env.production.