Changelog
Release notes and version history for the Slurm Node Dashboard.
All notable changes to the Slurm Node Dashboard are documented here.
v2.1.5
YAML-Driven LLM Configuration
- `infra/llm-assistant.yaml` — New YAML configuration file that drives the entire AI assistant: system prompt, cluster identity, tool definitions, job defaults, custom instructions, and restricted topics. No more editing application code to customize the AI.
- `LLM_CONFIG_PATH` — New environment variable to specify a custom path for the YAML config (defaults to `infra/llm-assistant.yaml`).
- Dynamic tool & prompt builder — The chat route now calls `loadLLMConfig()` and `buildToolsAndPrompt()` to construct tools and system prompt dynamically from the YAML file.
- 60-second config cache — Config changes take effect automatically without a server restart.
- Custom tool support — Define your own tools in YAML with `execution` blocks that call Slurm REST, SlurmDB, or external HTTP APIs with `{param}` placeholder interpolation.
- Per-tool prompt guidance — Each tool supports a `prompt_guidance` field that tells the AI how to interpret and present results.
- Restricted topics — Define topics the AI should redirect to support channels with custom messages.
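The `{param}` interpolation and the 60-second cache described above can be pictured with the minimal sketch below. It is not the dashboard's actual implementation: `interpolate` and `cached` are hypothetical helper names, and the endpoint path is only an example whose version segment depends on your Slurm installation.

```typescript
// Hypothetical helper: replace {name} placeholders in an endpoint template.
// Values are URL-encoded; a placeholder with no matching value throws.
function interpolate(
  template: string,
  params: Record<string, string | number>
): string {
  return template.replace(/\{(\w+)\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(params, key)) {
      throw new Error(`Missing value for placeholder {${key}}`);
    }
    return encodeURIComponent(String(params[key]));
  });
}

// Hypothetical time-based cache: the loader runs again once the cached
// value is at least ttlMs old. `now` is injectable to make testing easy.
function cached<T>(
  load: () => T,
  ttlMs: number,
  now: () => number = Date.now
): () => T {
  let value: T | undefined;
  let loadedAt = -Infinity;
  return () => {
    if (now() - loadedAt >= ttlMs) {
      value = load();
      loadedAt = now();
    }
    return value as T;
  };
}

// Example path only; the real Slurm REST path depends on your cluster.
const exampleUrl = interpolate("/slurm/v0.0.40/job/{job_id}", { job_id: 12345 });
console.log(exampleUrl);
```

With a 60000 ms TTL, an edit to the YAML file would be picked up by the first request made at least a minute after the previous load, which matches the "no restart" behavior described above.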
New Workflow Tools
- `troubleshoot_job` — Multi-step diagnostic that gathers job details, partition state, and node health in a single tool call for comprehensive failure analysis.
- `sbatch_helper` — Fetches available partitions, QoS levels, and cluster info to generate informed sbatch scripts using your cluster's actual configuration.
- `node_health_check` — Comprehensive node health report with resource utilization, current jobs, and partition context.
- Rich UI cards — All workflow tools render structured visual cards in the chat interface.
Admin → LLM Config Panel
- New "LLM Config" tab in the Admin dashboard with a visual editor for the YAML configuration.
- Cluster identity editor — Edit name, organization, documentation URL, and support email.
- Tool management — Enable/disable built-in tools, create custom tools with execution endpoints.
- System prompt & instructions editor — Edit system prompt and custom instructions in-browser.
- Restricted topics manager — Add, edit, and remove restricted topic redirects.
- Raw YAML editor — Switch to raw editing mode for advanced configuration.
- Test chat preview — Test configuration changes with the built-in chat interface.
- Config API — `GET /api/llm-config` and `PUT /api/llm-config` for reading and updating the config programmatically.
Chat Improvements
- Step limiting — Tool-call loops are now limited to 5 steps via `stepCountIs(5)` to prevent runaway tool chains.
- Forced first-step tool calls — The first LLM step forces tool execution (no text generation), preventing the model from hallucinating data before tools return results.
- Markdown table stripping — LLM output is automatically stripped of markdown tables since tool cards already display all data.
- Error handling — Improved error handling in the chat modal with try/catch for message sends and loading-state guards.
- Chat UI refresh — Updated chat modal styling with backdrop blur, new bot icon, and refined input area.
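As a rough illustration of the markdown table stripping mentioned above, the sketch below removes lines that look like table rows from model output. `stripMarkdownTables` is a hypothetical name, and the heuristic (a line that starts and ends with a pipe) is an assumption; the dashboard's actual rule may differ, and this version does not special-case tables inside code fences.

```typescript
// Hypothetical post-processing step for LLM output.
// A line counts as a table row when it starts and ends with a pipe.
function stripMarkdownTables(text: string): string {
  const isTableRow = (line: string): boolean => /^\s*\|.*\|\s*$/.test(line);
  const kept = text.split("\n").filter((line) => !isTableRow(line));
  // Collapse the runs of blank lines left behind by removed tables.
  return kept.join("\n").replace(/\n{3,}/g, "\n\n").trim();
}

const modelOutput = [
  "Job 42 failed.",
  "",
  "| Node | State |",
  "|------|-------|",
  "| n01  | DOWN  |",
  "",
  "Check the node.",
].join("\n");

console.log(stripMarkdownTables(modelOutput));
```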
GPU Metrics Capture Security
- `GPU_METRICS_CAPTURE_TOKEN` - Optional server-side secret for `POST /api/gpu`; when configured, capture requests must send either `Authorization: Bearer <token>` or `x-api-key: <token>`.
- `GPU_METRICS_CLUSTER` - Optional server-side cluster filter for GPU Prometheus queries when one Prometheus scrapes multiple Slurm clusters.
- `GPU_METRICS_LABEL_FILTER` / `GPU_METRICS_CAPTURE_LABEL_FILTER` - Advanced PromQL label matcher overrides for global GPU queries or capture-only snapshots.
- Cluster-aware GPU recording rules - `infra/gpu_utilization.yml` now preserves the `cluster` label on per-job rules and emits per-cluster rollups.
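The capture-token rule above could be checked along the lines of the sketch below. `isCaptureAuthorized` is a hypothetical name, header names are assumed to be lower-cased already, and treating an unset token as "no check" is an assumption drawn from the token being optional. A production check would likely also want a constant-time comparison.

```typescript
// Hypothetical guard for POST /api/gpu.
// `headers` is assumed to use lower-cased header names.
function isCaptureAuthorized(
  headers: Record<string, string | undefined>,
  configuredToken: string | undefined
): boolean {
  // Assumption: with no token configured, the check is skipped.
  if (!configuredToken) return true;
  const bearer = headers["authorization"];
  const apiKey = headers["x-api-key"];
  // Either header form is accepted, as described in the release notes.
  return bearer === `Bearer ${configuredToken}` || apiKey === configuredToken;
}

console.log(isCaptureAuthorized({ authorization: "Bearer s3cret" }, "s3cret"));
console.log(isCaptureAuthorized({ "x-api-key": "wrong" }, "s3cret"));
```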
Dependency Upgrades
- AI SDK — `ai` upgraded from v5 to v6, `@ai-sdk/openai` from v2 to v3, `@ai-sdk/react` from v2 to v3.
- `js-yaml` — New dependency for YAML parsing/serialization of the LLM config.
v2.1.1
GPU Utilization Plugin Overhaul
- New unified GPU API — All GPU metric endpoints consolidated under `/api/gpu`:
  - `GET /api/gpu` — Fetch current GPU metrics from Prometheus
  - `POST /api/gpu` — Capture and persist a GPU metrics snapshot (replaces `/api/gpu-metrics/capture`)
  - `GET /api/gpu/report` — Retrieve historical GPU utilization reports
  - `GET /api/gpu/node` — Fetch per-node GPU statistics
- 3-tier query fallback strategy — The GPU metrics engine now tries recording rules first, then falls back to raw PromQL, then to DCGM-only queries, for maximum compatibility across Prometheus setups.
- `NEXT_PUBLIC_ENABLE_GPU_UTILIZATION` — New environment variable that acts as a feature flag to enable/disable the GPU utilization plugin across the entire dashboard.
- Recording rules shipped in-repo — Pre-built Prometheus recording rules now live at `infra/gpu_utilization.yml` (replaces the old `dcgm_job_rules.yml`), covering per-job, time-windowed, and cluster-level aggregation.
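The 3-tier fallback described above follows a general "first tier that returns data wins" pattern, sketched below. The tier names and stub queries are placeholders rather than the dashboard's real PromQL, and `queryWithFallback` is a hypothetical name. Treating an error the same as an empty result is also an assumption.

```typescript
type Tier<T> = { name: string; run: () => Promise<T[]> };

// Try each tier in order and return the first one that yields samples.
// A tier that throws or returns nothing falls through to the next one.
async function queryWithFallback<T>(
  tiers: Tier<T>[]
): Promise<{ tier: string; samples: T[] } | null> {
  for (const tier of tiers) {
    try {
      const samples = await tier.run();
      if (samples.length > 0) return { tier: tier.name, samples };
    } catch {
      // Assumption: an error is handled like an empty result.
    }
  }
  return null;
}

// Stub tiers standing in for real Prometheus queries.
const exampleTiers: Tier<number>[] = [
  { name: "recording-rules", run: async () => [] }, // rules not installed
  { name: "raw-promql", run: async () => [87.5] }, // returns data
  { name: "dcgm-only", run: async () => [80] }, // not reached here
];

queryWithFallback(exampleTiers).then((result) => console.log(result?.tier));
```

Ordering the tiers from cheapest to most broadly compatible keeps the common case fast while still returning data on Prometheus setups that lack the recording rules.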
GPU Metrics in the Dashboard UI
- Job modals — Running jobs now display real-time GPU utilization, memory usage, and temperature; completed jobs show a GPU Efficiency badge.
- Admin → GPU Analysis tab — New admin panel section with system-level GPU overview, per-job GPU table, search/filter, and auto-refresh.
- Node card hover — GPU statistics are now visible in node card hover modals when the plugin is enabled.
JWT Token Management
- JWT Token Info card on the Admin dashboard — Shows token validity, expiry status, and days remaining.
- `SLURM_TOKEN_EXPIRY_WARNING_DAYS` — New environment variable (default `30`) to configure when expiry warnings appear.
- Slack webhook alerts — Optional `SLACK_WEBHOOK_URL` env var enables automated Slack notifications for token expiry via a cron-friendly endpoint: `GET /api/slurm/jwt/alert`.
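The "days remaining" and warning-threshold logic above can be approximated as in the sketch below. `tokenExpiryStatus` and its return shape are hypothetical; the only fixed points are that a JWT `exp` claim is expressed in seconds since the Unix epoch and that the warning threshold defaults to 30 days.

```typescript
type ExpiryStatus = { expired: boolean; daysRemaining: number; warn: boolean };

// `expSeconds` is the JWT `exp` claim (seconds since the Unix epoch).
// `nowMs` is the current time in milliseconds, e.g. Date.now().
function tokenExpiryStatus(
  expSeconds: number,
  nowMs: number,
  warningDays = 30
): ExpiryStatus {
  const msPerDay = 24 * 60 * 60 * 1000;
  const remainingMs = expSeconds * 1000 - nowMs;
  const expired = remainingMs <= 0;
  // Whole days left, never negative.
  const daysRemaining = Math.max(0, Math.floor(remainingMs / msPerDay));
  return { expired, daysRemaining, warn: expired || daysRemaining <= warningDays };
}

// A token that expires 10 days from "now" falls inside the default window.
const tenDaysInSeconds = 10 * 24 * 60 * 60;
console.log(tokenExpiryStatus(tenDaysInSeconds, 0));
```

In a real deployment the third argument would presumably come from `SLURM_TOKEN_EXPIRY_WARNING_DAYS`, parsed as a number.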
Node Configuration Changes
- `node.cfg` relocated — Configuration file moved from the project root to `infra/node.cfg`.
- `NODEVIEW_CONFIG_PATH` — New environment variable to specify a custom path for `node.cfg` (defaults to `infra/node.cfg`).
- `excludedNodes` support — The `node.cfg` file now supports an `excludedNodes` array to hide specific nodes from the rack view.
- `rackLayout` wrapper — Rack definitions are now nested under a `rackLayout` key in the configuration file.
- Comment support — `node.cfg` now allows `//` comments for inline documentation.
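To show how a file with `//` comments, an `excludedNodes` array, and a `rackLayout` wrapper might be read, here is a sketch that assumes `node.cfg` is JSON apart from the comments. That format, the `stripLineComments` helper, and the contents of `rackLayout` in the example are all assumptions for illustration; check `infra/node.cfg` in the repository for the real schema.

```typescript
// Remove // line comments that appear outside of JSON string literals,
// so values such as "https://..." are left intact.
function stripLineComments(source: string): string {
  let out = "";
  let inString = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (inString) {
      out += ch;
      if (ch === "\\") {
        // Copy the escaped character verbatim so \" does not end the string.
        i++;
        if (i < source.length) out += source[i];
      } else if (ch === '"') {
        inString = false;
      }
    } else if (ch === '"') {
      inString = true;
      out += ch;
    } else if (ch === "/" && source[i + 1] === "/") {
      // Skip to the end of the line, keeping the newline itself.
      while (i < source.length && source[i] !== "\n") i++;
      if (i < source.length) out += "\n";
    } else {
      out += ch;
    }
  }
  return out;
}

// Hypothetical file contents; the rack entries are made up.
const exampleCfg = `{
  // Nodes listed here are hidden from the rack view
  "excludedNodes": ["login01"],
  "rackLayout": {
    "rack1": ["node001", "node002"] // hypothetical rack definition
  }
}`;

const parsedCfg = JSON.parse(stripLineComments(exampleCfg));
console.log(parsedCfg.excludedNodes, Object.keys(parsedCfg.rackLayout));
```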
Environment & Infrastructure
- `.env.production` relocated — The environment template moved from the project root to `infra/.env.production`. Installation commands updated to `mv infra/.env.production .env`.
- New environment variables added:
  - `NODEVIEW_CONFIG_PATH` — Custom path for node configuration
  - `SLACK_WEBHOOK_URL` — Slack webhook for JWT expiry alerts
  - `SLURM_TOKEN_EXPIRY_WARNING_DAYS` — Token warning threshold in days
  - `NEXT_PUBLIC_ENABLE_GPU_UTILIZATION` — Feature flag for GPU plugin
Update Script Improvements
- The update script now preserves `infra/.env.production` as an untouched template — it is never modified during updates.
- Environment comparison instructions updated to use `diff .env infra/.env.production`.