Slurm History Ingestor

Overview

The Slurm History Ingestor is a standalone Go service that syncs job history from a Slurm HPC cluster into PostgreSQL for analytics and reporting. It reads jobs either through the Slurm REST API (slurmrestd) or directly from the sacct command.

Key Features:

  • Incremental Syncing - Fetches only jobs that are new since the last sync
  • Robust Data Handling - Uses a lookback window to catch jobs that arrive out of order
  • Data Normalization - Converts TRES strings into query-friendly columns
  • Multi-Cluster Support - Tags records with the cluster name so several clusters can share one database
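
To make the Data Normalization feature concrete, here is a minimal Go sketch of turning a TRES string into typed fields. The TRES struct, the function names, and the choice of columns are assumptions made for illustration; the ingestor's actual schema and parser may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// TRES holds a few commonly queried values pulled out of a TRES string.
// The field set is illustrative, not the ingestor's actual schema.
type TRES struct {
	CPUs  int64
	MemMB int64
	Nodes int64
	GPUs  int64
}

// parseMemMB converts values such as "16G" or "4000M" to megabytes
// (1G = 1024M). A bare number is assumed to already be in MB.
func parseMemMB(v string) (int64, error) {
	if v == "" {
		return 0, fmt.Errorf("empty memory value")
	}
	mult := 1.0
	switch v[len(v)-1] {
	case 'K':
		mult, v = 1.0/1024, v[:len(v)-1]
	case 'M':
		v = v[:len(v)-1]
	case 'G':
		mult, v = 1024, v[:len(v)-1]
	case 'T':
		mult, v = 1024*1024, v[:len(v)-1]
	}
	n, err := strconv.ParseFloat(v, 64)
	if err != nil {
		return 0, err
	}
	return int64(n * mult), nil
}

// parseTRES splits "cpu=4,mem=16G,node=1,gres/gpu=2" into typed fields.
// Keys it does not know (for example "billing" or typed GPU keys such as
// "gres/gpu:a100") are ignored.
func parseTRES(s string) (TRES, error) {
	var t TRES
	for _, pair := range strings.Split(s, ",") {
		key, val, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok {
			continue // skip empty or malformed segments
		}
		var err error
		switch key {
		case "cpu":
			t.CPUs, err = strconv.ParseInt(val, 10, 64)
		case "node":
			t.Nodes, err = strconv.ParseInt(val, 10, 64)
		case "gres/gpu":
			t.GPUs, err = strconv.ParseInt(val, 10, 64)
		case "mem":
			t.MemMB, err = parseMemMB(val)
		}
		if err != nil {
			return TRES{}, fmt.Errorf("tres key %q: %w", key, err)
		}
	}
	return t, nil
}

func main() {
	t, err := parseTRES("billing=4,cpu=4,gres/gpu=2,mem=16G,node=1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", t) // prints {CPUs:4 MemMB:16384 Nodes:1 GPUs:2}
}
```

Storing these values as separate numeric columns lets a query filter or aggregate on them (for example, total GPUs allocated per account) without parsing strings in SQL.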

Prerequisites

Requirement   Version   Notes
PostgreSQL    13+       Database storage
Slurm         20.11+    slurmrestd (API v0.0.41) must be enabled for API mode
Go            1.22+     Only if building from source
psql          -         PostgreSQL client for running migrations

Installing Required Packages

Ubuntu/Debian:

sudo apt update
sudo apt install -y postgresql-client

RHEL/Rocky/AlmaLinux:

sudo dnf install -y postgresql

Arch Linux:

sudo pacman -S postgresql

Note: You only need the PostgreSQL client (psql) on the machine running the ingestor. The PostgreSQL server can be on a different machine.


Quick Start

Choose one of the following installation methods:

Option A: Interactive Setup Script

git clone https://github.com/thediymaker/slurm-history-ingestor.git
cd slurm-history-ingestor
chmod +x setup.sh
./setup.sh

The script will interactively prompt for all configuration and optionally run database migrations.


Option B: Docker Compose

Step 1: Clone and configure

git clone https://github.com/thediymaker/slurm-history-ingestor.git
cd slurm-history-ingestor
cp .env.example .env

Step 2: Edit .env with your values

nano .env  # or use your preferred editor

Step 3: Run database migrations

You must create the database tables before starting the ingestor. Choose based on your setup:

If using an existing PostgreSQL server:

psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/001_init.sql

If using Docker Compose with the bundled PostgreSQL (uncomment the postgres service in docker-compose.yml first):

# Start just the database first
docker compose up -d postgres

# Wait a few seconds for it to initialize, then run migrations
docker compose exec postgres psql -U slurm_user -d slurm_history -f /docker-entrypoint-initdb.d/001_init.sql

Step 4: Start the ingestor

docker compose up -d ingestor

Step 5: View logs

docker compose logs -f ingestor

Option C: Pre-built Binary (No Build Required)

Step 1: Download the binary

Download the latest release from GitHub Releases.

# Example for Linux x64
wget https://github.com/thediymaker/slurm-history-ingestor/releases/latest/download/slurm-ingestor-linux-amd64
chmod +x slurm-ingestor-linux-amd64
mv slurm-ingestor-linux-amd64 slurm-ingestor

Step 2: Run database migrations

# Download the migration file
wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/db/migrations/001_init.sql

# Apply to your database
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f 001_init.sql

Step 3: Configure environment

# Download the example config
wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/.env.example -O .env

# Edit with your values
nano .env

Step 4: Run the ingestor

./slurm-ingestor

Option D: Build from Source

git clone https://github.com/thediymaker/slurm-history-ingestor.git
cd slurm-history-ingestor

# Install sqlc and generate database code
go install github.com/sqlc-dev/sqlc/cmd/sqlc@latest
sqlc generate

# Build
go mod tidy
go build -o slurm-ingestor ./cmd/ingest

# Run migrations
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/001_init.sql

# Configure and run
cp .env.example .env
nano .env
./slurm-ingestor

Configuration Reference

All configuration is via environment variables or a .env file.
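
As a rough illustration of how environment-based configuration with defaults typically works in Go, the sketch below reads one required and one optional variable. The helper names (lookupFunc, required, intOrDefault) are hypothetical, and the sketch covers only process environment variables, not .env file loading.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupFunc matches the signature of os.LookupEnv so that a map-backed
// function can be substituted in tests.
type lookupFunc func(key string) (string, bool)

// required returns the value of key, or an error when it is unset or empty.
func required(lookup lookupFunc, key string) (string, error) {
	v, ok := lookup(key)
	if !ok || v == "" {
		return "", fmt.Errorf("%s must be set", key)
	}
	return v, nil
}

// intOrDefault returns key parsed as an integer, or def when key is unset.
func intOrDefault(lookup lookupFunc, key string, def int) (int, error) {
	v, ok := lookup(key)
	if !ok || v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %q is not an integer", key, v)
	}
	return n, nil
}

func main() {
	// SYNC_INTERVAL falls back to its documented default of 300 seconds.
	interval, err := intOrDefault(os.LookupEnv, "SYNC_INTERVAL", 300)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("sync interval (s):", interval)

	// DATABASE_URL has no default, so a missing value is reported.
	if _, err := required(os.LookupEnv, "DATABASE_URL"); err != nil {
		fmt.Println("config error:", err)
	}
}
```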

Required

Variable       Description                    Example
DATABASE_URL   PostgreSQL connection string   postgres://user:pass@localhost:5432/slurm_history?sslmode=disable
CLUSTER_NAME   Unique cluster identifier      production-hpc
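
A malformed DATABASE_URL is an easy mistake to make, so it can help to check its shape before starting the service. The Go sketch below uses the standard net/url package; the dbTarget type and checkDatabaseURL function are hypothetical helpers for illustration, not part of the ingestor.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// dbTarget is a hypothetical summary type used only in this example.
// It deliberately omits the password so it is safe to print.
type dbTarget struct {
	Host, Port, Database, User, SSLMode string
}

// checkDatabaseURL validates the shape of a PostgreSQL connection URL
// and returns its non-secret parts.
func checkDatabaseURL(raw string) (dbTarget, error) {
	u, err := url.Parse(raw)
	if err != nil {
		// The parse error is not wrapped because it can echo the raw URL,
		// including the password.
		return dbTarget{}, fmt.Errorf("DATABASE_URL is not a valid URL")
	}
	if u.Scheme != "postgres" && u.Scheme != "postgresql" {
		return dbTarget{}, fmt.Errorf("unexpected scheme %q", u.Scheme)
	}
	db := strings.TrimPrefix(u.Path, "/")
	if u.Hostname() == "" || db == "" {
		return dbTarget{}, fmt.Errorf("host and database name are required")
	}
	port := u.Port()
	if port == "" {
		port = "5432" // PostgreSQL's default port
	}
	return dbTarget{
		Host:     u.Hostname(),
		Port:     port,
		Database: db,
		User:     u.User.Username(),
		SSLMode:  u.Query().Get("sslmode"),
	}, nil
}

func main() {
	t, err := checkDatabaseURL("postgres://user:pass@localhost:5432/slurm_history?sslmode=disable")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", t)
}
```

Because the value is a URL, special characters in the password (such as @, / or #) need to be percent-encoded.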

Ingest Mode

Choose one of two modes:

API Mode (default) - Uses Slurm REST API:

Variable            Description
INGEST_MODE         Set to api (default)
SLURM_SERVER        Slurm REST API URL (e.g., http://slurm-head:6820)
SLURM_API_ACCOUNT   Slurm user for API auth
SLURM_API_TOKEN     JWT token for auth
SLURM_API_VERSION   API version (default: v0.0.41)

Sacct Mode (recommended) - Uses sacct command directly:

Variable      Description
INGEST_MODE   Set to sacct
SACCT_PATH    Path to sacct binary (default: sacct)

Note: Sacct mode is faster and more reliable than API mode. It requires running the ingestor on a host with sacct access, such as a Slurm login node.
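
For a sense of what reading from sacct involves, the Go sketch below builds a machine-readable sacct query and parses one output row. The flags and field names are standard sacct options, but this particular selection, the function names, and the sample row are assumptions; the ingestor's actual invocation is not shown in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// fields lists the columns requested from sacct. All are standard sacct
// format fields, but this selection is an assumption for the example.
var fields = []string{"JobID", "User", "Account", "Partition", "State", "Submit", "Start", "End", "AllocTRES"}

// sacctArgs builds arguments for a query over a time range.
// start and end use sacct's YYYY-MM-DD[THH:MM[:SS]] time format.
func sacctArgs(start, end string) []string {
	return []string{
		"--allusers",    // include jobs from every user
		"--allocations", // one row per job, no step rows
		"--parsable2",   // '|'-delimited output without a trailing delimiter
		"--noheader",    // omit the column header line
		"--starttime", start,
		"--endtime", end,
		"--format", strings.Join(fields, ","),
	}
}

// parseLine maps one '|'-delimited output row to field name -> value.
func parseLine(line string) (map[string]string, error) {
	cols := strings.Split(line, "|")
	if len(cols) != len(fields) {
		return nil, fmt.Errorf("expected %d columns, got %d", len(fields), len(cols))
	}
	row := make(map[string]string, len(fields))
	for i, name := range fields {
		row[name] = cols[i]
	}
	return row, nil
}

func main() {
	args := sacctArgs("2024-01-01T00:00:00", "2024-01-02T00:00:00")
	fmt.Println("sacct", strings.Join(args, " "))

	// Made-up sample row in the field order above.
	row, err := parseLine("12345|alice|physics|gpu|COMPLETED|2024-01-01T08:00:00|2024-01-01T08:05:00|2024-01-01T09:05:00|billing=4,cpu=4,mem=16G,node=1")
	if err != nil {
		panic(err)
	}
	fmt.Println(row["JobID"], row["State"], row["AllocTRES"])
}
```

In a real run the argument list would be passed to the binary named by SACCT_PATH, for example with os/exec.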

Optional

Variable            Default      Description
SYNC_INTERVAL       300          Seconds between syncs
INITIAL_SYNC_DATE   2024-01-01   How far back to sync on first run (YYYY-MM-DD)
CHUNK_HOURS         24           Hours per request chunk (lower = less data per request)
HTTP_TIMEOUT        120          Seconds to wait for API response (API mode only)
DEBUG               false        Enable verbose logging
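
To show how INITIAL_SYNC_DATE and CHUNK_HOURS interact, here is a small Go sketch that splits a sync range into request-sized windows. The window type and the splitWindows and mustDate functions are hypothetical; the ingestor's own chunking logic may differ in detail.

```go
package main

import (
	"fmt"
	"time"
)

// window is a half-open time range [Start, End).
type window struct {
	Start, End time.Time
}

// mustDate parses a YYYY-MM-DD date (the INITIAL_SYNC_DATE layout) as UTC.
func mustDate(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

// splitWindows divides [start, end) into consecutive windows of at most
// chunkHours hours each; the last window is truncated at end.
func splitWindows(start, end time.Time, chunkHours int) []window {
	if chunkHours <= 0 || !start.Before(end) {
		return nil
	}
	step := time.Duration(chunkHours) * time.Hour
	var out []window
	for cur := start; cur.Before(end); cur = cur.Add(step) {
		next := cur.Add(step)
		if next.After(end) {
			next = end
		}
		out = append(out, window{Start: cur, End: next})
	}
	return out
}

func main() {
	start := mustDate("2024-01-01")
	end := start.Add(60 * time.Hour) // stand-in for "now"
	for _, w := range splitWindows(start, end, 24) {
		fmt.Println(w.Start.Format(time.RFC3339), "->", w.End.Format(time.RFC3339))
	}
}
```

With CHUNK_HOURS=24 a 60-hour range yields three requests; lowering the value produces more, smaller requests, which is why it helps with timeouts.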

Performance Tuning (API Mode)

If you experience timeout errors with API mode, adjust these settings:

Problem                 Solution
"Timeout was reached"   Reduce CHUNK_HOURS to 6, 3, or even 1
Slow network            Increase HTTP_TIMEOUT to 300 or higher
Very busy cluster       Switch to INGEST_MODE=sacct (recommended)

The ingestor will automatically retry failed requests up to 3 times with exponential backoff.
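
The retry behavior can be pictured with the following Go sketch of retry with exponential backoff. It assumes three attempts in total and a delay that doubles after each failure; the ingestor's actual delays and its exact counting of retries are not documented here, and the retry function and errTransient value are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a timeout or other temporary failure.
var errTransient = errors.New("transient failure")

// retry calls fn until it succeeds or maxAttempts calls have failed.
// The wait doubles after each failure: base, 2*base, 4*base, ...
func retry(maxAttempts int, base time.Duration, fn func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail the first two attempts
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```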

Running as a Systemd Service

For production deployments, run as a systemd service:

1. Install the binary and config:

# Create installation directory
sudo mkdir -p /opt/slurm-ingestor

# Copy binary
sudo cp slurm-ingestor /opt/slurm-ingestor/
sudo chmod +x /opt/slurm-ingestor/slurm-ingestor

# Copy and edit your .env file
sudo cp .env /opt/slurm-ingestor/.env
sudo chmod 600 /opt/slurm-ingestor/.env  # Protect credentials

2. Create /etc/systemd/system/slurm-ingestor.service:

Using .env file (recommended):

[Unit]
Description=Slurm History Ingestor
After=network.target

[Service]
Type=simple
User=slurm
Group=slurm
WorkingDirectory=/opt/slurm-ingestor
ExecStart=/opt/slurm-ingestor/slurm-ingestor
Restart=on-failure
RestartSec=10

# Load configuration from .env file
EnvironmentFile=/opt/slurm-ingestor/.env

[Install]
WantedBy=multi-user.target

Or define variables directly:

[Unit]
Description=Slurm History Ingestor
After=network.target

[Service]
Type=simple
User=slurm
Group=slurm
ExecStart=/opt/slurm-ingestor/slurm-ingestor
Restart=on-failure
RestartSec=10

# Sacct mode configuration
Environment="INGEST_MODE=sacct"
Environment="DATABASE_URL=postgres://user:pass@db-host:5432/slurm_history"
Environment="CLUSTER_NAME=production-hpc"
Environment="SACCT_PATH=/usr/bin/sacct"

[Install]
WantedBy=multi-user.target

3. Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable --now slurm-ingestor
systemctl status slurm-ingestor

Storage Requirements

Approximately 500 bytes per job record:

Jobs          Storage
100,000       ~50 MB
1 million     ~500 MB
10 million    ~5 GB
100 million   ~50 GB
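
The table follows directly from the 500-bytes-per-job figure. A short Go sketch of the arithmetic, using decimal units and hypothetical helper names, lets you estimate other job counts:

```go
package main

import "fmt"

// bytesPerJob is the approximate per-record size quoted above.
const bytesPerJob = 500

// estimateBytes returns the approximate storage for a number of jobs.
func estimateBytes(jobs int64) int64 {
	return jobs * bytesPerJob
}

// human formats a byte count using decimal units (1 MB = 10^6 bytes).
func human(b int64) string {
	switch {
	case b >= 1e9:
		return fmt.Sprintf("%.0f GB", float64(b)/1e9)
	case b >= 1e6:
		return fmt.Sprintf("%.0f MB", float64(b)/1e6)
	default:
		return fmt.Sprintf("%d B", b)
	}
}

func main() {
	for _, jobs := range []int64{100_000, 1_000_000, 10_000_000, 100_000_000} {
		fmt.Printf("%11d jobs  ~%s\n", jobs, human(estimateBytes(jobs)))
	}
}
```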

For large deployments (>1M jobs): use a dedicated PostgreSQL server and consider table partitioning by submit_time.


Troubleshooting

Issue                   Solution
Connection Refused      Verify SLURM_SERVER is reachable and slurmrestd is running
Authentication Failed   Check SLURM_API_ACCOUNT exists and token is valid
No Jobs Syncing         Enable DEBUG=true, verify CLUSTER_NAME matches
Migration Errors        Tables may already exist - usually safe to ignore