Slurm History Ingestor

Overview

The Slurm History Ingestor is a standalone Go service that syncs job history from Slurm HPC clusters into PostgreSQL for analytics and reporting.

Key Features:

  • Incremental Syncing – Only fetches new jobs since last sync
  • Robust Data Handling – Uses lookback window to catch out-of-order jobs
  • Data Normalization – Converts TRES strings to query-friendly columns
  • Multi-Cluster Support – Tag records with cluster name for centralized databases
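
To make the Data Normalization feature concrete: Slurm reports trackable resources (TRES) as a comma-separated string of key=value pairs, for example cpu=4,mem=16G,node=1,gres/gpu=2. The sketch below splits such a string and pulls out individual values, which is roughly what turning TRES into separate columns involves. It is an illustration only. The ingestor is written in Go and its parsing code is not shown here; the function name tres_get and the sample string are assumptions.

```shell
#!/usr/bin/env bash
# tres_get TRES KEY: hypothetical helper that prints the value stored
# under KEY in a Slurm TRES string, or returns 1 if KEY is absent.
tres_get() {
  local tres="$1" key="$2" pair
  local IFS=','                      # split the TRES string on commas
  for pair in $tres; do
    if [ "${pair%%=*}" = "$key" ]; then
      printf '%s\n' "${pair#*=}"     # everything after the first "="
      return 0
    fi
  done
  return 1
}

# Sample TRES string (assumed for illustration)
tres="cpu=4,mem=16G,node=1,gres/gpu=2"
echo "cpus=$(tres_get "$tres" cpu) mem=$(tres_get "$tres" mem) gpus=$(tres_get "$tres" gres/gpu)"
# prints: cpus=4 mem=16G gpus=2
```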

Quick Start

The fastest way to get started is with the interactive setup script:

git clone https://github.com/thediymaker/slurm-history-ingestor.git
cd slurm-history-ingestor
chmod +x setup.sh
./setup.sh

The script will guide you through:

  • Choosing between sacct mode (recommended) or API mode
  • Database configuration and running all 3 migrations
  • Environment variable setup (.env file creation)
  • Building the binary from source
  • Optional systemd service installation

What you'll need before running the script:

  • PostgreSQL 13+ (can be remote)
  • Slurm 20.11+ with sacct command access
  • PostgreSQL client (psql) installed locally
  • Go 1.22+ (for building from source)

Installation takes about 5 minutes with the interactive script. If you prefer not to build from source or want more control, see the manual installation below.


Prerequisites

Database Requirements

| Component | Version | Notes |
|---|---|---|
| PostgreSQL Server | 13+ | Can be on a different machine |
| PostgreSQL Client | Any | For running migrations (psql command) |

Slurm Requirements

| Component | Version | Notes |
|---|---|---|
| Slurm | 20.11+ | Must have sacct command available |
| Access Level | User | Service must run as user with sacct permissions |

Build Requirements (for setup.sh script)

The interactive setup script builds from source, so you'll need:

  • Go 1.22+
  • sqlc (automatically installed by the script)

Installing PostgreSQL Client

Install psql on the machine where you'll run the ingestor. On Debian or Ubuntu:

sudo apt update && sudo apt install -y postgresql-client

Installation (Manual Method)

For production deployments or if you prefer not to build from source, use the pre-built binary and follow these manual steps.

Note:

Setup Script vs Manual Installation:

  • Setup Script (setup.sh) – Builds from source, runs all migrations, and can optionally set up systemd. Best for development or if you want everything automated.
  • Manual Installation – Uses pre-built binary, gives you full control over each step. Best for production deployments.

Step 1: Download the Binary

# Create installation directory
sudo mkdir -p /opt/slurm-ingestor
cd /opt/slurm-ingestor

# Download the latest release (Linux x64)
sudo wget https://github.com/thediymaker/slurm-history-ingestor/releases/latest/download/slurm-ingestor-linux-amd64 -O slurm-ingestor
sudo chmod +x slurm-ingestor

Step 2: Set Up the Database

Apply all three migrations in order. Choose one of these methods:

Method A: Using local files (if you cloned the repo)

psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/001_init.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/002_add_gpu_fields.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/003_add_gpu_metrics.sql

Method B: Download migrations directly

wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/db/migrations/001_init.sql
wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/db/migrations/002_add_gpu_fields.sql
wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/db/migrations/003_add_gpu_metrics.sql

psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f 001_init.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f 002_add_gpu_fields.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f 003_add_gpu_metrics.sql

Step 3: Configure the Service

Get the configuration template and customize it:

# Download the example config (or copy from repo if cloned)
sudo wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/.env.example -O .env

# Edit with your database and cluster information
sudo vim .env

# Secure the file
sudo chmod 600 .env

Minimal configuration example:

# Database connection
DATABASE_URL=postgres://slurm_user:yourpassword@db-host:5432/slurm_history?sslmode=disable

# Cluster identification
CLUSTER_NAME=production-hpc

# Use sacct mode (recommended)
INGEST_MODE=sacct
SACCT_PATH=/usr/bin/sacct

# Sync every 5 minutes
SYNC_INTERVAL=300

# Start syncing from this date on first run
INITIAL_SYNC_DATE=2024-01-01
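
Before starting the service, it can help to confirm that the two required variables (DATABASE_URL and CLUSTER_NAME) are present and non-empty. This is a minimal sketch, assuming the .env file uses plain KEY=value lines as in the example above. check_env is a hypothetical helper, not something shipped with the project.

```shell
#!/usr/bin/env bash
# check_env FILE: hypothetical helper that fails if a required variable
# is missing or empty in a KEY=value style env file.
check_env() {
  local file="$1" status=0 var
  for var in DATABASE_URL CLUSTER_NAME; do
    if ! grep -Eq "^${var}=.+" "$file"; then
      echo "missing or empty: ${var}" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (run as a user who can read the file):
#   check_env /opt/slurm-ingestor/.env && echo "required settings present"
```

The check only looks at whether the variables are set; it does not validate the connection string or the cluster name.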

Step 4: Create the Systemd Service

Create /etc/systemd/system/slurm-ingestor.service:

sudo vim /etc/systemd/system/slurm-ingestor.service

Add this content:

[Unit]
Description=Slurm History Ingestor
After=network.target remote-fs.target munge.service
Requires=remote-fs.target munge.service

[Service]
Type=simple
User=slurm
Group=slurm
WorkingDirectory=/opt/slurm-ingestor
ExecStart=/opt/slurm-ingestor/slurm-ingestor
Restart=on-failure
RestartSec=10
EnvironmentFile=/opt/slurm-ingestor/.env

[Install]
WantedBy=multi-user.target

Step 5: Start the Service

# Reload systemd to recognize the new service
sudo systemctl daemon-reload

# Enable and start the service
sudo systemctl enable --now slurm-ingestor

# Check that it's running
systemctl status slurm-ingestor

# Watch the logs
journalctl -u slurm-ingestor -f

Configuration Reference

Configure the ingestor using environment variables in your .env file.

Required Settings

| Variable | Description | Example |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | postgres://user:pass@localhost:5432/slurm_history?sslmode=disable |
| CLUSTER_NAME | Unique identifier for this cluster | production-hpc |

Ingest Mode Selection

Choose sacct mode (recommended) or API mode:

Sacct Mode – Direct command-line interface (faster and more reliable)

| Variable | Description | Default |
|---|---|---|
| INGEST_MODE | Set to sacct | - |
| SACCT_PATH | Path to sacct binary | sacct |

Note:

Sacct mode requires running on a Slurm node with sacct command access. This is the recommended mode for production.

API Mode – REST API interface (for remote access)

| Variable | Description | Example |
|---|---|---|
| INGEST_MODE | Set to api | - |
| SLURM_SERVER | Slurm REST API URL | http://slurm-head:6820 |
| SLURM_API_ACCOUNT | Username for API authentication | slurm_api_user |
| SLURM_API_TOKEN | JWT token for authentication | eyJhbGc... |
| SLURM_API_VERSION | API version string | v0.0.41 |
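
To check API-mode settings before starting the ingestor, you can query slurmrestd directly. This is a hedged sketch. It assumes your slurmrestd accepts JWT authentication through the X-SLURM-USER-NAME and X-SLURM-USER-TOKEN headers and exposes a ping endpoint under /slurm/<version>/. The function names are hypothetical, and this is not necessarily how the ingestor issues its own requests.

```shell
#!/usr/bin/env bash
# ping_url SERVER VERSION: hypothetical helper that builds the ping URL,
# dropping a trailing slash from SERVER if present.
ping_url() {
  printf '%s/slurm/%s/ping\n' "${1%/}" "$2"
}

# slurm_ping: call the ping endpoint using the variables from .env.
# Assumes slurmrestd is configured for JWT header authentication.
slurm_ping() {
  curl --fail --silent --show-error \
    -H "X-SLURM-USER-NAME: ${SLURM_API_ACCOUNT}" \
    -H "X-SLURM-USER-TOKEN: ${SLURM_API_TOKEN}" \
    "$(ping_url "$SLURM_SERVER" "$SLURM_API_VERSION")"
}

# Usage, after exporting the four SLURM_* variables in your shell:
#   slurm_ping && echo "API reachable"
```

A connection error here points at SLURM_SERVER or slurmrestd itself, while an HTTP 401 response points at the account or token.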

Sync Behavior

| Variable | Default | Description |
|---|---|---|
| SYNC_INTERVAL | 300 | Seconds between sync cycles |
| INITIAL_SYNC_DATE | 2024-01-01 | Start date for first sync (YYYY-MM-DD format) |
| CHUNK_HOURS | 24 | Hours of data to fetch per request |
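
To show what CHUNK_HOURS means in practice, the sketch below splits a time range into consecutive windows of at most that many hours, one window per request. It works in epoch seconds to stay simple. The function name chunk_windows is hypothetical, and the ingestor's actual windowing logic (including its lookback handling) is not shown in this document and may differ.

```shell
#!/usr/bin/env bash
# chunk_windows START_EPOCH END_EPOCH CHUNK_HOURS: hypothetical helper that
# prints one "start end" pair per line, each covering at most CHUNK_HOURS.
chunk_windows() {
  local start="$1" end="$2" step=$(( $3 * 3600 )) next
  while [ "$start" -lt "$end" ]; do
    next=$(( start + step ))
    if [ "$next" -gt "$end" ]; then
      next="$end"                    # final window may be shorter
    fi
    echo "$start $next"
    start="$next"
  done
}

# Two days of history at CHUNK_HOURS=24 yields two windows:
chunk_windows 0 172800 24
# prints:
#   0 86400
#   86400 172800
```

With GNU date, an epoch value can be turned into a timestamp for sacct with date -u -d "@86400" +%Y-%m-%dT%H:%M:%S. A smaller CHUNK_HOURS means more requests with less data each, which may help on busy clusters.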

Advanced Options

| Variable | Default | Description |
|---|---|---|
| HTTP_TIMEOUT | 120 | API request timeout in seconds (API mode only) |
| DEBUG | false | Enable verbose logging for troubleshooting |

Alternative Installation Methods

Docker Deployment

Note:

Docker is primarily useful for API mode or when you can mount the sacct binary into the container. For most deployments, the binary + systemd approach is simpler and more reliable.

Step 1: Clone and configure

git clone https://github.com/thediymaker/slurm-history-ingestor.git
cd slurm-history-ingestor
cp .env.example .env
vim .env

Step 2: Run database migrations

If using an external PostgreSQL server:

psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/001_init.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/002_add_gpu_fields.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/003_add_gpu_metrics.sql

If using Docker Compose with bundled PostgreSQL:

docker compose up -d postgres
docker compose exec postgres psql -U slurm_user -d slurm_history -f /docker-entrypoint-initdb.d/001_init.sql
docker compose exec postgres psql -U slurm_user -d slurm_history -f /docker-entrypoint-initdb.d/002_add_gpu_fields.sql
docker compose exec postgres psql -U slurm_user -d slurm_history -f /docker-entrypoint-initdb.d/003_add_gpu_metrics.sql

Step 3: Start the ingestor

docker compose up -d ingestor
docker compose logs -f ingestor

Building from Source

For development or custom builds:

# Clone the repository
git clone https://github.com/thediymaker/slurm-history-ingestor.git
cd slurm-history-ingestor

# Generate database code
go install github.com/sqlc-dev/sqlc/cmd/sqlc@latest
sqlc generate

# Build the binary
go mod tidy
go build -o slurm-ingestor ./cmd/ingest

# Run migrations
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/001_init.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/002_add_gpu_fields.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/003_add_gpu_metrics.sql

# Configure and run
cp .env.example .env
vim .env
./slurm-ingestor

Storage Planning

Each job record uses approximately 500 bytes of storage:

| Job Count | Estimated Storage |
|---|---|
| 100,000 | ~50 MB |
| 1 million | ~500 MB |
| 10 million | ~5 GB |
| 100 million | ~50 GB |

Recommendations for large deployments:

  • Use a dedicated PostgreSQL server for >1M jobs
  • Consider table partitioning by submit_time for >10M jobs
  • Plan for index overhead (roughly 30-40% of table size)
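
The estimates above are simple arithmetic: job count times roughly 500 bytes, plus index overhead if you want to include it. The sketch below does that calculation in decimal megabytes. estimate_mb is a hypothetical helper, and the 35% figure is just the midpoint of the 30-40% range mentioned above, so treat the results as rough planning numbers.

```shell
#!/usr/bin/env bash
# estimate_mb JOBS [INDEX_PCT]: hypothetical helper that prints a rough size
# in decimal MB at ~500 bytes per job, plus optional index overhead in percent.
estimate_mb() {
  local jobs="$1" index_pct="${2:-0}" bytes_per_job=500
  echo $(( jobs * bytes_per_job * (100 + index_pct) / 100 / 1000000 ))
}

estimate_mb 1000000        # table only, 1 million jobs: prints 500
estimate_mb 10000000 35    # 10 million jobs with 35% index overhead: prints 6750
```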

Troubleshooting

Common Issues

| Problem | Solution |
|---|---|
| Connection Refused | Verify SLURM_SERVER URL is correct and slurmrestd is running |
| Authentication Failed | Check that SLURM_API_ACCOUNT exists and token is valid |
| No Jobs Syncing | Enable DEBUG=true and verify CLUSTER_NAME matches your cluster |
| Migration Errors | Tables may already exist from previous setup – usually safe to ignore |
| Sacct Permission Denied | Ensure the service user has permissions to run sacct |
| Service Won't Start | Check journalctl -u slurm-ingestor -xe for detailed error messages |

Debugging Tips

  1. Enable debug logging:

    # Add to .env file
    DEBUG=true
    
    # Restart service
    sudo systemctl restart slurm-ingestor
    
  2. Watch logs in real-time:

    journalctl -u slurm-ingestor -f
    
  3. Test database connection:

    # Assumes DATABASE_URL is exported in your current shell
    psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM slurm_jobs;"
    
  4. Test sacct access:

    sudo -u slurm sacct --starttime now-1hour --format=JobID,JobName,State
    

Getting Help

If you encounter issues not covered here:

  • Check the GitHub Issues
  • Review systemd logs: journalctl -u slurm-ingestor -n 100
  • Verify your .env configuration matches the examples above