Slurm History Ingestor
Overview
The Slurm History Ingestor is a standalone Go service that syncs job history from Slurm HPC clusters into PostgreSQL for analytics and reporting.
Key Features:
- Incremental Syncing – Fetches only jobs added since the last sync
- Robust Data Handling – Uses a lookback window to catch out-of-order jobs
- Data Normalization – Converts TRES strings to query-friendly columns
- Multi-Cluster Support – Tags records with the cluster name for centralized databases
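To make the normalization feature concrete, here is a minimal sketch of how a Slurm TRES string could be split into typed fields. It is an illustration only, not the ingestor's actual code: the `TRES` struct, `parseTRES`, and `parseMemMB` are hypothetical names, the field set is arbitrary, and treating a suffix-less memory value as MiB is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// TRES holds a few commonly queried resources (illustrative field set).
type TRES struct {
	CPUs  int64
	MemMB int64
	Nodes int64
	GPUs  int64
}

// parseMemMB converts a memory value such as "16G" or "4000M" to MiB.
// A value without a suffix is assumed to already be in MiB.
func parseMemMB(s string) (int64, error) {
	if s == "" {
		return 0, fmt.Errorf("empty memory value")
	}
	mult := 1.0
	switch s[len(s)-1] {
	case 'K':
		mult, s = 1.0/1024, s[:len(s)-1]
	case 'M':
		s = s[:len(s)-1]
	case 'G':
		mult, s = 1024, s[:len(s)-1]
	case 'T':
		mult, s = 1024*1024, s[:len(s)-1]
	}
	v, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, err
	}
	return int64(v * mult), nil
}

// parseTRES reads comma-separated key=value pairs. Keys it does not know
// (billing, energy, typed GPUs such as gres/gpu:a100, ...) are ignored.
func parseTRES(s string) (TRES, error) {
	var out TRES
	for _, kv := range strings.Split(s, ",") {
		key, val, ok := strings.Cut(kv, "=")
		if !ok {
			continue // skip empty or malformed entries
		}
		var err error
		switch key {
		case "cpu":
			out.CPUs, err = strconv.ParseInt(val, 10, 64)
		case "node":
			out.Nodes, err = strconv.ParseInt(val, 10, 64)
		case "mem":
			out.MemMB, err = parseMemMB(val)
		case "gres/gpu":
			out.GPUs, err = strconv.ParseInt(val, 10, 64)
		}
		if err != nil {
			return out, fmt.Errorf("tres entry %q: %w", kv, err)
		}
	}
	return out, nil
}

func main() {
	got, err := parseTRES("cpu=4,mem=16G,node=1,gres/gpu=2")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", got) // prints {CPUs:4 MemMB:16384 Nodes:1 GPUs:2}
}
```

Storing these values as separate numeric columns is what lets you filter and aggregate with plain SQL instead of string matching.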
Quick Start
The fastest way to get started is with the interactive setup script:
git clone https://github.com/thediymaker/slurm-history-ingestor.git
cd slurm-history-ingestor
chmod +x setup.sh
./setup.sh
The script will guide you through:
- Choosing between sacct mode (recommended) and API mode
- Database configuration and running all 3 migrations
- Environment variable setup (.env file creation)
- Building the binary from source
- Optional systemd service installation
What you'll need before running the script:
- PostgreSQL 13+ (can be remote)
- Slurm 20.11+ with sacct command access
- PostgreSQL client (psql) installed locally
- Go 1.22+ (for building from source)
Installation takes about 5 minutes with the interactive script. If you prefer not to build from source or want more control, see the manual installation below.
Prerequisites
Database Requirements
| Component | Version | Notes |
|---|---|---|
| PostgreSQL Server | 13+ | Can be on a different machine |
| PostgreSQL Client | Any | For running migrations (psql command) |
Slurm Requirements
| Component | Version | Notes |
|---|---|---|
| Slurm | 20.11+ | Must have sacct command available |
| Access Level | User | Service must run as a user with sacct permissions |
Build Requirements (for setup.sh script)
The interactive setup script builds from source, so you'll need:
- Go 1.22+
- sqlc (automatically installed by the script)
Installing PostgreSQL Client
Install psql on the machine where you'll run the ingestor (Debian/Ubuntu shown):
sudo apt update && sudo apt install -y postgresql-client
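If you want to confirm the prerequisites before installing, a small check like the one below can help. This is a rough sketch rather than part of the project: `missingTools` is a hypothetical helper, and it only verifies that the commands are on PATH, not that their versions meet the minimums listed above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// missingTools returns the command names that cannot be found on PATH.
func missingTools(names []string) []string {
	var missing []string
	for _, name := range names {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// sacct and psql are needed for ingesting and migrations;
	// go is only needed when building from source.
	missing := missingTools([]string{"sacct", "psql", "go"})
	if len(missing) > 0 {
		fmt.Fprintln(os.Stderr, "missing tools:", missing)
		os.Exit(1)
	}
	fmt.Println("all required tools found on PATH")
}
```

Run it as the same user the service will use, since PATH and sacct permissions can differ between accounts.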
Installation (Manual Method)
For production deployments or if you prefer not to build from source, use the pre-built binary and follow these manual steps.
Note:
Setup Script vs Manual Installation:
- Setup Script (setup.sh) – Builds from source, runs all migrations, and can optionally set up systemd. Best for development or if you want everything automated.
- Manual Installation – Uses pre-built binary, gives you full control over each step. Best for production deployments.
Step 1: Download the Binary
# Create installation directory
sudo mkdir -p /opt/slurm-ingestor
cd /opt/slurm-ingestor
# Download the latest release (Linux x64)
sudo wget https://github.com/thediymaker/slurm-history-ingestor/releases/latest/download/slurm-ingestor-linux-amd64 -O slurm-ingestor
sudo chmod +x slurm-ingestor
Step 2: Set Up the Database
Apply all three migrations in order. Choose one of these methods:
Method A: Using local files (if you cloned the repo)
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/001_init.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/002_add_gpu_fields.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/003_add_gpu_metrics.sql
Method B: Download migrations directly
wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/db/migrations/001_init.sql
wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/db/migrations/002_add_gpu_fields.sql
wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/db/migrations/003_add_gpu_metrics.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f 001_init.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f 002_add_gpu_fields.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f 003_add_gpu_metrics.sql
Step 3: Configure the Service
Get the configuration template and customize it:
# Download the example config (or copy from repo if cloned)
sudo wget https://raw.githubusercontent.com/thediymaker/slurm-history-ingestor/main/.env.example -O .env
# Edit with your database and cluster information
sudo vim .env
# Secure the file
sudo chmod 600 .env
Minimal configuration example:
# Database connection
DATABASE_URL=postgres://slurm_user:yourpassword@db-host:5432/slurm_history?sslmode=disable
# Cluster identification
CLUSTER_NAME=production-hpc
# Use sacct mode (recommended)
INGEST_MODE=sacct
SACCT_PATH=/usr/bin/sacct
# Sync every 5 minutes
SYNC_INTERVAL=300
# Start syncing from this date on first run
INITIAL_SYNC_DATE=2024-01-01
Step 4: Create the Systemd Service
Create /etc/systemd/system/slurm-ingestor.service:
sudo vim /etc/systemd/system/slurm-ingestor.service
Add this content:
[Unit]
Description=Slurm History Ingestor
After=network.target remote-fs.target munge.service
Requires=remote-fs.target munge.service
[Service]
Type=simple
User=slurm
Group=slurm
WorkingDirectory=/opt/slurm-ingestor
ExecStart=/opt/slurm-ingestor/slurm-ingestor
Restart=on-failure
RestartSec=10
EnvironmentFile=/opt/slurm-ingestor/.env
[Install]
WantedBy=multi-user.target
Step 5: Start the Service
# Reload systemd to recognize the new service
sudo systemctl daemon-reload
# Enable and start the service
sudo systemctl enable --now slurm-ingestor
# Check that it's running
systemctl status slurm-ingestor
# Watch the logs
journalctl -u slurm-ingestor -f
Configuration Reference
Configure the ingestor using environment variables in your .env file.
Required Settings
| Variable | Description | Example |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | postgres://user:pass@localhost:5432/slurm_history?sslmode=disable |
| CLUSTER_NAME | Unique identifier for this cluster | production-hpc |
Ingest Mode Selection
Choose sacct mode (recommended) or API mode:
Sacct Mode – Direct command-line interface (faster and more reliable)
| Variable | Description | Default |
|---|---|---|
| INGEST_MODE | Set to sacct | - |
| SACCT_PATH | Path to sacct binary | sacct |
Note:
Sacct mode requires running on a Slurm node with sacct command access. This is the recommended mode for production.
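For a sense of what sacct mode involves, the sketch below assembles arguments for one time window and parses pipe-delimited output. It is an assumption-laden illustration, not the ingestor's implementation: `sacctArgs`, `parseRows`, and `JobRow` are hypothetical names, and the three-column format list is borrowed from the debugging tip later in this document. The flags themselves are standard sacct options.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// JobRow is one parsed line of sacct output (illustrative columns only).
type JobRow struct {
	JobID, JobName, State string
}

// sacctArgs builds the arguments for a single time window.
// start and end use a format sacct accepts, e.g. 2024-01-01T00:00:00.
func sacctArgs(start, end string) []string {
	return []string{
		"--allusers", "--allocations", "--noheader", "--parsable2",
		"--starttime", start, "--endtime", end,
		"--format=JobID,JobName,State",
	}
}

// parseRows splits --parsable2 output on "|". A job name that itself
// contains "|" would break this simple split.
func parseRows(out string) ([]JobRow, error) {
	var rows []JobRow
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		f := strings.Split(line, "|")
		if len(f) != 3 {
			return nil, fmt.Errorf("expected 3 fields, got %d in %q", len(f), line)
		}
		rows = append(rows, JobRow{JobID: f[0], JobName: f[1], State: f[2]})
	}
	return rows, sc.Err()
}

func main() {
	fmt.Println(sacctArgs("2024-01-01T00:00:00", "2024-01-02T00:00:00"))

	// Sample text standing in for real sacct output.
	rows, err := parseRows("101|train|COMPLETED\n102|eval|FAILED\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rows), "rows parsed")
}
```

In a real run, the arguments would be passed to the binary named by SACCT_PATH, for example with `exec.Command` from `os/exec`, and its standard output fed to the parser.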
API Mode – REST API interface (for remote access)
| Variable | Description | Example |
|---|---|---|
| INGEST_MODE | Set to api | - |
| SLURM_SERVER | Slurm REST API URL | http://slurm-head:6820 |
| SLURM_API_ACCOUNT | Username for API authentication | slurm_api_user |
| SLURM_API_TOKEN | JWT token for authentication | eyJhbGc... |
| SLURM_API_VERSION | API version string | v0.0.41 |
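The following sketch shows how these four settings could combine into a single authenticated request to slurmrestd. It assumes the job history comes from the slurmdb jobs endpoint and that authentication uses the `X-SLURM-USER-NAME` and `X-SLURM-USER-TOKEN` headers; whether the ingestor calls exactly this endpoint is not stated here, and `newJobsRequest` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newJobsRequest builds a GET request for <server>/slurmdb/<version>/jobs
// with the slurmrestd JWT authentication headers set.
func newJobsRequest(server, version, user, token string) (*http.Request, error) {
	u, err := url.JoinPath(server, "slurmdb", version, "jobs")
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-SLURM-USER-NAME", user)   // SLURM_API_ACCOUNT
	req.Header.Set("X-SLURM-USER-TOKEN", token) // SLURM_API_TOKEN
	req.Header.Set("Accept", "application/json")
	return req, nil
}

func main() {
	// Placeholder values; the token here is not a real JWT.
	req, err := newJobsRequest("http://slurm-head:6820", "v0.0.41", "slurm_api_user", "example-token")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Sending the request with an `http.Client` whose `Timeout` is set from HTTP_TIMEOUT keeps a slow slurmrestd from stalling a sync cycle.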
Sync Behavior
| Variable | Default | Description |
|---|---|---|
| SYNC_INTERVAL | 300 | Seconds between sync cycles |
| INITIAL_SYNC_DATE | 2024-01-01 | Start date for first sync (YYYY-MM-DD format) |
| CHUNK_HOURS | 24 | Hours of data to fetch per request |
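As a rough illustration of how these settings interact, the sketch below splits a time range into CHUNK_HOURS-sized windows and shows one way a lookback could shift the resume point on later syncs. The names (`Window`, `chunkWindows`, `resumeFrom`, `mustDate`) are hypothetical, and the lookback length is an assumption because this document does not state it.

```go
package main

import (
	"fmt"
	"time"
)

// Window is one half-open fetch interval [Start, End).
type Window struct{ Start, End time.Time }

// mustDate parses a YYYY-MM-DD date (the INITIAL_SYNC_DATE format) as UTC midnight.
func mustDate(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

// resumeFrom steps back from the last synced time so that jobs recorded
// late or out of order are read again on the next cycle.
func resumeFrom(lastSync time.Time, lookback time.Duration) time.Time {
	return lastSync.Add(-lookback)
}

// chunkWindows splits [start, end) into consecutive windows of at most
// chunkHours hours; the final window is shortened to end exactly at end.
func chunkWindows(start, end time.Time, chunkHours int) []Window {
	var ws []Window
	if chunkHours <= 0 {
		return ws
	}
	step := time.Duration(chunkHours) * time.Hour
	for s := start; s.Before(end); s = s.Add(step) {
		e := s.Add(step)
		if e.After(end) {
			e = end
		}
		ws = append(ws, Window{Start: s, End: e})
	}
	return ws
}

func main() {
	start := mustDate("2024-01-01")  // INITIAL_SYNC_DATE
	end := start.Add(60 * time.Hour) // stand-in for the current time
	for _, w := range chunkWindows(start, end, 24) {
		fmt.Println(w.Start.Format(time.RFC3339), "->", w.End.Format(time.RFC3339))
	}
	// Hypothetical 2-hour lookback applied to the last synced time.
	fmt.Println("next sync would start at", resumeFrom(end, 2*time.Hour).Format(time.RFC3339))
}
```

With a 60-hour range and 24-hour chunks this yields three windows, the last one 12 hours long. Because a lookback re-reads jobs that were already stored, writes would need to be idempotent (for example an upsert keyed on cluster and job ID).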
Advanced Options
| Variable | Default | Description |
|---|---|---|
| HTTP_TIMEOUT | 120 | API request timeout in seconds (API mode only) |
| DEBUG | false | Enable verbose logging for troubleshooting |
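The sketch below shows one way a Go service could read these variables and apply the documented defaults. It is not the ingestor's actual loader: `Config`, `loadConfig`, and `envInt` are hypothetical names, only a subset of the variables is covered, and rejecting any INGEST_MODE other than sacct or api is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config holds a subset of the settings described above.
type Config struct {
	DatabaseURL  string
	ClusterName  string
	IngestMode   string
	SyncInterval time.Duration
	ChunkHours   int
}

// envInt reads a positive integer, falling back to def when the variable is unset.
func envInt(getenv func(string) string, key string, def int) (int, error) {
	v := getenv(key)
	if v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("%s must be a positive integer, got %q", key, v)
	}
	return n, nil
}

// loadConfig takes the lookup function as a parameter so it can be
// exercised without touching the real environment.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		DatabaseURL: getenv("DATABASE_URL"),
		ClusterName: getenv("CLUSTER_NAME"),
		IngestMode:  getenv("INGEST_MODE"),
	}
	if cfg.DatabaseURL == "" || cfg.ClusterName == "" {
		return cfg, errors.New("DATABASE_URL and CLUSTER_NAME are required")
	}
	if cfg.IngestMode != "sacct" && cfg.IngestMode != "api" {
		return cfg, fmt.Errorf("INGEST_MODE must be sacct or api, got %q", cfg.IngestMode)
	}
	secs, err := envInt(getenv, "SYNC_INTERVAL", 300)
	if err != nil {
		return cfg, err
	}
	cfg.SyncInterval = time.Duration(secs) * time.Second
	if cfg.ChunkHours, err = envInt(getenv, "CHUNK_HOURS", 24); err != nil {
		return cfg, err
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("cluster=%s mode=%s interval=%s chunk=%dh\n",
		cfg.ClusterName, cfg.IngestMode, cfg.SyncInterval, cfg.ChunkHours)
}
```

Failing fast on a missing or malformed value at startup surfaces configuration mistakes in the service logs immediately rather than partway through a sync.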
Alternative Installation Methods
Docker Deployment
Note:
Docker is primarily useful for API mode or when you can mount the sacct binary into the container. For most deployments, the binary + systemd approach is simpler and more reliable.
Step 1: Clone and configure
git clone https://github.com/thediymaker/slurm-history-ingestor.git
cd slurm-history-ingestor
cp .env.example .env
vim .env
Step 2: Run database migrations
If using an external PostgreSQL server:
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/001_init.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/002_add_gpu_fields.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/003_add_gpu_metrics.sql
If using Docker Compose with bundled PostgreSQL:
docker compose up -d postgres
docker compose exec postgres psql -U slurm_user -d slurm_history -f /docker-entrypoint-initdb.d/001_init.sql
docker compose exec postgres psql -U slurm_user -d slurm_history -f /docker-entrypoint-initdb.d/002_add_gpu_fields.sql
docker compose exec postgres psql -U slurm_user -d slurm_history -f /docker-entrypoint-initdb.d/003_add_gpu_metrics.sql
Step 3: Start the ingestor
docker compose up -d ingestor
docker compose logs -f ingestor
Building from Source
For development or custom builds:
# Clone the repository
git clone https://github.com/thediymaker/slurm-history-ingestor.git
cd slurm-history-ingestor
# Generate database code
go install github.com/sqlc-dev/sqlc/cmd/sqlc@latest
sqlc generate
# Build the binary
go mod tidy
go build -o slurm-ingestor ./cmd/ingest
# Run migrations
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/001_init.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/002_add_gpu_fields.sql
psql -h YOUR_DB_HOST -U YOUR_DB_USER -d YOUR_DB_NAME -f db/migrations/003_add_gpu_metrics.sql
# Configure and run
cp .env.example .env
vim .env
./slurm-ingestor
Storage Planning
Each job record uses approximately 500 bytes of storage:
| Job Count | Estimated Storage |
|---|---|
| 100,000 | ~50 MB |
| 1 million | ~500 MB |
| 10 million | ~5 GB |
| 100 million | ~50 GB |
Recommendations for large deployments:
- Use a dedicated PostgreSQL server for >1M jobs
- Consider table partitioning by submit_time for >10M jobs
- Plan for index overhead (roughly 30-40% of table size)
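To size a deployment that falls between the rows of the table, the arithmetic can be sketched as follows. It simply multiplies the approximate 500 bytes per job by the job count and adds an index overhead fraction; `estimateBytes` is a hypothetical helper, and the 35% used in `main` is just a midpoint of the range quoted above.

```go
package main

import "fmt"

// bytesPerJob is the approximate per-record size quoted in this section.
const bytesPerJob = 500

// estimateBytes returns the estimated table size and the size including
// indexes, where indexOverhead is a fraction of the table size (e.g. 0.35).
func estimateBytes(jobs int64, indexOverhead float64) (table, total int64) {
	table = jobs * bytesPerJob
	total = table + int64(float64(table)*indexOverhead)
	return table, total
}

func main() {
	for _, jobs := range []int64{100_000, 1_000_000, 10_000_000, 100_000_000} {
		table, total := estimateBytes(jobs, 0.35)
		fmt.Printf("%d jobs: table ~%.0f MB, with indexes ~%.0f MB\n",
			jobs, float64(table)/1e6, float64(total)/1e6)
	}
}
```

Treat the result as a planning figure only; actual row size depends on job name lengths, TRES contents, and PostgreSQL page overhead.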
Troubleshooting
Common Issues
| Problem | Solution |
|---|---|
| Connection Refused | In API mode, verify that the SLURM_SERVER URL is correct and slurmrestd is running |
| Authentication Failed | In API mode, check that SLURM_API_ACCOUNT exists and the token is valid |
| No Jobs Syncing | Enable DEBUG=true and verify CLUSTER_NAME matches your cluster |
| Migration Errors | Tables may already exist from a previous setup – usually safe to ignore |
| Sacct Permission Denied | Ensure the service user has permissions to run sacct |
| Service Won't Start | Check journalctl -u slurm-ingestor -xe for detailed error messages |
Debugging Tips
- Enable debug logging:
# Add to .env file
DEBUG=true
# Restart service
sudo systemctl restart slurm-ingestor
- Watch logs in real-time:
journalctl -u slurm-ingestor -f
- Test database connection:
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM slurm_jobs;"
- Test sacct access:
sudo -u slurm sacct --starttime now-1hour --format=JobID,JobName,State
Getting Help
If you encounter issues not covered here:
- Check the GitHub Issues
- Review systemd logs: journalctl -u slurm-ingestor -n 100
- Verify your .env configuration matches the examples above