3 Training with SLURM

Note

This is a completely optional feature. The SLURM integration is designed for users with access to HPC clusters who want to scale their experiments efficiently. All functionality can also be run locally without SLURM.

With your dataset collected, the next step is training models. The SLURM wrapper handles pre-training, fine-tuning, and multi-seed experiments with organized output directories and checkpoint resumption.

3.1 A Complete Training Pipeline

A full example pipeline is available at slurm/experiments/001_example, demonstrating the workflow from initial training to evaluation-ready models.
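
The two submission scripts referenced in this walkthrough sit in that directory; a partial, illustrative layout (other files in the experiment folder are omitted):

slurm/experiments/001_example/
├── 000_pretrain1_0.sh       # pre-training, seed 0
├── ...
└── 012_postrain32_2.sh      # post-training (fine-tuning), seed 2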

3.2 Step 1: Pre-training From Scratch

Train your perception backbone on the collected data:

bash slurm/experiments/001_example/000_pretrain1_0.sh

This starts a pre-training job with random seed 0 (indicated by the last digit in the script name). Output is automatically organized:

outputs/training/001_example/000_pretrain1_0/<year><month><day>_<hour><minute><second>

Checkpoints, logs, and metrics are saved to this directory for tracking training progress.
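
The pre-training script itself follows the same pattern as the post-training script shown in Step 2, just without the posttrain call. A rough sketch of what 000_pretrain1_0.sh may contain (the real script can set additional options, such as the seed, and the SLURM resources here are copied from the post-training example):

#!/usr/bin/bash

source slurm/init.sh

# Backbone architecture; the post-training script in Step 2 reuses these values
export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG image_architecture=regnety_032 lidar_architecture=regnety_032"

# Submit the training job to SLURM
train --cpus-per-task=64 --partition=L40Sday --time=4-00:00:00 --gres=gpu:4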

3.3 Step 2: Post-training (Fine-tuning)

After pre-training completes, fine-tune by adding components like a planning decoder. Start post-training with seed 2:

bash slurm/experiments/001_example/012_postrain32_2.sh

The script structure:

 1  #!/usr/bin/bash
 2
 3  source slurm/init.sh
 4
 5  export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG image_architecture=regnety_032 lidar_architecture=regnety_032"   # Same as in pre-training
 6  export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG use_planning_decoder=true"                                       # Add a planner on top
 7  posttrain outputs/training/001_example/000_pretrain1_0/251018_092144                                                # Load the pre-trained weights
 8
 9  train --cpus-per-task=64 --partition=L40Sday --time=4-00:00:00 --gres=gpu:4

Key points:

  • Lines 5-6: Model architecture configuration, kept consistent with pre-training, with the planning decoder added on top

  • Line 7: The posttrain function (defined in slurm/init.sh) points the training script at the pre-trained weights to load

  • Line 9: SLURM parameters for the fine-tuning job
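
Because the seed is encoded in the last digit of the script name, multi-seed experiments boil down to submitting one script per seed. A hypothetical loop, assuming sibling scripts with suffixes _0, _1, and _2 exist for this experiment:

# Submit the same post-training experiment for three seeds (hypothetical sibling scripts)
for seed in 0 1 2; do
    bash "slurm/experiments/001_example/012_postrain32_${seed}.sh"
done

Each submission writes to its own outputs/training/001_example/012_postrain32_<seed>/... directory, so runs with different seeds stay separated.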

3.4 Handling Crashes: Automatic Resume

Training jobs can crash due to hardware failures, time limits, or out-of-memory (OOM) errors. Recovering requires adding a single line to the script:

#!/usr/bin/bash

source slurm/init.sh

export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG image_architecture=regnety_032 lidar_architecture=regnety_032"
export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG use_planning_decoder=true"
posttrain outputs/training/001_example/000_pretrain1_0/251018_092144
resume outputs/training/001_example/012_postrain32_2/251018_092144   # Resume from this checkpoint

train --cpus-per-task=64 --partition=L40Sday --time=4-00:00:00 --gres=gpu:4

The resume function loads the latest checkpoint from the specified run directory. Once the line is in place, restart training by re-running the same script; no further changes are needed.
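
The argument passed to resume is the timestamped run directory created by the original submission. Since the timestamped names sort chronologically, a quick way to locate the most recent run (a hypothetical helper, assuming the output layout shown above) is:

# Print the newest run directory for this post-training experiment
ls -d outputs/training/001_example/012_postrain32_2/*/ | sort | tail -n 1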

3.5 What’s Next?

With trained models, proceed to:

  • Evaluation: Test models across different scenarios and benchmark performance