Training
Note: This is a completely optional feature. The SLURM integration is designed for users with access to HPC clusters who want to scale their experiments efficiently. All functionality can also be run locally without SLURM.
Overview
A complete example pipeline for training and evaluating a model is provided at slurm/experiments/001_example.
Pre-training
Start pre-training with:
bash slurm/experiments/001_example/000_pretrain1_0.sh
This will create a pre-training session at outputs/training/001_example/000_pretrain1_0/<year><month><day>_<hour><minute><second>, setting the training seed to 0.
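Session directories are named after the moment the session starts, using a two-digit year followed by month, day, hour, minute, and second. A minimal sketch of that naming scheme (the `ts` and `session_dir` variables are illustrative, not part of the tooling):

```shell
# Illustration of the session-directory naming scheme:
# two-digit year, month, day, then hour, minute, second.
ts=$(date +%y%m%d_%H%M%S)
session_dir="outputs/training/001_example/000_pretrain1_0/$ts"
echo "$session_dir"
```

Because of this format, session names sort chronologically as plain strings.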
Post-training
After pre-training, start post-training with the training seed set to 2, as indicated by the last digit of the script name:
bash slurm/experiments/001_example/012_postrain32_2.sh
Its content, explained:
#!/usr/bin/bash
source slurm/init.sh
export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG image_architecture=regnety_032 lidar_architecture=regnety_032" # Same as in pre-training
export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG use_planning_decoder=true" # Add a planner on top
posttrain outputs/training/001_example/000_pretrain1_0/251018_092144 # Specify where the pre-trained model is located
train --cpus-per-task=64 --partition=L40Sday --time=4-00:00:00 --gres=gpu:4
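The two export lines above follow a common shell pattern: each appends key=value overrides to a single space-separated variable, so options compose across lines. A standalone demo of just that pattern, using the option names from the script above:

```shell
# Demo of the append pattern used by the export lines above.
export LEAD_TRAINING_CONFIG="image_architecture=regnety_032 lidar_architecture=regnety_032"
export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG use_planning_decoder=true"
echo "$LEAD_TRAINING_CONFIG"
# → image_architecture=regnety_032 lidar_architecture=regnety_032 use_planning_decoder=true
```

This is why the example scripts can layer extra settings (like the planning decoder) on top of a shared base configuration without repeating it.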
Resume crashed training
As can be seen in slurm/init.sh, when a training run crashes it can be restarted easily by adding a single line to slurm/experiments/001_example/012_postrain32_2.sh:
#!/usr/bin/bash
source slurm/init.sh
export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG image_architecture=regnety_032 lidar_architecture=regnety_032" # Same as in pre-training
export LEAD_TRAINING_CONFIG="$LEAD_TRAINING_CONFIG use_planning_decoder=true" # Add a planner on top
posttrain outputs/training/001_example/000_pretrain1_0/251018_092144 # Specify where the pre-trained model is located
resume outputs/training/001_example/012_postrain32_2/251018_092144 # Specify the training directory of the post-training session
train --cpus-per-task=64 --partition=L40Sday --time=4-00:00:00 --gres=gpu:4
Now you can restart the training as many times as needed without any further changes.
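Since session directory names use the yymmdd_HHMMSS format, they sort chronologically as plain strings, so the latest session directory under a training run (the one you would pass to resume) can be found with an ordinary sort. A minimal sketch, using hypothetical session names:

```shell
# Sketch: pick the newest session directory under a training run.
# The two directories created here are hypothetical examples.
run_dir="outputs/training/001_example/012_postrain32_2"
mkdir -p "$run_dir/251017_120000" "$run_dir/251018_092144"

# Timestamped names sort chronologically, so the last entry is the newest.
latest=$(ls -d "$run_dir"/*/ | sort | tail -n 1)
echo "$latest"
# → outputs/training/001_example/012_postrain32_2/251018_092144/
```

This can save you from copying the timestamp by hand when wiring up the resume line.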