2 Data Collection with SLURM

Note

This is a completely optional feature. The SLURM integration is designed for users with access to HPC clusters who want to scale their experiments efficiently. All functionality can also be run locally without SLURM.

Before models can be trained, data must be collected. The data collection system orchestrates CARLA simulation jobs across the cluster, handling job submission, failure monitoring, and automatic resubmission of crashed routes.
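
To make the orchestration concrete, here is a minimal sketch of a monitor-and-resubmit loop, assuming the scripts/ and results/ layout shown later in this section and a results JSON with a "status" field; these layout details are assumptions, not necessarily the repository's actual conventions.

# Minimal sketch of failure monitoring with automatic resubmission.
# The results layout and "status" field are assumptions.
import getpass
import json
import subprocess
import time
from pathlib import Path

def wait_for_jobs(poll_seconds: int = 60) -> None:
    """Block until none of this user's SLURM jobs remain in the queue."""
    user = getpass.getuser()
    while subprocess.run(["squeue", "-h", "-u", user], capture_output=True,
                         text=True, check=True).stdout.strip():
        time.sleep(poll_seconds)

def resubmit_failed(root: Path) -> None:
    """Resubmit every route whose result JSON is missing or not 'Completed'."""
    for script in (root / "scripts").glob("*.sh"):
        result = root / "results" / f"{script.stem}.json"  # assumed naming
        done = (result.exists()
                and json.loads(result.read_text()).get("status") == "Completed")
        if not done:
            subprocess.run(["sbatch", str(script)], check=True)

A launcher can alternate wait_for_jobs() and resubmit_failed(root) until every route has a successful result.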

2.1 Preparing Your Data Collection Run

Configure the following settings in slurm/data_collection/collect_data.py (a sketch of plausible values follows the list):

  • repetitions: Number of times to run each route with different random seeds (for data diversity)

  • partitions: Cluster partition names (e.g., gpu-2080ti, a100-galvani)

  • dataset_name: Descriptive name for the dataset (e.g., carla_leaderboard2_train)
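
For instance, the relevant settings might look like the following near the top of the script; the values shown are placeholders to adapt to your cluster, not recommended defaults.

# Illustrative values only; adjust to your cluster and experiment.
repetitions = 3                              # runs per route, each with a different seed
partitions = ["gpu-2080ti", "a100-galvani"]  # SLURM partitions to submit jobs to
dataset_name = "carla_leaderboard2_train"    # descriptive name for the output dataset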

Using Py123D Data Format:

The --py123d flag enables collection in Py123D format, which provides a unified data representation compatible with other major autonomous driving datasets. This format is useful for:

  • Cross-dataset training and evaluation

  • Combining CARLA data with real-world datasets

  • Standardized data processing pipelines

When using --py123d (see the sketch after this list):

  • Expert agent automatically switches to expert_py123d.py

  • Dataset name becomes carla_leaderboard2_py123d
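
A minimal sketch of how such a flag can be wired up with argparse follows; apart from --py123d and the two behaviors documented above, the identifiers are assumptions rather than the repository's exact code.

# Sketch of the --py123d switch; the non-Py123D defaults are assumed names.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--py123d", action="store_true",
                    help="collect data in the Py123D format")
args = parser.parse_args()

if args.py123d:
    expert_agent = "expert_py123d.py"           # documented behavior
    dataset_name = "carla_leaderboard2_py123d"  # documented behavior
else:
    expert_agent = "expert.py"                  # assumed default name
    dataset_name = "carla_leaderboard2"         # matches the output tree below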

Two additional configuration files control the collection process.

2.2 Launching the Collection

Log into the cluster login node and start collection:

# Standard LEAD format (default)
python3 slurm/data_collection/collect_data.py

# Py123D format for cross-dataset compatibility
python3 slurm/data_collection/collect_data.py --py123d

# Optional: specify custom route and data folders
python3 slurm/data_collection/collect_data.py \
  --route_folder data/custom_routes \
  --root_folder /scratch/datasets/

# Py123D with custom folders
python3 slurm/data_collection/collect_data.py \
  --py123d \
  --route_folder data/custom_routes \
  --root_folder /scratch/datasets/

The script creates a structured output directory (see the job-generation sketch after the tree):

data/carla_leaderboard2
├── data     # Sensor data storage
├── results  # Results JSON files
├── scripts  # Generated SLURM bash scripts
├── stderr   # SLURM stderr logs
└── stdout   # SLURM stdout logs
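
To illustrate how these directories fit together, the sketch below writes one SLURM batch script per route into scripts/, points the logs at stdout/ and stderr/, and submits the script with sbatch. The #SBATCH options and the command inside the script are assumptions; the actual generated scripts will differ.

# Hedged sketch of per-route job generation; the #SBATCH options and the
# placeholder command are assumptions, not the repository's real output.
import subprocess
from pathlib import Path

def submit_route(root: Path, route_id: str, partition: str) -> None:
    script = root / "scripts" / f"{route_id}.sh"
    script.parent.mkdir(parents=True, exist_ok=True)
    script.write_text(f"""#!/bin/bash
#SBATCH --job-name={route_id}
#SBATCH --partition={partition}
#SBATCH --gres=gpu:1
#SBATCH --output={root}/stdout/{route_id}.log
#SBATCH --error={root}/stderr/{route_id}.log
# The actual collection command is repository-specific.
echo "collect route {route_id} here"
""")
    subprocess.run(["sbatch", str(script)], check=True)

Submitting via a written script, rather than sbatch --wrap, keeps a reproducible record of each job under scripts/.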

Note

Data collection can take up to 2 days on 90 GPUs for 9000 routes. Run the script inside screen or tmux to prevent interruption from SSH disconnections.

2.3 Monitoring Your Collection

Check collection progress and identify failures:

python3 slurm/data_collection/print_collect_data_progress.py

Update the root variable in the script to point to your data directory.
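
For a quick standalone check, the snippet below tallies the result files by status. It is a sketch that assumes one results JSON per route with a "status" field, which may not match the script's actual logic.

# Standalone progress tally; assumes one JSON per route in results/ with a
# "status" field, which may differ from the real results format.
import json
from collections import Counter
from pathlib import Path

root = Path("data/carla_leaderboard2")  # point this at your data directory

counts = Counter(
    json.loads(p.read_text()).get("status", "Unknown")
    for p in (root / "results").glob("*.json")
)
total = sum(counts.values())
for status, n in counts.most_common():
    print(f"{status}: {n}/{total} ({100 * n / total:.1f}%)")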

Note

Failure rates below 10% are typical and are primarily caused by simulation crashes or hardware issues. Some scenario types may show higher failure rates (around 50%), which indicates limitations of the expert policy in those specific situations. This is expected behavior: as long as most scenarios keep their failure rates below 10%, the dataset quality remains sufficient for training.

2.4 Cleaning Up Failed Routes

Remove corrupted or incomplete data after collection completes:

python3 slurm/data_collection/delete_failed_routes.py

This filters the dataset to only successfully collected routes.
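
Conceptually, the cleanup removes the sensor-data folder of every route that lacks a successful result, roughly as sketched below; the "Completed" status value and the assumption that data/<route_id>/ mirrors results/<route_id>.json are illustrative, not guaranteed to match the script.

# Sketch of the cleanup idea; the "Completed" status and the mapping from
# results/<route_id>.json to data/<route_id>/ are assumptions.
import json
import shutil
from pathlib import Path

root = Path("data/carla_leaderboard2")

for result in (root / "results").glob("*.json"):
    if json.loads(result.read_text()).get("status") != "Completed":
        route_dir = root / "data" / result.stem
        if route_dir.exists():
            shutil.rmtree(route_dir)  # drop the incomplete sensor data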

Warning

This cleanup step is optional. The training pipeline filters out failed routes automatically. Examining failed routes can reveal expert policy biases and data collection issues.

2.5 What’s Next?

With collected data, proceed to:

  • Training: Train models using the collected data with automatic checkpointing and multi-seed support

  • Evaluation: Test trained models across scenarios and benchmark performance

The SLURM wrapper maintains consistent organization throughout the pipeline.