Training

This guide covers training models on CARLA data for the CARLA Leaderboard.

Note: For training on cross-datasets (NavSim, Waymo), see Cross-Dataset Training.

Prerequisites

Each of the following steps has to be done only once.

1. Prepare Data

We will upload a dataset soon. Stay tuned! In the meantime, follow this tutorial to collect data.

If you collected the data locally, run

cp -r data/expert_debug data/carla_leaderboard2

2. Build Data Buckets

Buckets group training samples by characteristics (e.g., scenarios, towns, weather, road curvature) to enable curriculum learning and balanced batch sampling.

By default we use full_pretrain_bucket_collection for pre-training and full_posttrain_bucket_collection for post-training, i.e., we train uniformly on all samples.

Buckets are built once and stored on disk in the dataset directory. In subsequent runs they are reused automatically, which saves time.
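
For intuition, a bucket collection can be viewed as a mapping from bucket names to lists of relative sample paths. The sketch below assumes, purely for illustration, that the .gz file unpickles to such a dict; the real on-disk format and loading code may differ.

import gzip
import pickle
import random

# Illustrative assumption: the bucket file unpickles to a dict mapping
# bucket names to lists of relative sample paths.
with gzip.open("data/carla_leaderboard2/buckets/full_pretrain_buckets.gz", "rb") as f:
    buckets = pickle.load(f)

def draw_balanced_sample(buckets: dict) -> str:
    # Balanced sampling: first pick a bucket uniformly, then a sample within it.
    return random.choice(buckets[random.choice(list(buckets))])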

To build the pretrain buckets, run

python3 scripts/build_buckets_pretrain.py

To build the posttrain buckets, run

python3 scripts/build_buckets_posttrain.py

If everything went well, the dataset directory should look like this

data/carla_leaderboard2
├── buckets
│   ├── full_posttrain_buckets_8_8_8_5.gz
│   └── full_pretrain_buckets.gz
├── data
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45
└── results
    └── Town06_13_result.json

The bucket files can be used on other computers since the file paths in each bucket are relative.
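
For example, an entry from a bucket built on one machine can be resolved on another simply by joining it with the local dataset root (the entry shown is taken from the tree above):

from pathlib import Path

dataset_root = Path("data/carla_leaderboard2")
relative_entry = Path("data/BlockedIntersection/999_Rep-1_Town06_13_route0_12_22_22_34_45")
absolute_path = dataset_root / relative_entry  # resolves on any machine holding the dataset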

Note: A bucket contains all and only the paths of the data samples that are available at bucket-building time. If you later add or delete routes, you need to rebuild the buckets.

3. Build Persistent Data Cache

Raw sensor data (images, LiDAR, RADAR, etc.) requires significant preprocessing before training: decompression, format conversion, and perturbation alignment. The training cache stores the preprocessed, compressed data on disk, eliminating redundant computation and dramatically speeding up data loading. Once built, the cache is reused across training runs, reducing the data-loading bottleneck.

Two types of cache are used (see the sketch after the list):

  • persistent_cache: Stored alongside the dataset, reused across all training sessions. See implementation at PersistentCache.

  • training_session_cache: Temporary cache on local SSD of a cluster job. We use diskcache for this purpose.
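
A minimal sketch of how the two levels might combine at load time; the key scheme and the read_preprocessed helper are hypothetical, and only the diskcache calls reflect the actual library API.

from pathlib import Path
from diskcache import Cache

# Temporary per-job cache on the cluster node's local SSD.
session_cache = Cache("/tmp/training_session_cache")

def read_preprocessed(path: Path) -> bytes:
    # Hypothetical stand-in for reading a preprocessed, compressed sample.
    return path.read_bytes()

def load_sample(key: str, persistent_dir: Path = Path("data/carla_leaderboard2/cache")):
    # First try the fast local session cache, then fall back to the
    # persistent cache stored alongside the dataset.
    sample = session_cache.get(key)
    if sample is None:
        sample = read_preprocessed(persistent_dir / key)
        session_cache.set(key, sample)
    return sample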

To build the cache, run

python3 scripts/build_cache.py

If everything went well, the dataset directory should look like this

data/carla_leaderboard2
├── buckets
│   ├── full_posttrain_buckets_8_8_8_5.gz
│   └── full_pretrain_buckets.gz
├── cache
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45
├── data
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45
└── results
    └── Town06_13_result.json

Note: After changing something in the pipeline (e.g., adding a new semantic class), check whether the cache needs to be rebuilt.

Note: After building the data cache, the pipeline only needs the meta files in data/carla_leaderboard2/data; everything else can be deleted.

Model Training

Following standard practice on CARLA, we train the model in two phases: first only the perception backbone is trained; only after that do we train everything jointly.
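
Conceptually, the two phases differ in which parameters the optimizer updates. The placeholder sketch below illustrates the idea with dummy modules; it is not the pipeline's actual model classes.

import torch.nn as nn
import torch.optim as optim

backbone = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU())  # placeholder perception backbone
planner = nn.Linear(16, 2)                                # placeholder planner head

# Phase 1 (pre-training): only the backbone's parameters are optimized.
pretrain_optimizer = optim.AdamW(backbone.parameters(), lr=1e-4)

# Phase 2 (post-training): the planner sits on top and everything trains jointly.
full_model = nn.ModuleList([backbone, planner])
posttrain_optimizer = optim.AdamW(full_model.parameters(), lr=1e-4)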

1. Perception pre-training

bash scripts/pretrain.sh

The training takes around 1-2 minutes and produces the following structure

outputs/local_training/pretrain
├── clipper_0030.pth
├── config.json
├── events.out.tfevents.1764250874.local.105366.0
├── gradient_steps_skipped_0030.txt
├── model_0030.pth
├── optimizer_0030.pth
├── scaler_0030.pth
└── scheduler_0030.pth
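
The checkpoint files appear to be standard PyTorch state dicts (an assumption based on the .pth suffix and file names); inspecting one manually might look like:

import torch

# Assumption: each .pth file holds a state dict saved via torch.save.
ckpt_dir = "outputs/local_training/pretrain"
model_state = torch.load(f"{ckpt_dir}/model_0030.pth", map_location="cpu")
print(sorted(model_state.keys())[:5])  # peek at the first few parameter names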

To help debug training, the script also regularly produces WandB/TensorBoard logs and images at outputs/training_viz. The frequency can be controlled with log_scalars_frequency and log_images_frequency.

Image logging can be quite expensive; it runs at least once per epoch. To turn it off completely, set visualize_training=false in the training config.

To view the training logs with TensorBoard, run

tensorboard --logdir outputs/local_training/pretrain

We also support WandB; to turn it on, set log_wandb=true in the training config.
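
For reference, the logging-related options mentioned above might sit together in the training config roughly as follows (illustrative values; check the dumped config.json for the actual layout):

log_scalars_frequency=50
log_images_frequency=500
visualize_training=true
log_wandb=false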

2. Post-training

After pre-training, we continue with post-training, where we put the planner on top of the model and train the whole model end-to-end.

Note: The epoch count will be reset back to 0.

bash scripts/posttrain.sh

[Optional] Resume failed training

To resume a failed training run, set continue_failed_training=true.

[Optional] Distributed Training

The pipeline supports PyTorch DDP. An example:

torchrun --standalone --nnodes=1 --nproc_per_node=4 --max_restarts=0 --rdzv_backend=c10d lead/training/train.py
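
For orientation, a torchrun-compatible entry point generally initializes the process group and wraps the model in DDP. This is a generic sketch, not the pipeline's actual lead/training/train.py:

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 2).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # ... run the training loop with a DistributedSampler-backed DataLoader ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()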

Common issues

CARLA server running

A common failure produces the following error message

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

It might come from a CARLA server running in the background and consuming VRAM. To kill CARLA, run

bash scripts/clean_carla.sh
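
If the script is unavailable, killing the server process directly usually works as well (assuming the default CarlaUE4 binary name):

pkill -f CarlaUE4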