2 Training (10 minutes)

This guide covers local model training for CARLA Leaderboard.

Note

This tutorial is specific to training policies for CARLA Leaderboard. For cross-dataset training (NAVSIM, Waymo), see Cross-dataset training. For large-scale training on HPC clusters, refer to the SLURM Training Guide.

2.1 Prerequisites

The following steps need to be performed once per dataset.

2.1.1 Prepare Data

This step assumes that you have either:

  1. Followed the data collection tutorial to collect data locally, or

  2. Downloaded the data from Hugging Face.

Either way, your data structure should match:

data/carla_leaderboard2
├── data
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45

2.1.2 Build Data Buckets

Buckets group training samples by characteristics (scenarios, towns, weather, road curvature, etc.) to enable curriculum learning and balanced batch sampling.

Two default bucket collections are provided: one for pre-training and one for post-training.

Buckets are built once, stored on disk, and automatically reused in subsequent runs.
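The bucket idea can be sketched in a few lines. This is a hypothetical illustration, not the repository's implementation: it groups route directories by scenario name (one of the characteristics mentioned above) and stores relative paths so the resulting file stays portable across machines; `build_scenario_buckets` and `save_buckets` are made-up helper names.

```python
import gzip
import pickle
from pathlib import Path

def build_scenario_buckets(data_root: str) -> dict[str, list[str]]:
    """Group route directories under <data_root>/data by scenario name.

    Paths are stored relative to data_root so the bucket file can be
    moved between machines (hypothetical sketch of the layout above).
    """
    root = Path(data_root)
    buckets: dict[str, list[str]] = {}
    for route_dir in sorted((root / "data").glob("*/*")):
        scenario = route_dir.parent.name  # e.g. "BlockedIntersection"
        buckets.setdefault(scenario, []).append(str(route_dir.relative_to(root)))
    return buckets

def save_buckets(buckets: dict, path: str) -> None:
    # Buckets are built once and reused; gzip keeps the file small on disk.
    with gzip.open(path, "wb") as f:
        pickle.dump(buckets, f)
```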

Build pre-training buckets:

python3 scripts/build_buckets_pretrain.py

Build post-training buckets:

python3 scripts/build_buckets_posttrain.py

Expected output structure:

data/carla_leaderboard2
├── buckets
│   ├── full_posttrain_buckets_8_8_8_5.gz
│   └── full_pretrain_buckets.gz
├── data
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45

Bucket files use relative paths and are portable across machines.

Note

Buckets contain only the data samples available at build time. Rebuild buckets after adding or removing routes.

Note

If you encounter hard-to-debug errors after major code changes, try deleting and rebuilding the buckets.

2.1.3 Build Persistent Data Cache

Raw sensor data (images, LiDAR, RADAR) requires preprocessing—decompression, format conversion, and temporal alignment. The training cache stores preprocessed, compressed data on disk, eliminating redundant computation and accelerating data loading. Once built, the cache is reused across all training runs.

Two cache types:

  • persistent_cache: Stored alongside the dataset, reused across all training sessions. See PersistentCache

  • training_session_cache: Temporary cache on local SSD during cluster jobs, implemented with diskcache. During the first few epochs, data is loaded from shared storage and cached on the job’s local SSD for faster subsequent access. This implementation is specific to our organization’s SLURM cluster setup.
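The persistent cache follows a standard build-once, reuse-everywhere pattern. Below is a minimal sketch of that pattern using only the standard library; `load_or_build` is a hypothetical name, and `build_fn` stands in for the real decompression / format-conversion / temporal-alignment pipeline.

```python
import gzip
import pickle
from pathlib import Path

def load_or_build(cache_dir: str, key: str, build_fn):
    """Return the preprocessed sample for `key`, building it at most once.

    On a cache hit the compressed file on disk is loaded directly; on a
    miss the (expensive) preprocessing runs once and its result is stored
    for all subsequent training runs.
    """
    cache_file = Path(cache_dir) / f"{key}.gz"
    if cache_file.exists():
        with gzip.open(cache_file, "rb") as f:
            return pickle.load(f)  # cache hit: skip preprocessing entirely
    sample = build_fn(key)  # cache miss: preprocess once
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(cache_file, "wb") as f:
        pickle.dump(sample, f)
    return sample
```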

Build the persistent cache:

python3 scripts/build_cache.py

Expected output structure:

data/carla_leaderboard2
├── buckets
│   ├── full_posttrain_buckets_8_8_8_5.gz
│   └── full_pretrain_buckets.gz
├── cache
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45
├── data
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45

Note

Rebuild the cache after modifying the data pipeline (e.g., adding new semantic classes or changing preprocessing steps).

2.2 Perception Pre-training

Following standard TransFuser-like training procedures, training occurs in two phases: first, train only the perception backbone, then train the complete model end-to-end.

python3 lead/training/train.py logdir=outputs/local_training/pretrain

Training takes approximately 1-2 minutes when training on only a single route and produces:

outputs/local_training/pretrain
├── config.json
├── events.out.tfevents.1764250874.local.105366.0
├── gradient_steps_skipped_0030.txt
├── model_0030.pth
├── optimizer_0030.pth
├── scaler_0030.pth
└── scheduler_0030.pth

The training script generates WandB/TensorBoard logs and visualization images at outputs/training_viz. Control logging frequency with log_scalars_frequency and log_images_frequency in lead/training/config_training.py.

Image logging runs at least once per epoch and can be expensive. Disable it by setting visualize_training=false in the config.
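As a sketch, the logging-related fields might look like the following. The field names (log_scalars_frequency, log_images_frequency, visualize_training, log_wandb) come from this guide; the defaults, the dataclass structure, and the helper are illustrative assumptions, not the actual contents of lead/training/config_training.py.

```python
from dataclasses import dataclass

@dataclass
class TrainingLoggingConfig:
    # Field names follow this guide; default values are illustrative only.
    log_scalars_frequency: int = 50   # gradient steps between scalar logs
    log_images_frequency: int = 500   # gradient steps between image logs
    visualize_training: bool = True   # set False to skip expensive image logging
    log_wandb: bool = False           # WandB logging is opt-in

def should_log_images(cfg: TrainingLoggingConfig, step: int) -> bool:
    # Hypothetical helper: image logging is skipped entirely when disabled.
    return cfg.visualize_training and step % cfg.log_images_frequency == 0
```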

View training logs with TensorBoard:

tensorboard --logdir outputs/local_training/pretrain

WandB logging is also supported. Enable it by setting log_wandb=true in the training config.

2.3 Planning Post-training

Note

During post-training, the epoch count resets to 0. The optimizer state is reinitialized because the planner was not included in the pre-training checkpoint.

After pre-training completes, continue with post-training to add the planner and train the complete model end-to-end:

python3 lead/training/train.py logdir=outputs/local_training/posttrain load_file=outputs/local_training/pretrain/model_0030.pth use_planning_decoder=true

2.4 Resume Failed Training

To continue from a failed training run, set continue_failed_training=true in the training config.

2.5 Distributed Training

The pipeline supports PyTorch DistributedDataParallel (DDP); see the example scripts in the repository.
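A typical single-node DDP launch might look like the following. This is an assumption based on the standard torchrun workflow, not a command taken from the repository's example scripts, which may wrap the launch differently.

```shell
# Hypothetical single-node launch: torchrun spawns one training process
# per GPU (here 4) and sets up the DDP process group automatically.
torchrun --standalone --nproc_per_node=4 \
    lead/training/train.py logdir=outputs/local_training/pretrain_ddp
```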

2.6 Common Issues

2.6.1 CARLA Server Running in Background

This error may occur:

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

This typically indicates a CARLA server is consuming VRAM in the background. Kill all CARLA processes:

bash scripts/clean_carla.sh

2.6.2 Unknown GPU Name: <gpu_name>

Register your GPU in the training configuration:

Step 1: Add your GPU name to the gpu_name function in lead/training/config_training.py.

Step 2: If your GPU supports bf16 (bfloat16), add it to both use_mixed_precision_training and use_gradient_scaler functions in the same file.

Why explicit registration? Mixed precision training (BF16) is opt-in rather than automatic. On some older GPUs like RTX 2080 Ti, BF16 can degrade training performance, so we require explicit configuration.
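The registration pattern described above can be sketched as follows. The function names gpu_name, use_mixed_precision_training, and use_gradient_scaler come from this guide, but the set-based lookup, the example GPU strings, and the return logic are illustrative assumptions; the real functions in lead/training/config_training.py may differ.

```python
# Illustrative registration tables; add your own GPU's name strings here.
KNOWN_GPUS = {"NVIDIA GeForce RTX 2080 Ti", "NVIDIA A100-SXM4-40GB"}
BF16_GPUS = {"NVIDIA A100-SXM4-40GB"}  # GPUs where bf16 is enabled

def gpu_name(name: str) -> str:
    # Unregistered GPUs fail loudly rather than silently picking defaults.
    if name not in KNOWN_GPUS:
        raise ValueError(f"Unknown GPU name: {name}. Add it to KNOWN_GPUS.")
    return name

def use_mixed_precision_training(name: str) -> bool:
    # bf16 is opt-in: on some older GPUs (e.g. RTX 2080 Ti) it can
    # degrade training performance, so only listed GPUs use it.
    return name in BF16_GPUS

def use_gradient_scaler(name: str) -> bool:
    # Sketch only: mirrors the same per-GPU registration; the actual
    # scaler policy in the real config may differ.
    return name in BF16_GPUS
```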

2.6.3 Unknown CARLA Root Path: <carla_root>

Register your CARLA dataset configuration in the target_dataset function in lead/training/config_training.py. Map your CARLA root path to the appropriate dataset configuration, which specifies the expected sensor setup and data format.

Why this design? This allows running training experiments while simultaneously collecting data with different sensor configurations. The target_dataset in lead/expert/config_expert.py controls which sensors are mounted on the expert vehicle during data collection.
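The mapping described above might be sketched like this. target_dataset is the name used by this guide; the dictionary keys, the configuration fields, and the path-matching logic are hypothetical stand-ins for whatever the real function in lead/training/config_training.py does.

```python
from pathlib import Path

# Illustrative mapping from a CARLA root directory name to a dataset
# configuration describing the expected sensor setup and data format.
DATASET_CONFIGS = {
    "carla_leaderboard2": {"cameras": 1, "lidar": True, "radar": True},
}

def target_dataset(carla_root: str) -> dict:
    key = Path(carla_root).name
    try:
        return DATASET_CONFIGS[key]
    except KeyError:
        # Unregistered paths fail loudly, matching the error in the title.
        raise ValueError(f"Unknown CARLA root path: {carla_root}") from None
```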