# Training

This guide covers training models on CARLA data for the CARLA Leaderboard.

> **Note**: For training on cross-datasets (NavSim, Waymo), see [Cross-Dataset Training](cross_dataset_training.md).

## Prerequisites

Each of the following steps has to be done only once.

### 1. Prepare Data

We will upload a dataset soon. Stay tuned! In the meantime, follow this [tutorial](https://ln2697.github.io/lead/docs/data_collection.html) to collect data. If the data was collected locally, run

```bash
cp -r data/expert_debug data/carla_leaderboard2
```

### 2. Build Data Buckets

Buckets group training samples by characteristics (e.g., scenarios, towns, weather, road curvature, etc.) to enable curriculum learning and balanced batch sampling. By default we use [full_pretrain_bucket_collection](https://github.com/autonomousvision/lead/blob/main/lead/training/data_loader/buckets/full_pretrain_bucket_collection.py) for pre-training and [full_posttrain_bucket_collection](https://github.com/autonomousvision/lead/blob/main/lead/training/data_loader/buckets/full_posttrain_bucket_collection.py) for post-training, i.e., we train uniformly on all samples.

Buckets are built once and stored on disk in the dataset directory. In subsequent runs they are reused automatically, which saves time.

To build the pre-training buckets, run

```bash
python3 scripts/build_buckets_pretrain.py
```

To build the post-training buckets, run

```bash
python3 scripts/build_buckets_posttrain.py
```

If everything went well, the output should look like this

```html
data/carla_leaderboard2
├── buckets
│   ├── full_posttrain_buckets_8_8_8_5.gz
│   └── full_pretrain_buckets.gz
├── data
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45
└── results
    └── Town06_13_result.json
```

The bucket files can be used on other computers since the file paths in each bucket are relative.

**Note:** A bucket contains all and only the paths of data samples that are available at bucket building time. If you later add or delete routes, you need to rebuild the buckets.

### 3. Build Persistent Data Cache

Raw sensor data (images, LiDAR, RADAR, etc.) requires significant preprocessing before training: decompression, format conversion, and perturbation alignment. The training cache stores preprocessed and compressed data on disk, eliminating redundant computation and dramatically speeding up data loading. Once built, the cache is reused across training runs, reducing the data loading bottleneck.

Two types of cache are used:

- **`persistent_cache`**: Stored alongside the dataset, reused across all training sessions. See the implementation at [PersistentCache](https://github.com/autonomousvision/lead/blob/main/lead/training/data_loader/carla_dataset_utils.py).
- **`training_session_cache`**: Temporary cache on the local SSD of a cluster job. We use [diskcache](https://pypi.org/project/diskcache/) for this purpose (see the sketch below).

To build the cache, run

```bash
python3 scripts/build_cache.py
```

If everything went well, the output should look like this

```html
data/carla_leaderboard2
├── buckets
│   ├── full_posttrain_buckets_8_8_8_5.gz
│   └── full_pretrain_buckets.gz
├── cache
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45
├── data
│   └── BlockedIntersection
│       └── 999_Rep-1_Town06_13_route0_12_22_22_34_45
└── results
    └── Town06_13_result.json
```

**Note:** After changing something in the pipeline (e.g., adding a new semantic class), check whether the cache needs to be rebuilt.
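For intuition, here is a minimal sketch of the kind of key-value caching [diskcache](https://pypi.org/project/diskcache/) provides. It is an illustrative example only, not the repository's actual `PersistentCache` or `training_session_cache` implementation; the cache path, key, and preprocessing function are placeholders.

```python
"""Illustrative diskcache usage: preprocess a sample once, reload it cheaply afterwards."""
import numpy as np
from diskcache import Cache


def preprocess_sample(sample_id: str) -> np.ndarray:
    """Placeholder for the expensive decode / format-conversion / alignment step."""
    return np.zeros((3, 256, 256), dtype=np.float32)


# Hypothetical cache directory, e.g. on the local SSD of a cluster job.
cache = Cache("/tmp/training_session_cache")

sample_id = "BlockedIntersection/route0/frame_0000"  # hypothetical key
if sample_id not in cache:
    # First access: run the expensive preprocessing and store the result on disk.
    cache[sample_id] = preprocess_sample(sample_id)
# Subsequent accesses read the preprocessed array straight from the cache.
sample = cache[sample_id]
```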
**Note:** After building the data cache, the pipeline only needs the meta files in `data/carla_leaderboard2/data`; everything else can be deleted.

## Model Training

Following standard practice on CARLA, we train the model in two phases: first only the perception backbone is trained, and only after that do we train everything jointly.

### 1. Perception pre-training

```bash
bash scripts/pretrain.sh
```

The training takes around 1-2 minutes and produces the following structure

```html
outputs/local_training/pretrain
├── clipper_0030.pth
├── config.json
├── events.out.tfevents.1764250874.local.105366.0
├── gradient_steps_skipped_0030.txt
├── model_0030.pth
├── optimizer_0030.pth
├── scaler_0030.pth
└── scheduler_0030.pth
```

To debug training, the script also regularly produces WandB/TensorBoard logs and images at `outputs/training_viz`. The frequency can be controlled with `log_scalars_frequency` and `log_images_frequency`. Image logging can be quite expensive; it runs at least once per epoch. To turn it off completely, set `visualize_training=false` in the training config.

To monitor training with TensorBoard, run

```bash
tensorboard --logdir outputs/local_training/pretrain
```

We also support WandB; to turn it on, set `log_wandb=true` in the training config.

### 2. Post-training

> **Note**: The epoch count will be reset back to 0.

After pre-training, we continue with post-training, where we put the planner on top of the model and train the whole model end-to-end.

```bash
bash scripts/posttrain.sh
```

### [Optional] Resume failed training

To continue a failed training run, set `continue_failed_training=true`.

### [Optional] Distributed Training

The pipeline supports [Torch DDP](https://docs.pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html). An example:

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=4 --max_restarts=0 --rdzv_backend=c10d lead/training/train.py
```

## Common issues

### CARLA server running

A common error shows up with the following error message

```bash
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
```

It may be caused by a CARLA server running in the background and eating VRAM. To kill CARLA, run

```bash
bash scripts/clean_carla.sh
```
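To verify that the GPU memory has actually been freed before relaunching training, a quick check such as the following can help. This is a minimal sketch assuming PyTorch and a CUDA-capable GPU; it is not part of the repository's scripts.

```python
"""Illustrative check that GPU memory is free before launching training."""
import torch

# Free and total memory (in bytes) of the current CUDA device.
free, total = torch.cuda.mem_get_info()
print(f"Free GPU memory: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
```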