Evaluation

This guide covers basic local evaluation for getting started. For large-scale evaluation (running many routes in parallel), see the SLURM Evaluation Guide.

Overview

We showed in the Quick Start tutorial how to evaluate a trained policy on a single Bench2Drive route. We currently provide scripts to evaluate the three most popular benchmarks locally: Bench2Drive, Longest6 v2, and Town13.

Running Evaluations

1. Start CARLA Server

Before running any evaluation, start the CARLA server:

bash scripts/start_carla.sh
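
Before launching an evaluation, you can verify that the server is actually accepting connections. A minimal sketch, assuming the server listens on CARLA's default RPC port 2000 (adjust if your start script uses a different port):

# Wait until something is listening on the default CARLA RPC port (2000)
until bash -c 'echo > /dev/tcp/localhost/2000' 2>/dev/null; do sleep 1; done
echo "CARLA server is up"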

2. Customize Checkpoint and Route

Each evaluation script contains these key variables you can modify:

export BENCHMARK_ROUTE_ID=23687  # Route ID to evaluate
export CHECKPOINT_DIR=outputs/checkpoints/tfv6_resnet34/  # Path to model checkpoint

For Bench2Drive, route IDs range from 0-219. For Longest6 v2, use 00-35. For Town13, use 0-19.
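
For example, to evaluate Longest6 v2 route 07 with a different checkpoint (the checkpoint path below is illustrative, not a file shipped with the repo):

export BENCHMARK_ROUTE_ID=07                               # a Longest6 v2 route
export CHECKPOINT_DIR=outputs/checkpoints/my_model/        # illustrative path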

3. Configuration Options

Evaluation behavior is controlled by config_closed_loop.py. Key settings:

  • produce_demo_video - Generate bird’s-eye view visualization videos

  • produce_debug_video - Generate detailed debug videos with sensor data

  • produce_demo_image - Save individual demo frames

  • produce_debug_image - Save individual debug frames

Turn off video generation for faster evaluation:

# In your environment or config override
produce_demo_video = False
produce_debug_video = False

The evaluation configuration can be overridden for each process individually via the environment variable LEAD_CLOSED_LOOP_CONFIG.
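
A hypothetical usage example, assuming the variable takes the path to an alternative config file (check your setup for the exact value it expects):

# Point one evaluation process at a custom closed-loop config (path is hypothetical)
export LEAD_CLOSED_LOOP_CONFIG=/path/to/my_config_closed_loop.py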

4. Output Structure

Each evaluation creates:

outputs/local_evaluation/<route_id>/
├── checkpoint_endpoint.json      # Metrics and results
├── metric_info.json              # Detailed evaluation metrics
├── demo_images/                  # Bird's-eye view frames
├── debug_images/                 # Debug visualization frames
└── debug_checkpoint/             # Debug checkpoints

If video generation is enabled:

outputs/local_evaluation/
├── <route_id>_demo.mp4          # Bird's-eye view video
└── <route_id>_debug.mp4         # Debug video with sensor data
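
To quickly inspect the per-route metrics after a run (file names as in the tree above; the exact JSON schema depends on your leaderboard version):

# Pretty-print the per-route results; replace <route_id> with the evaluated route
python3 -m json.tool outputs/local_evaluation/<route_id>/checkpoint_endpoint.json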

Summarize Leaderboard 2.0 Results for Longest6 v2 and Town13

After completing all routes in a benchmark, aggregate results using the result parser:

python3 scripts/tools/result_parser.py \
    --xml data/benchmark_routes/bench2drive220.xml \
    --results outputs/local_evaluation/

This generates a summary CSV with:

  • Driving score

  • Route completion percentage

  • Infraction breakdown (collisions, traffic violations, etc.)

  • Per-kilometer statistics
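
The same parser can be pointed at whichever route file you evaluated. The XML name below is a placeholder; use the actual route file shipped in data/benchmark_routes/ for your benchmark:

# Replace the placeholder with the route XML matching the evaluated benchmark
python3 scripts/tools/result_parser.py \
    --xml data/benchmark_routes/<benchmark_routes>.xml \
    --results outputs/local_evaluation/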

Summarize Bench2Drive Results

Bench2Drive provides additional metrics beyond the official Leaderboard 2.0 metrics used for Longest6 v2 and Town13. See the official guide.

The tools for Bench2Drive in our repo can be found here.

Best Practices

Turn off video and image production for Longest6 v2 and Town13: Even with substantial compute (16-32 GTX 1080 Ti GPUs), evaluation can take up to 1 day for 3 seeds on Longest6 v2 and up to 2 days for 3 seeds on Town13, although about 90% of the routes finish within a few hours.

Restart CARLA between routes: Running multiple routes on the same CARLA instance can lead to rendering bugs (an example image is shown in the Bench2Drive paper).

bash scripts/clean_carla.sh  # Kill CARLA processes
bash scripts/start_carla.sh  # Restart fresh instance
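
A sketch of a per-route loop that restarts CARLA each time. It assumes start_carla.sh returns after launching the server in the background; run_evaluation.sh is a placeholder for whichever evaluation script you use:

for ROUTE_ID in 0 1 2; do                  # route IDs for the chosen benchmark
    bash scripts/clean_carla.sh            # kill any leftover CARLA processes
    bash scripts/start_carla.sh            # start a fresh server
    sleep 30                               # give the server time to come up
    export BENCHMARK_ROUTE_ID=$ROUTE_ID
    bash scripts/run_evaluation.sh         # placeholder: your evaluation script
done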

Memory management: By default, the pipeline loads all three checkpoint seeds as an ensemble. If memory is limited, rename two of the checkpoint files so only one seed loads.
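
A hypothetical example (the actual checkpoint file names depend on your training run):

# Temporarily disable two of the three seed checkpoints so only one is loaded
mv outputs/checkpoints/tfv6_resnet34/model_seed2.pth{,.disabled}   # file name is hypothetical
mv outputs/checkpoints/tfv6_resnet34/model_seed3.pth{,.disabled}   # file name is hypothetical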

Use the correct leaderboard and scenario_runner: Longest6 v2 and Town13 should be evaluated with the standard leaderboard setup, while Bench2Drive must be evaluated with the code from their repo; otherwise the results are not valid.

Evaluation variance: CARLA is highly stochastic, even with fixed seeds. Results can vary significantly between runs due to traffic randomness and other non-deterministic factors. Our recommended evaluation protocol:

  • Minimum (standard practice): Train 3 models with different seeds, evaluate each once → 3 evaluation runs total

  • Optimal (for publications): Train 3 models with different seeds, evaluate each 3 times → 9 evaluation runs total

We use the minimum protocol in our group.