LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving

1University of Tübingen, Tübingen AI Center 2NVIDIA Research
Code | Documentation | Supplementary | arXiv

Video 1: Qualitative results in dense traffic, complex urban scenarios, and adverse weather conditions. The driving agent operates on roof-mounted cameras, LiDAR, and radar inputs.

Abstract

Simulators can generate virtually unlimited driving data, yet imitation learning policies still struggle to achieve robust closed-loop performance. We study how misalignment between privileged expert demonstrations and sensor-based student observations limits the effectiveness of imitation learning in CARLA. We further show that navigation intent, when introduced late or in isolation, limits the policy’s ability to balance goal following with scene understanding. By systematically reducing learner–expert asymmetries and revisiting how navigation intent is specified in end-to-end policies, we show that better alignment leads to substantially improved closed-loop driving performance.

Summary

Our study examines imitation learning for simulation-based end-to-end driving, following Learning by Cheating (Figure 1), where a privileged teacher with access to the full world state produces demonstrations that a student, operating from sensory inputs alone, must imitate.


Figure 1: Learning by Cheating. Top: Privileged teacher. Bottom: Sensor-based student.
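To make this setup concrete, here is a minimal sketch of the Learning by Cheating recipe in PyTorch. The class and function names (PrivilegedTeacher, SensorStudent, imitation_step) are illustrative placeholders, not the actual LEAD code.

```python
# Minimal sketch of the Learning by Cheating (LbC) recipe: a privileged
# teacher drives from ground-truth simulator state, and a sensor-based
# student is trained to imitate the teacher's recorded actions.
# All names below are illustrative, not taken from the LEAD codebase.

import torch
import torch.nn as nn

class PrivilegedTeacher(nn.Module):
    """Maps ground-truth world state (poses, velocities, map) to actions."""
    def __init__(self, state_dim=256, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, world_state):
        return self.net(world_state)

class SensorStudent(nn.Module):
    """Maps encoded sensor features (camera/LiDAR/radar) to actions."""
    def __init__(self, obs_dim=512, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, sensor_obs):
        return self.net(sensor_obs)

def imitation_step(student, optimizer, sensor_obs, teacher_action):
    """One behavior-cloning update: regress the teacher's action from sensors."""
    pred = student(sensor_obs)
    loss = nn.functional.l1_loss(pred, teacher_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this two-stage recipe the teacher is trained (or scripted) first against the full simulator state, and the student only ever sees the teacher's outputs paired with its own sensor observations; the asymmetries discussed below arise precisely from that pairing.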

Key Observation 1: Strong task performance does not guarantee effective supervision. Figure 2 illustrates two mismatches that make strong expert demonstrations surprisingly hard to imitate. With visibility asymmetry, the expert reacts to actors that the student cannot see, producing behaviors that appear arbitrary from the student’s perspective. With uncertainty asymmetry, the expert relies on perfect state information (such as exact velocities), allowing aggressive maneuvers with little safety margin that the student cannot reproduce. Our approach mitigates both by limiting expert privileges and ensuring demonstrations are grounded in the student’s sensor view.


Figure 2: Left: expert behavior that depends on hidden information leads to confusing and unsafe demonstrations. Right: grounding the expert in the student's sensors results in more consistent and safer behavior. The first row also demonstrates a positive side effect: since the expert now has less information, it makes the same mistake as the student (taking an unsafe gap) and recovers from it (negotiation), which can be beneficial for student learning.
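As a rough illustration of what limiting expert privileges could look like, the sketch below filters the privileged world state to actors inside the student's sensing range and field of view and perturbs the exact velocities the expert would otherwise exploit. The thresholds, data layout, and function names are hypothetical; the paper's exact mechanism may differ.

```python
# Hypothetical sketch of reducing visibility and uncertainty asymmetry:
# the expert only sees actors inside the student's sensor coverage and
# receives coarsened (noisy) velocity estimates instead of exact ones.

import numpy as np

def in_sensor_fov(actor_xy, ego_xy, ego_yaw_deg, max_range=50.0, fov_deg=360.0):
    """Check whether an actor lies within the student's sensing range and FOV."""
    rel = np.asarray(actor_xy) - np.asarray(ego_xy)
    if np.linalg.norm(rel) > max_range:
        return False
    angle = np.degrees(np.arctan2(rel[1], rel[0])) - ego_yaw_deg
    angle = (angle + 180.0) % 360.0 - 180.0  # wrap to [-180, 180]
    return abs(angle) <= fov_deg / 2.0

def ground_expert_state(actors, ego_xy, ego_yaw_deg, occluded_ids, vel_noise=0.5):
    """Keep only actors the student could plausibly perceive, and blur their
    velocities so the expert cannot plan with perfect state information."""
    grounded = []
    for a in actors:
        if a["id"] in occluded_ids:
            continue  # hidden behind geometry -> the student cannot see it
        if not in_sensor_fov(a["xy"], ego_xy, ego_yaw_deg):
            continue  # outside sensing range -> drop it from the expert's view
        noisy = dict(a)
        noisy["velocity"] = a["velocity"] + np.random.normal(0.0, vel_noise)
        grounded.append(noisy)
    return grounded
```

The expert then plans on this filtered state, so its demonstrations only react to information the student could, in principle, also perceive.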

Key Observation 2: Navigating complex scenarios requires sufficient goal information. Figure 3 illustrates intent asymmetry, where the student must rely on sparse goal cues while the expert has access to dense route information. Without knowing what lies ahead, the policy reacts too late in challenging situations such as lane changes or turns, leading to avoidable failures. By conditioning on multiple goal points along the route, the policy gains foresight, anticipates upcoming maneuvers, and significantly reduces these failure modes.


Figure 3: Left: with sparse goal information, the policy lacks foresight and reacts only at the last moment, which leads to oversteering and a crash. Right: a clearer picture of the intended path allows the policy to position itself early and handle the maneuver cleanly.
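One simple way to supply this denser intent signal, sketched below under our own assumptions, is to sample several goal points at fixed arc-length spacings along the remaining route and pass them to the policy as conditioning; the spacings and function names are illustrative, not the paper's.

```python
# Illustrative sketch: instead of a single sparse goal, sample several goal
# points at fixed arc-length spacings along the remaining route and feed
# them to the policy as conditioning input. Spacings are made-up values.

import numpy as np

def sample_goal_points(route_xy, ego_xy, spacings=(5.0, 10.0, 20.0, 40.0)):
    """route_xy: (N, 2) array of route waypoints ahead of the vehicle.
    Returns one goal point per requested arc-length spacing (meters),
    expressed relative to the ego position."""
    route_xy = np.asarray(route_xy, dtype=np.float64)
    ego_xy = np.asarray(ego_xy, dtype=np.float64)
    # Cumulative arc length along the route, starting at the ego vehicle.
    segs = np.linalg.norm(np.diff(route_xy, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(segs)])
    goals = []
    for s in spacings:
        idx = int(np.searchsorted(arc, min(s, arc[-1])))
        idx = min(idx, len(route_xy) - 1)
        goals.append(route_xy[idx] - ego_xy)  # ego-relative goal point
    return np.stack(goals)  # shape (len(spacings), 2)
```

Because the farthest goal points already reveal upcoming turns or lane changes, the policy can begin positioning itself well before the maneuver becomes unavoidable.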

Key Observation 3: How goals are integrated into the model affects driving behavior. Figure 4 shows that providing detailed guidance alone is not enough. When goal information is injected late and in isolation, the policy overreacts to individual goal points, to the point of following trajectories toward unreasonable targets (Figure 4, left). By integrating goal information together with scene context throughout the network, the policy instead learns to respect the road structure and safely ignore dangerous goal points (Figure 4, right).


Figure 4: Trajectories are color-matched to their target points. Left: the policy is highly sensitive to the location of goal points. Some goal points are physically unreachable given the road layout, but a poorly conditioned policy still attempts to follow them, causing the trajectory to break down and deviate from the road. Right: the policy interprets goal points in context, follows the road structure, and ignores goals that cannot be reached safely.
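To illustrate the difference between late, isolated goal injection and conditioning that is integrated with scene context, the toy PyTorch sketch below contrasts a model that fuses a goal embedding only at the output head with one that processes goal tokens jointly with scene tokens in every encoder block. This is our own illustration, not the paper's architecture.

```python
# Illustrative contrast between "late" and "integrated" goal conditioning.
# Late: the goal embedding is fused only at the output head, so the scene
# encoder never sees it. Integrated: goal tokens are mixed with scene
# tokens in every encoder block, so goals are interpreted in context.

import torch
import torch.nn as nn

class LateGoalPolicy(nn.Module):
    def __init__(self, d=128, n_layers=4):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), n_layers)
        self.goal_mlp = nn.Linear(2, d)
        self.head = nn.Linear(2 * d, 2)  # goal meets the scene only here

    def forward(self, scene_tokens, goal_xy):
        scene = self.encoder(scene_tokens).mean(dim=1)   # (B, d)
        goal = self.goal_mlp(goal_xy)                    # (B, d)
        return self.head(torch.cat([scene, goal], dim=-1))

class IntegratedGoalPolicy(nn.Module):
    def __init__(self, d=128, n_layers=4):
        super().__init__()
        self.goal_mlp = nn.Linear(2, d)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), n_layers)
        self.head = nn.Linear(d, 2)

    def forward(self, scene_tokens, goal_points_xy):
        # Goal points become extra tokens processed jointly with the scene
        # in every block, so the policy can weigh goals against road
        # structure instead of following them blindly.
        goal_tokens = self.goal_mlp(goal_points_xy)            # (B, K, d)
        tokens = torch.cat([scene_tokens, goal_tokens], dim=1)  # (B, N+K, d)
        fused = self.encoder(tokens)
        return self.head(fused[:, -goal_tokens.shape[1]:].mean(dim=1))
```

In the integrated variant, attention between goal and scene tokens lets the network discount goal points that contradict the observed road layout, matching the behavior shown on the right of Figure 4.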

Quantitative Results

We evaluate TFv6 on the Bench2Drive benchmark (Table 1), which consists of 220 diverse routes across multiple CARLA towns with challenging weather conditions and dense traffic scenarios.

Method        Driving Score  Success Rate  Merge  Overtake  Emergency Brake  Give Way  Traffic Sign
TF++ (TFv5)   84.21          67.27         58.75  57.77     83.33            40.00     82.11
SimLingo      85.07          67.27         54.01  57.04     88.33            53.33     82.45
R2SE          86.28          69.54         53.33  61.25     90.00            50.00     84.21
HiP-AD        86.77          69.09         50.00  84.44     83.33            40.00     72.10
BridgeDrive   86.87          72.27         63.50  57.77     83.33            40.00     82.11
DiffRefiner   87.10          71.40         63.80  60.00     85.00            50.00     86.30
TFv6 (Ours)   95.28          86.80         72.50  97.77     91.66            40.00     89.47

Table 1: Quantitative results on Bench2Drive benchmark. Higher is better for all metrics.

Qualitative Results

We present qualitative examples that illustrate how the identified failure modes manifest in practice, and how our proposed fixes change the resulting behavior. Red rows show sub-optimal behavior before alignment, while green rows show the corrected behavior after applying our approach.


Figure 5: Top: The policy reacts too late to the sharp turn, misses the exit, and tries to recover with unsafe lane changes. Bottom: With clear route guidance, the policy anticipates the turn and smoothly changes lanes to take the exit safely.


Figure 6: Top: Trained on aggressive expert behavior with minimal safety margin, the policy brakes too late and collides with a pedestrian in clear conditions. Bottom: Trained on sensor-grounded behavior, the policy brakes early, stops in time, and waits until it is safe to proceed.

Demonstrations

Each video below presents a single, uncut rollout of up to one hour of continuous driving without any human intervention or reset. The policy operates robustly in dense traffic, through frequent intersections, among dynamic agents, over long horizons, and in adverse weather conditions, using only noisy onboard sensor inputs.


Highlight Result

We showcase the policy in severely degraded visibility conditions, including a dark environment with a high level of occlusion, as well as narrow and highly curved roads. Several obstacles force the vehicle to stop frequently, wait for a safe gap, and drive on the wrong side of the road to navigate the scene safely.


Further Results

We provide extended qualitative results on the Longest6 v2 benchmark, featuring uninterrupted driving sequences spanning several minutes in both good and adverse weather conditions.


Uncut Results

To highlight long-horizon robustness, we also provide videos of randomly selected routes in Town13, each with almost one hour of driving. This benchmark represents a regime in which small errors compound over time and typically lead to failure.

BibTeX

@article{Nguyen2025ARXIV,
  title={LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving},
  author={Nguyen, Long and Fauth, Micha and Jaeger, Bernhard and Dauner, Daniel and Igl, Maximilian and Geiger, Andreas and Chitta, Kashyap},
  journal={arXiv preprint arXiv:2512.20563},
  year={2025}
}

Acknowledgements

Bernhard Jaeger and Andreas Geiger were supported by the ERC Starting Grant LEGO-3D (850533) and the DFG EXC number 2064/1 - project number 390727645. Daniel Dauner was supported by the German Federal Ministry for Economic Affairs and Climate Action within the project NXT GEN AI METHODS (19A23014S). We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Bernhard Jaeger, Daniel Dauner, and Kashyap Chitta. This research used compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG as well as the Training Center for Machine Learning (TCML). We also thank Lara Pollehn and Simon Gerstenecker for helpful discussions.