Neural Rendering for AV Simulation — Manoj Bhat (Preview)

Simulation is the backbone of AV development. You can't test a software update on 10 million real miles every sprint — you simulate them. But classical simulation (hand-crafted 3-D assets, rule-based sensor models) has a fundamental gap: the simulated world doesn't look like the real world, and perception models trained or evaluated in simulation don't transfer cleanly.

Neural rendering — specifically Neural Radiance Fields (NeRF) and its derivatives — offers a different deal: reconstruct a photorealistic, controllable scene representation directly from real sensor logs. At Rivian we ran a hackathon exploring this for our autonomy stack, which resulted in a patent. This post is a technical write-up of what we built, what worked, and what surprised us.

Why simulation fidelity matters so much

Consider a simple example: a perception model that detects vehicles. In classical simulation, vehicles are rendered as textured 3-D meshes under an approximated sensor model. In the real world, LiDAR returns are affected by surface reflectivity, material properties, raindrops, and sensor-specific scan patterns, none of which the classical model captures faithfully.

Neural rendering directly sidesteps the asset pipeline. Instead of modeling scenes analytically, it fits a continuous function:

$$f_\theta: (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$$

mapping a 3-D point $\mathbf{x}$ and view direction $\mathbf{d}$ to color $\mathbf{c}$ and volume density $\sigma$. Volume rendering then integrates along camera rays to produce photorealistic images.
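The integral is usually approximated by sampling points along each ray and alpha-compositing them. The sketch below shows that standard quadrature for a single ray in NumPy. It is an illustration rather than the prototype's code, and the function name, argument layout, and toy values are assumptions made for this example.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Composite per-sample densities and colors along one ray.

    sigmas: (N,) non-negative densities at the sample points
    colors: (N, 3) RGB values in [0, 1] at the sample points
    t_vals: (N + 1,) increasing distances bounding the N sample intervals
    """
    deltas = np.diff(t_vals)                     # interval lengths, (N,)
    alphas = 1.0 - np.exp(-sigmas * deltas)      # opacity of each interval
    # Transmittance: fraction of light that reaches interval i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas                     # contribution of each sample
    rgb = weights @ colors                       # expected color, (3,)
    return rgb, weights

# Toy ray: two empty intervals, then a dense red surface.
t_vals = np.linspace(0.0, 4.0, 5)
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
rgb, weights = render_ray(sigmas, colors, t_vals)
```

In this toy case almost all of the weight lands on the first dense interval, so the rendered color is essentially pure red and the blue samples in empty space contribute nothing.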

The AV-specific challenges

Dynamic agents

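A parked truck is fine. A moving cyclist is a disaster: it occupies a different position in each frame, which a single static field reproduces as ghosting artifacts. We decompose the scene into a static background NeRF and per-object NeRFs, each expressed in its own object-centric coordinate frame. With $R_k(t)$ and $\mathbf{t}_k(t)$ denoting the rotation and translation of object $k$'s tracked pose at time $t$, a world point maps into that frame as: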

$$\mathbf{x}_{\text{obj}} = R_k(t)^\top (\mathbf{x}_{\text{world}} - \mathbf{t}_k(t))$$
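A small NumPy sketch of this transform, assuming yaw-only poses for brevity (real tracks would carry full 3-D rotations). The helper names and the two example poses are hypothetical. The point it illustrates: a sample that is fixed on the object lands at different world positions over time but at the same object-frame coordinates, which is what lets a per-object field stay static.

```python
import numpy as np

def yaw_rotation(yaw):
    """Rotation about the z-axis (object -> world) for a heading in radians."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def world_to_object(x_world, R, t):
    """x_obj = R^T (x_world - t), applied to row-vector points of shape (N, 3)."""
    return (x_world - t) @ R

def object_to_world(x_obj, R, t):
    """Inverse mapping: x_world = R x_obj + t."""
    return x_obj @ R.T + t

# A point 2 m ahead of the object's origin, fixed in the object frame.
p_obj = np.array([[2.0, 0.0, 0.0]])

# Two hypothetical tracked poses of the same cyclist at different times.
pose_a = (yaw_rotation(0.0), np.array([10.0, 5.0, 0.0]))
pose_b = (yaw_rotation(np.pi / 2), np.array([10.7, 5.3, 0.0]))

world_a = object_to_world(p_obj, *pose_a)   # where the point is at time a
world_b = object_to_world(p_obj, *pose_b)   # where the point is at time b
back_a = world_to_object(world_a, *pose_a)  # both map back to p_obj
back_b = world_to_object(world_b, *pose_b)
```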

Sparse viewpoint coverage

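A camera driving down a road sees each scene element from a narrow cone of viewpoints, which leaves geometry under-constrained. We regularize with a depth prior from LiDAR, penalizing the gap between the rendered expected depth $\hat{d}$ and the measured LiDAR range $d_\text{LiDAR}$, where $T(t) = \exp\left(-\int_0^t \sigma(s)\,ds\right)$ is the accumulated transmittance along the ray: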

$$\mathcal{L}_\text{depth} = \left\| \hat{d} - d_\text{LiDAR} \right\|_1, \quad \hat{d} = \int_0^\infty t \cdot T(t)\sigma(t)\,dt$$
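As a rough illustration of this loss, the sketch below discretizes the expected-depth integral with the same compositing weights used for color and compares it with a LiDAR range. The sampling scheme, function names, and the synthetic "wall at 12 m" ray are assumptions for the example, not details of the prototype.

```python
import numpy as np

def expected_depth(sigmas, t_vals):
    """Discrete estimate of d_hat = integral of t * T(t) * sigma(t) dt for one ray."""
    deltas = np.diff(t_vals)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    t_mid = 0.5 * (t_vals[:-1] + t_vals[1:])     # representative depth per interval
    return np.sum(weights * t_mid)

def lidar_depth_loss(pred_depths, lidar_depths):
    """L1 norm of the depth residual over rays that have a LiDAR return."""
    return np.sum(np.abs(np.asarray(pred_depths) - np.asarray(lidar_depths)))

# Synthetic ray: free space up to 12 m, then an opaque surface.
t_vals = np.linspace(0.0, 20.0, 201)             # 0.1 m intervals
t_mid = 0.5 * (t_vals[:-1] + t_vals[1:])
sigmas = np.where(t_mid >= 12.0, 100.0, 0.0)

d_hat = expected_depth(sigmas, t_vals)
loss = lidar_depth_loss([d_hat], [12.0])
```

Here the rendered depth falls at the midpoint of the first occupied interval, so the residual against a 12 m LiDAR return is about half the sample spacing. In practice the sum is often averaged over the rays in a batch, which only rescales the loss.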

We also extended the radiance field to output LiDAR intensity and range in addition to RGB, giving a single unified scene representation that renders both modalities from the same density field.

What we built at Rivian

The hackathon prototype combined three components:

Instant-NGP backbone (hash-encoded NeRF) for fast scene optimization — ~15 min per log segment on 4× A100s vs. hours for vanilla NeRF.
Object-centric decomposition using detections from our production perception stack to initialize object pose tracks, then jointly refining scene and object representations.
LiDAR branch: a two-head network outputting both RGB (camera supervision) and $(r, \text{intensity})$ (LiDAR supervision) from the same density field.
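To make the two-head idea concrete, here is a toy NumPy sketch: one shared trunk produces the density, and separate heads produce RGB and LiDAR intensity. Everything specific in it is an assumption for illustration: the layer sizes, the random weights, the random features standing in for the hash encoding, and the choice to obtain range by compositing sample distances with the shared weights rather than predicting it directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Randomly initialised weights and zero bias for one linear layer."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TwoHeadField:
    """Shared trunk and density, with separate RGB and LiDAR-intensity heads."""

    def __init__(self, feat_dim=32, hidden=64):
        self.trunk = dense(feat_dim, hidden)
        self.sigma_out = dense(hidden, 1)
        self.rgb_head = dense(hidden + 3, 3)      # also sees the view direction
        self.lidar_head = dense(hidden + 3, 1)    # also sees the beam direction

    def __call__(self, feats, dirs):
        h = np.maximum(feats @ self.trunk[0] + self.trunk[1], 0.0)   # ReLU
        # Softplus keeps the shared density non-negative.
        sigma = np.logaddexp(0.0, h @ self.sigma_out[0] + self.sigma_out[1])[:, 0]
        hd = np.concatenate([h, dirs], axis=1)
        rgb = sigmoid(hd @ self.rgb_head[0] + self.rgb_head[1])
        intensity = sigmoid(hd @ self.lidar_head[0] + self.lidar_head[1])[:, 0]
        return sigma, rgb, intensity

def composite_weights(sigma, t_vals):
    """Volume-rendering weights shared by the camera and LiDAR outputs."""
    alphas = 1.0 - np.exp(-sigma * np.diff(t_vals))
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    return trans * alphas

n = 64
t_vals = np.linspace(0.5, 50.0, n + 1)
t_mid = 0.5 * (t_vals[:-1] + t_vals[1:])
feats = rng.normal(size=(n, 32))            # placeholder for encoded positions
dirs = np.tile([1.0, 0.0, 0.0], (n, 1))     # one ray direction, repeated

sigma, rgb, intensity = TwoHeadField()(feats, dirs)
w = composite_weights(sigma, t_vals)
pixel_rgb = w @ rgb               # camera output
lidar_range = w @ t_mid           # LiDAR range from the same weights
lidar_intensity = w @ intensity   # LiDAR intensity from the same weights
```

Because all three outputs are composited with the same weights, camera and LiDAR supervision both push on one density field, so geometry learned from either sensor benefits the other.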

The key result: for static and semi-static scenes, we could synthesize sensor data from novel viewpoints that our perception model could not tell apart from real data, as measured by comparing detection AP on held-out synthetic frames against AP on the corresponding real frames.

What still breaks

Fast motion. A cyclist traveling at 25 km/h moves ~0.7 m per frame at 10 Hz. When the 3-D tracker loses the object (occlusion, crowded scenes), the NeRF degrades significantly.
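The per-frame displacement follows directly from the speed and frame rate:

$$25\ \text{km/h} = \frac{25\,000\ \text{m}}{3600\ \text{s}} \approx 6.94\ \text{m/s}, \qquad \frac{6.94\ \text{m/s}}{10\ \text{Hz}} \approx 0.69\ \text{m per frame}$$

So if the pose track drops even a single frame, the object's samples for that frame sit roughly 0.7 m away from where the last known pose places them.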

Long log segments. A NeRF optimized over a 60-second log covers more of the scene but is much harder to optimize: the scene changes over that window, and the neural field has no clean way to represent nominally static structure that varies over time.

Radar. Radar reflections follow fundamentally different physics (specular multipath, Doppler), and those effects don't map naturally to a density-based volume.

Where this is going

The next frontier is generative NeRF — instead of reconstructing a scene from real sensor data, generate plausible new scenes conditioned on map layout and agent behavior specifications. Diffusion-model-based approaches that output driving video conditioned on scene graphs are emerging from Waymo, Wayve, and academic labs.

Thesis: neural rendering isn't replacing simulation; it's replacing the asset pipeline. The physics simulation stays. What changes is how scene geometry and appearance are represented: from hand-crafted meshes to learned continuous fields fit directly to real-world sensor logs.

References

Mildenhall et al., NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020.
Müller et al., Instant Neural Graphics Primitives with a Multiresolution Hash Encoding, SIGGRAPH 2022.
Tancik et al., Block-NeRF: Scalable Large Scene Neural View Synthesis, CVPR 2022.
Yang et al., UniSim: A Neural Closed-Loop Sensor Simulator, CVPR 2023 (Waabi).