Simulation is the backbone of AV development. You can't test a software update on 10 million real miles every sprint — you simulate them. But classical simulation (hand-crafted 3-D assets, rule-based sensor models) has a fundamental gap: the simulated world doesn't look like the real world, and perception models trained or evaluated in simulation don't transfer cleanly.
Neural rendering — specifically Neural Radiance Fields (NeRF) and its derivatives — offers a different deal: reconstruct a photorealistic, controllable scene representation directly from real sensor logs. At Rivian we ran a hackathon exploring this for our autonomy stack, which resulted in a patent. This post is a technical write-up of what we built, what worked, and what surprised us.
Why simulation fidelity matters so much
Consider a perception model that detects vehicles. In classical simulation, vehicles are rendered as textured 3-D meshes under an approximated sensor model. In the real world, LiDAR returns depend on surface reflectivity, material properties, raindrops, and sensor-specific scan patterns, none of which the classical model captures faithfully.
Neural rendering directly sidesteps the asset pipeline. Instead of modeling scenes analytically, it fits a continuous function:
$$f_\theta: (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$$
mapping a 3-D point $\mathbf{x}$ and view direction $\mathbf{d}$ to color $\mathbf{c}$ and volume density $\sigma$. Volume rendering then integrates along camera rays to produce photorealistic images.
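In practice the integral along a ray is approximated by sampling points and alpha-compositing them. The following is a minimal sketch of that standard NeRF quadrature, not the code from our prototype; the function name `composite_ray` and the toy inputs are illustrative assumptions.

```python
import math

def composite_ray(sigmas, colors, deltas):
    """Alpha-composite samples along one ray (standard NeRF quadrature).

    sigmas: density at each sample
    colors: (r, g, b) at each sample
    deltas: length of the ray segment each sample covers
    Returns the rendered color and the per-sample compositing weights.
    """
    transmittance = 1.0  # fraction of light that survives up to the current sample
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha
        weights.append(w)
        for i in range(3):
            rgb[i] += w * color[i]
        transmittance *= 1.0 - alpha
    return rgb, weights

# Toy ray: empty space, then a dense red surface, then green behind it.
rgb, weights = composite_ray(
    sigmas=[0.0, 50.0, 50.0],
    colors=[(0, 0, 1), (1, 0, 0), (0, 1, 0)],
    deltas=[1.0, 1.0, 1.0],
)
```

In the toy ray the red surface absorbs essentially all of the transmittance, so the green sample behind it contributes almost nothing to the pixel.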
The AV-specific challenges
Dynamic agents
A parked truck is fine. A moving cyclist is a disaster: it occupies a different position in each frame, and a single static field fit to those frames produces ghosting artifacts. We decompose the scene into a static background NeRF and per-object NeRFs, each represented in its own object-centric coordinate frame:
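$$\mathbf{x}_{\text{obj}} = R_k(t)^\top (\mathbf{x}_{\text{world}} - \mathbf{t}_k(t))$$
Here $R_k(t)$ and $\mathbf{t}_k(t)$ are the rotation and translation of object $k$'s tracked pose at time $t$.
Sparse viewpoint coverage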
A camera driving down a road sees each scene element from a narrow cone of viewpoints. We regularize with a depth prior from LiDAR:
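$$\mathcal{L}_\text{depth} = \left\| \hat{d} - d_\text{LiDAR} \right\|_1, \quad \hat{d} = \int_0^\infty t \cdot T(t)\sigma(t)\,dt$$
where $T(t) = \exp\left(-\int_0^t \sigma(s)\,ds\right)$ is the accumulated transmittance along the ray.
Multi-modal sensor rendering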
We extended the radiance field to output LiDAR intensity and range in addition to RGB — a single unified scene representation that renders both modalities from the same density field.
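Rendered LiDAR range is the same expected-depth quantity $\hat{d}$ used in the depth prior, so one routine can serve both. The sketch below assumes per-sample compositing weights $w_i = T_i\,(1 - e^{-\sigma_i \delta_i})$ have already been computed; the function names are hypothetical, and averaging over rays (rather than summing) is a choice made here that only rescales the loss.

```python
def expected_depth(weights, ts):
    """Discrete form of d_hat: compositing-weighted sum of sample depths."""
    return sum(w * t for w, t in zip(weights, ts))

def depth_l1_loss(pred_depths, lidar_depths):
    """Mean absolute depth error over rays that have a LiDAR return.

    A lidar depth of None marks a ray with no return; it is skipped.
    """
    errors = [abs(p - d) for p, d in zip(pred_depths, lidar_depths) if d is not None]
    return sum(errors) / len(errors) if errors else 0.0

# Two toy rays with made-up weights and sample depths in meters.
ts = [5.0, 10.0, 15.0]
pred = [
    expected_depth([0.0, 1.0, 0.0], ts),  # all weight on the 10 m sample
    expected_depth([0.5, 0.5, 0.0], ts),  # weight split between 5 m and 10 m
]
loss = depth_l1_loss(pred, [10.0, 9.0])
```

The second ray illustrates why the prior helps: density smeared between two depths yields a rendered depth that disagrees with the LiDAR return, and the loss penalizes it.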
What we built at Rivian
The hackathon prototype combined three components:
- Instant-NGP backbone (hash-encoded NeRF) for fast scene optimization — ~15 min per log segment on 4× A100s vs. hours for vanilla NeRF.
- Object-centric decomposition using detections from our production perception stack to initialize object pose tracks, then jointly refining scene and object representations.
- LiDAR branch: a two-head network outputting both RGB (camera supervision) and $(r, \text{intensity})$ (LiDAR supervision) from the same density field.
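For the object-centric decomposition, the world-to-object transform given earlier reduces to one rotation and one translation per tracked object per timestamp. Here is a minimal sketch, assuming a yaw-only rotation about the z-axis (a simplification made for illustration) and a hypothetical `track` dictionary in place of real tracker output.

```python
import math

def world_to_object(x_world, yaw, t_k):
    """Apply x_obj = R^T (x_world - t_k) for a yaw-only rotation R about z."""
    dx, dy, dz = (x_world[i] - t_k[i] for i in range(3))
    c, s = math.cos(yaw), math.sin(yaw)
    # R = [[c, -s, 0], [s, c, 0], [0, 0, 1]], so R^T rotates by -yaw.
    return (c * dx + s * dy, -s * dx + c * dy, dz)

# Hypothetical pose track for one object: timestamp -> (yaw, position).
track = {
    0.0: (0.0, (10.0, 0.0, 0.0)),
    0.1: (math.pi / 2, (10.0, 0.7, 0.0)),
}

# The point 1 m ahead of the object along its heading, in world
# coordinates at each timestamp.
front_t0 = world_to_object((11.0, 0.0, 0.0), *track[0.0])
front_t1 = world_to_object((10.0, 1.7, 0.0), *track[0.1])
```

Both world points map to approximately `(1, 0, 0)` in the object frame, which is the property the per-object NeRF relies on: the moving object looks static in its own coordinates.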
The key result: for static and semi-static scenes, we could synthesize sensor data from novel viewpoints that our perception model could not distinguish from real data, as measured by comparing detection AP on held-out synthetic data against detection AP on real data.
What still breaks
Fast motion. A cyclist traveling at 25 km/h moves ~0.7 m per frame at 10 Hz. When the 3-D tracker loses the object (occlusion, crowded scenes), the NeRF degrades significantly.
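The per-frame displacement follows from a unit conversion:

$$25\ \text{km/h} = \frac{25{,}000\ \text{m}}{3600\ \text{s}} \approx 6.94\ \text{m/s}, \qquad \frac{6.94\ \text{m/s}}{10\ \text{Hz}} \approx 0.69\ \text{m per frame}$$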
Long log segments. A NeRF optimized over a 60-second log covers more of the scene but is much harder to fit. Over that span the scene itself changes, and the neural field has no clean way to represent nominally static structure that varies over time.
Radar. Radar reflections have fundamentally different physics (specular multipath, Doppler) that don't map naturally to a density-based volume.
Where this is going
The next frontier is generative NeRF — instead of reconstructing a scene from real sensor data, generate plausible new scenes conditioned on map layout and agent behavior specifications. Diffusion-model-based approaches that output driving video conditioned on scene graphs are emerging from Waymo, Wayve, and academic labs.
References
- Mildenhall et al., NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020.
- Müller et al., Instant Neural Graphics Primitives with a Multiresolution Hash Encoding, SIGGRAPH 2022.
- Tancik et al., Block-NeRF: Scalable Large Scene Neural View Synthesis, CVPR 2022.
- Yang et al., UniSim: A Neural Closed-Loop Sensor Simulator, CVPR 2023 (Waabi).