Most 3D object detection pipelines project LiDAR point clouds into a bird's-eye-view (BEV) Cartesian grid, run a 2-D or 3-D convolution, and call it done. This works. But it bakes in a mismatch between the sensor's native data structure and the representation your model sees.
This post covers the core ideas behind our NeurIPS 2021 workshop paper on fast polar attentive detection, why polar representations are a fundamentally better fit for rotating LiDAR sensors, and the engineering trade-offs you hit in practice.
The Cartesian–LiDAR mismatch
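A rotating LiDAR sensor like the Velodyne HDL-64E fires lasers radially outward. Each return gives you $(r, \theta, \phi)$: range, azimuth, elevation. The native coordinate system is therefore spherical (polar, once you project onto the ground plane). Returns are evenly spaced in angle, so in Cartesian space they are dense at close range and sparse at long range.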
When you voxelize this into a $512 \times 512$ Cartesian grid:
- Near-field cells are over-sampled — multiple LiDAR beams hit the same voxel, wasting compute.
- Far-field cells are under-sampled — sparse returns are dropped or averaged, losing signal exactly where you need it most (long-range obstacles).
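To get a feel for the size of this effect, the sketch below estimates how many consecutive firings of a single laser fall across one Cartesian cell at different ranges. The ±100 m grid extent and the 0.2° azimuth step are illustrative assumptions, not figures from the paper or a sensor datasheet, and `samples_per_cell` is a hypothetical helper.

```python
import math


def samples_per_cell(range_m, cell_m, az_step_deg):
    """Approximate number of successive azimuth samples of one beam that
    fall across a cell of width cell_m at the given range."""
    arc_spacing_m = range_m * math.radians(az_step_deg)  # distance between neighbouring samples
    return cell_m / arc_spacing_m


CELL_M = 200.0 / 512   # assumed: grid spans -100 m to +100 m over 512 cells
AZ_STEP_DEG = 0.2      # assumed azimuth step between firings

for r in (5.0, 25.0, 100.0):
    n = samples_per_cell(r, CELL_M, AZ_STEP_DEG)
    print(f"range {r:5.1f} m: ~{n:.1f} samples per cell per beam")
```

Under these assumptions a cell 5 m from the sensor sees about twenty times as many samples per beam as a cell at 100 m, because the ratio is simply the ratio of the two ranges.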
Polar representation
Instead of voxelizing into $(x, y)$, we voxelize into $(r, \theta)$: range and azimuth. This gives a range-azimuth image of shape $(R, A, C)$ where:
- $R$ = number of range bins (e.g., 512 for 0–100 m)
- $A$ = number of azimuth bins (e.g., 512 for 360°)
- $C$ = feature channels (intensity, height, occupancy, …)
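Each return now maps to exactly one cell, and every azimuth bin covers the same angular wedge of the sweep, so each bin receives about the same number of laser firings regardless of range. The over- and under-sampling of the Cartesian grid largely disappears by construction, although returns from different elevation channels can still land in the same $(r, \theta)$ cell.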
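Uniform in index space does not mean uniform in metres: a polar cell's radial depth is constant, but its arc width grows with range. The sketch below computes both for the example bin counts above (512 range bins over 0–100 m, 512 azimuth bins). `polar_cell_size` is a hypothetical helper written for this post.

```python
import math


def polar_cell_size(r_idx, r_bins=512, a_bins=512, r_max=100.0):
    """Return (radial_depth_m, arc_width_m) for the cell at range index r_idx."""
    dr = r_max / r_bins                # constant radial depth of every cell
    dtheta = 2.0 * math.pi / a_bins    # constant angular width of every cell
    r_center = (r_idx + 0.5) * dr      # range at the middle of the cell
    return dr, r_center * dtheta       # arc width scales linearly with range


for idx in (0, 127, 511):
    depth, width = polar_cell_size(idx)
    print(f"range bin {idx:3d}: depth {depth:.3f} m, arc width {width:.4f} m")
```

With these settings the arc width goes from about a millimetre in the innermost ring to roughly 1.2 m in the outermost one, which is the flip side of giving every beam its own cell.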
Attention in polar space
We add a learned, range-dependent positional bias to the attention logits before the softmax:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + B(r)\right) V$$
where $B(r) \in \mathbb{R}^{H \times L \times L}$ is a learned per-range-bin bias ($H$ attention heads, $L$ tokens). This lets the model adaptively widen or narrow its effective receptive field with range.
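A minimal NumPy sketch of this biased attention follows. It treats the bias as an array that has already been looked up for the current range bin, since how $B(r)$ is indexed and trained is not covered here. The function name, the toy shapes, and the hand-built "narrow" bias are assumptions for illustration only.

```python
import numpy as np


def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    q, k, v: arrays of shape (H, L, d); bias: array of shape (H, L, L).
    Returns (output of shape (H, L, d), attention weights of shape (H, L, L)).
    """
    d = q.shape[-1]
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


H, L, d = 2, 4, 8  # toy sizes
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((H, L, d)) for _ in range(3))

# Zero bias: plain attention. Strongly negative off-diagonal bias: each token
# attends almost only to itself, i.e. a very narrow receptive field.
no_bias = np.zeros((H, L, L))
narrow = np.tile(np.where(np.eye(L, dtype=bool), 0.0, -1e4), (H, 1, 1))

_, w_plain = biased_attention(q, k, v, no_bias)
out_narrow, w_narrow = biased_attention(q, k, v, narrow)
print("max off-diagonal weight, plain :", (w_plain * (1 - np.eye(L))).max())
print("max off-diagonal weight, narrow:", (w_narrow * (1 - np.eye(L))).max())
```

Because the bias enters before the softmax, it reshapes the attention distribution rather than rescaling the output, which is what allows a per-range bias to widen or narrow the region each token attends to.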
Streaming inference
A standard LiDAR stack accumulates a full 360° sweep before running detection (~100 ms at 10 Hz). With a polar representation you can process partial sweeps:
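$$\text{latency}_{\text{Cartesian}} \approx 100 + 80 = 180 \text{ ms}$$
$$\text{latency}_{\text{polar streaming}} \approx \frac{100}{12} + 12 \approx 20 \text{ ms}$$
In both expressions the first term is the time spent waiting for data and the second is inference time: a full 100 ms sweep plus 80 ms of compute in the Cartesian case, versus one twelfth of a sweep (a 30° sector) plus 12 ms of compute when streaming. This roughly 9× latency reduction is the main practical argument for polar representations in safety-critical systems.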
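The same arithmetic as a small helper, which makes it easy to try other sector counts. It reads the formulas as "time to acquire one sector plus time to process it", with 12 sectors and 80 ms / 12 ms inference times taken from the numbers above. `end_to_end_latency_ms` is a hypothetical name, and the model ignores any overlap between acquisition and compute.

```python
def end_to_end_latency_ms(sweep_ms, n_sectors, inference_ms):
    """Wait for one sector of the sweep to arrive, then run inference on it."""
    return sweep_ms / n_sectors + inference_ms


full_sweep = end_to_end_latency_ms(sweep_ms=100.0, n_sectors=1, inference_ms=80.0)
streaming = end_to_end_latency_ms(sweep_ms=100.0, n_sectors=12, inference_ms=12.0)

print(f"full sweep: {full_sweep:.1f} ms")
print(f"streaming:  {streaming:.1f} ms")
print(f"speed-up:   {full_sweep / streaming:.1f}x")
```

The unrounded speed-up is about 8.9×, which rounds to the 9× quoted in the text.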
Key implementation
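```python
import math

import torch


def polar_voxelize(points, r_bins=512, a_bins=512, r_max=100.0):
    """Scatter an (N, 4) tensor of [x, y, z, intensity] into an (R, A, C) polar grid."""
    x, y = points[:, 0], points[:, 1]
    r = torch.sqrt(x**2 + y**2)
    keep = r < r_max  # drop out-of-range points instead of piling them into the last bin
    a = (torch.atan2(y, x) + math.pi) / (2 * math.pi)  # azimuth normalized to [0, 1]
    r_idx = (r[keep] / r_max * r_bins).long().clamp(0, r_bins - 1)
    a_idx = (a[keep] * a_bins).long().clamp(0, a_bins - 1)
    voxel = torch.zeros(r_bins, a_bins, 4, dtype=points.dtype, device=points.device)
    # Points that share a cell overwrite one another; which one survives is not guaranteed.
    voxel[r_idx, a_idx] = points[keep, :4]
    return voxel  # (R, A, C), ready for a 2-D backbone
```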
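When several points fall into the same cell, the indexed assignment above keeps only one of them. One common alternative is to average the points in each cell. The NumPy sketch below does that with `np.add.at`, which accumulates correctly over repeated indices. This is an illustrative variant, not necessarily the aggregation used in the paper, and `polar_voxelize_mean` is a hypothetical name.

```python
import numpy as np


def polar_voxelize_mean(points, r_bins=512, a_bins=512, r_max=100.0):
    """Average [x, y, z, intensity] over all points in each (range, azimuth) cell.

    points: float array of shape (N, 4). Returns (grid of shape (R, A, 4),
    per-cell point counts of shape (R, A)). Points with r >= r_max are dropped.
    """
    x, y = points[:, 0], points[:, 1]
    r = np.hypot(x, y)
    keep = r < r_max
    a = (np.arctan2(y, x) + np.pi) / (2 * np.pi)  # azimuth normalized to [0, 1]
    r_idx = np.clip((r[keep] / r_max * r_bins).astype(np.int64), 0, r_bins - 1)
    a_idx = np.clip((a[keep] * a_bins).astype(np.int64), 0, a_bins - 1)

    sums = np.zeros((r_bins, a_bins, 4), dtype=np.float64)
    counts = np.zeros((r_bins, a_bins), dtype=np.int64)
    np.add.at(sums, (r_idx, a_idx), points[keep, :4])  # unbuffered: duplicates accumulate
    np.add.at(counts, (r_idx, a_idx), 1)

    grid = sums / np.maximum(counts, 1)[..., None]  # empty cells stay zero
    return grid, counts


demo_points = np.array([
    [10.00, 0.0, 1.0, 0.2],   # these two points share a cell
    [10.01, 0.0, 3.0, 0.4],
    [200.0, 0.0, 0.0, 1.0],   # beyond r_max, dropped
])
demo_grid, demo_counts = polar_voxelize_mean(demo_points)
print("occupied cells:", int((demo_counts > 0).sum()))
print("points kept:   ", int(demo_counts.sum()))
```

The returned counts double as an occupancy channel, so they can be stacked onto the grid as an extra feature if needed.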
Results
| Method | AP_3D ↑ | Latency ↓ |
|---|---|---|
| PointPillars | 77.3 | 16 ms |
| CenterPoint | 85.0 | 22 ms |
| Ours (polar attn) | 86.2 | 11 ms |
Non-obvious trade-offs
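Distortion at close range. In polar space, a square object near the sensor looks like a trapezoid, because its near edge subtends a wider azimuth angle than its far edge. The model must learn to undo this during box regression. In practice that is not a problem, but it makes data augmentation harder: translating an object in Cartesian space is no longer a simple shift of cells in the polar grid.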
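The size of the distortion is easy to quantify. The sketch below compares the azimuth angle subtended by the near and far edges of a 2 m square placed straight ahead of the sensor, once at 3 m and once at 50 m. The object size, the positions, and both helper names are assumptions chosen for illustration.

```python
import math


def edge_azimuth_span_deg(x, half_width):
    """Azimuth angle subtended by an edge of length 2 * half_width that is
    perpendicular to, and centred on, the +x axis at distance x."""
    return 2.0 * math.degrees(math.atan2(half_width, x))


def square_edge_ratio(center_x, half_size=1.0):
    """Ratio of near-edge to far-edge azimuth span for an axis-aligned square."""
    near = edge_azimuth_span_deg(center_x - half_size, half_size)
    far = edge_azimuth_span_deg(center_x + half_size, half_size)
    return near / far


for cx in (3.0, 50.0):
    print(f"square centred at {cx:4.1f} m: near/far azimuth span = {square_edge_ratio(cx):.2f}")
```

For the square centred 3 m ahead, the near edge spans almost twice the azimuth of the far edge, while at 50 m the two differ by only a few percent, so the distortion is essentially a near-field effect.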
Azimuth resolution. Finer azimuth = more bins = more compute. 0.5° (720 bins) is a sweet spot for vehicles; pedestrian detection benefits from 0.25° (1440 bins) at the cost of a 2× larger tensor.
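The bin counts and the memory cost follow directly from the resolution. The sketch below works them out for a grid with 512 range bins and 4 float32 channels. Those two values are assumptions carried over from the earlier examples, and both function names are hypothetical.

```python
def azimuth_bins(resolution_deg):
    """Number of azimuth bins needed to cover 360 degrees at the given resolution."""
    return round(360.0 / resolution_deg)


def grid_megabytes(r_bins, a_bins, channels, bytes_per_value=4):
    """Size of one dense (R, A, C) grid in megabytes (float32 by default)."""
    return r_bins * a_bins * channels * bytes_per_value / 1e6


for res in (0.5, 0.25):
    bins = azimuth_bins(res)
    size = grid_megabytes(r_bins=512, a_bins=bins, channels=4)
    print(f"{res:.2f} deg -> {bins} bins, {size:.1f} MB per grid")
```

Activation memory in the backbone scales the same way, so halving the azimuth step roughly doubles the cost of every layer that operates at full resolution.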
References
- Bhat et al., Fast Polar Attentive 3D Object Detection on LiDAR Point Clouds, NeurIPS 2021 Workshop.