LERF: Language Embedded Radiance Fields

submitted by
Style Pass
2023-03-17 02:00:04

LERF optimizes a dense, multi-scale 3D language field by volume rendering CLIP embeddings along training rays, supervising these embeddings with multi-scale CLIP features across multi-view training images. After optimization, LERF can extract 3D relevancy maps for language queries interactively and in real time. LERF enables pixel-aligned queries of the distilled 3D CLIP embeddings without relying on region proposals, masks, or fine-tuning, and it supports long-tail, open-vocabulary queries hierarchically across the volume.
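
To make "volume rendering CLIP embeddings along a ray" concrete, here is a minimal NumPy sketch. It assumes the standard NeRF-style compositing weights and a unit-norm rendered embedding; the function names (render_embedding, cosine_relevancy) and the plain cosine score are illustrative assumptions, not LERF's actual code or its exact relevancy score.

```python
import numpy as np

def render_embedding(sigmas, deltas, embeddings):
    """Composite per-sample embeddings along one ray (hypothetical helper).

    sigmas:     (N,) volume densities at the ray samples
    deltas:     (N,) distances between consecutive samples
    embeddings: (N, D) language embeddings predicted at the samples
    """
    # Opacity contributed by each sample.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    # Weighted sum of embeddings, then projection onto the unit sphere
    # (assumed here so that dot products behave like cosine similarity).
    rendered = weights @ embeddings
    rendered = rendered / max(np.linalg.norm(rendered), 1e-8)
    return rendered, weights

def cosine_relevancy(rendered, query):
    """Simplified relevancy: cosine similarity with a text-query embedding."""
    return float(rendered @ query / np.linalg.norm(query))
```

In this sketch, a ray whose first opaque sample carries a given embedding renders to (approximately) that embedding, which is why occluded content behind a surface contributes little to the query score.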

With multi-view supervision, 3D CLIP embeddings are more robust to occlusion and viewpoint changes than 2D CLIP embeddings. 3D CLIP embeddings also conform better to the 3D scene structure, giving them a crisper appearance.

To supervise language embeddings, we pre-compute an image pyramid of CLIP features for each training view. Then, during optimization, each sampled ray is supervised with a CLIP embedding interpolated from this pyramid.
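
The pyramid lookup can be sketched as follows. This is an illustration under stated assumptions: each pyramid level is taken to be a regular grid of CLIP features, the lookup is bilinear within a level and linear between the two nearest scales, and the result is renormalized. The names bilinear and sample_pyramid, the grid layout, and the normalized pixel coordinates are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def bilinear(grid, u, v):
    """Sample a (H, W, D) feature grid at continuous coords u in [0, W-1], v in [0, H-1]."""
    h, w, _ = grid.shape
    x0 = int(np.clip(np.floor(u), 0, w - 1))
    y0 = int(np.clip(np.floor(v), 0, h - 1))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0
    top = (1 - fx) * grid[y0, x0] + fx * grid[y0, x1]
    bottom = (1 - fx) * grid[y1, x0] + fx * grid[y1, x1]
    return (1 - fy) * top + fy * bottom

def sample_pyramid(scales, grids, x, y, s):
    """Interpolate a supervision embedding for one ray (hypothetical helper).

    scales: increasing 1-D array of pyramid scales
    grids:  list of (H_k, W_k, D) feature grids, one per scale
    x, y:   pixel position normalized to [0, 1]
    s:      scale at which the ray is supervised
    """
    s = float(np.clip(s, scales[0], scales[-1]))
    # Indices of the pyramid levels just below and above s.
    hi = int(np.clip(np.searchsorted(scales, s), 1, len(scales) - 1))
    lo = hi - 1
    t = (s - scales[lo]) / (scales[hi] - scales[lo])
    feats = []
    for k in (lo, hi):
        h, w, _ = grids[k].shape
        feats.append(bilinear(grids[k], x * (w - 1), y * (h - 1)))
    out = (1 - t) * feats[0] + t * feats[1]
    # Renormalize so the target stays comparable to unit-norm CLIP features.
    return out / max(np.linalg.norm(out), 1e-8)
```

Precomputing the grids once per training view means the expensive CLIP encoder never runs inside the optimization loop; each ray only pays for a handful of array reads.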
