In the sparse-to-dense depth completion problem, we seek to infer the dense depth map of a 3-D scene from an RGB image and its associated sparse depth map, obtained either from computational methods such as SfM (Structure-from-Motion) or from active sensors such as lidar or structured-light sensors.
We propose a method that leverages the abundance of synthetic data (where ground truth comes for free) and unannotated real data to learn cross-modal fusion for depth completion.
The challenge of Sim2Real: There exists a covariate shift, mostly photometric, between the synthetic and real domains, which makes it difficult to transfer models trained on synthetic source data to real target data. We observe, however, that unlike photometry, the geometry of a given scene persists across domains. We can therefore bypass the photometric domain gap by using the abundance of synthetic data to learn the association not from photometry to geometry (images to shapes), but from sparse geometry (point clouds) to topology. In doing so, we sidestep the concerns of covariate shift and domain adaptation.
ScaffNet: The challenge of sparse-to-dense depth completion is precisely the sparsity. To learn a representation of the sparse point cloud that can capture the complex geometry of objects, we introduce ScaffNet, an encoder-decoder network augmented with our version of a Spatial Pyramid Pooling (SPP) module. Our SPP module performs max pooling with various kernel sizes to densify the inputs and to capture different receptive fields, and it learns to balance the tradeoff between the density and the details of the sparse point cloud.
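To make the densification effect of multi-kernel max pooling concrete, the following is a minimal sketch in plain Python. It is not the ScaffNet implementation: the function names (`max_pool2d`, `pyramid_pool`, `density`), the kernel sizes (3 and 5), the stride of 1, and the convention that missing depth is encoded as zero are all assumptions made for illustration. The learned weighting that balances density against detail is omitted entirely.

```python
def max_pool2d(depth, kernel):
    """Stride-1 max pooling over a 2-D list; windows are clipped at the borders.

    Assumes an odd kernel size and non-negative depth, with 0 meaning "no measurement".
    """
    h, w = len(depth), len(depth[0])
    r = kernel // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Take the maximum over the (clipped) kernel x kernel neighbourhood.
            out[y][x] = max(
                depth[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            )
    return out


def pyramid_pool(depth, kernels=(3, 5)):
    """Return the input map followed by one pooled map per kernel size (hypothetical sizes)."""
    return [depth] + [max_pool2d(depth, k) for k in kernels]


def density(depth):
    """Fraction of pixels that carry a valid (non-zero) depth value."""
    values = [v for row in depth for v in row]
    return sum(v > 0 for v in values) / len(values)


# Usage: a 5x5 map with a single measurement at the centre.
sparse = [[0.0] * 5 for _ in range(5)]
sparse[2][2] = 4.0
for level in pyramid_pool(sparse):
    print(round(density(level), 2))
```

Larger kernels fill more of the map but smear each measurement over a wider area, which is the density-versus-detail tradeoff described above. In a network, the pooled maps would typically be stacked as channels and combined by learned layers, a step this sketch does not attempt to reproduce.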