Challenge Goals

The challenge is a standard depth estimation problem: for each of the N stereo views in the dataset, the objective is to predict the depth at each pixel in the left eye of the stereo pair. We do not require participants to use stereo correspondence to estimate this depth, although we provide stereo data in case it is useful.

Data

We have collected datasets of fresh porcine cadaver abdominal anatomy using a da Vinci Xi endoscope and a projector to obtain high quality depth maps of the scene. Ground truth has been established by projecting a sequence of coded structured light images that uniquely encode every projector pixel following the methods of [1,2]. 

In each dataset, we perform this process for 5-10 different camera positions, obtaining high-quality depth maps for both the left and right eye of the stereo camera pair. The values in the ground truth depth maps will be in mm, and invalid pixels will be masked out. Because the camera must remain stationary during each structured light projection, we increase the size of each dataset by moving the camera and warping the depth maps using the known camera poses from the da Vinci Xi kinematics. These camera poses (relative to the first frame) will each be released as a 4x4 matrix alongside the stereo camera calibration for the sequence.
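As a rough illustration of how a depth map, a calibration, and a 4x4 pose fit together, the sketch below lifts a depth map to 3D points and applies a rigid transform. It assumes a pinhole intrinsic matrix K, depth measured along the optical axis, and a pose that maps points from one camera frame into another. The function names are hypothetical, and the released file formats and pose convention may differ.

```python
import numpy as np

def backproject(depth_mm, K):
    """Lift an H x W depth map (mm) to H x W x 3 camera-frame points.

    Assumes a pinhole model with intrinsics K and depth along the z axis.
    """
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column, row
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth_mm / fx
    y = (v - cy) * depth_mm / fy
    return np.stack([x, y, depth_mm], axis=-1)

def apply_pose(points, T):
    """Apply a 4x4 rigid transform T to an (..., 3) array of points."""
    R, t = T[:3, :3], T[:3, 3]
    return points @ R.T + t
```

Re-projecting the transformed points with K would give a warped depth map for the new view; whether the pose must be inverted first depends on the convention used in the release.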

Some example frames showing our data capture setup. The projector illuminates the scene with several patterns which are used to create a high quality reconstruction.

Rules

External training data that is publicly available is allowed but must be disclosed in the submission. Additional stereo endoscopic video will be available on the challenge website and may be useful in training unsupervised depth estimation methods.

Participants must submit an entry for every pixel. We will penalize each pixel that does not have a valid entry (e.g. a NaN or Inf value) with a fixed penalty value that will be much larger than a typical bad guess.
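One way such a penalty could be applied is sketched below. The penalty value, the boolean validity mask for the ground truth, and the function name are all assumptions made for illustration; the organizers' actual scoring code and penalty are not specified here.

```python
import numpy as np

def absolute_error_with_penalty(pred_mm, gt_mm, gt_valid, penalty_mm=1000.0):
    """Per-pixel absolute error (mm) over valid ground-truth pixels.

    Predictions that are NaN or Inf receive a fixed penalty instead of an
    error. penalty_mm is a placeholder, not the challenge's real value.
    """
    pred_ok = np.isfinite(pred_mm)
    err = np.full(gt_mm.shape, penalty_mm, dtype=float)
    err[pred_ok] = np.abs(pred_mm[pred_ok] - gt_mm[pred_ok])
    # Keep only pixels where the ground truth is not masked out.
    return err[gt_valid]
```

Participants may want to run a similar check with np.isfinite on their own outputs before submitting, so that no pixel is left without a finite value.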

Evaluation Criteria

To evaluate the depth estimates, we will use two measures informed by the Middlebury challenge. Our first measure will be a 'bad N' score, where participants will be ranked according to the number of pixels in their depth map that have an error above N. The value of N will not be revealed until the challenge date. The second measure will be a per-pixel accuracy measure. We are deliberately vague about these measures to avoid participants overly fine-tuning their methods to a particular baseline.
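For intuition, a 'bad N' count could be computed along the following lines. This is a minimal sketch that assumes absolute error in mm, a boolean mask marking valid ground-truth pixels, and that non-finite predictions count as bad; the threshold used in the example is arbitrary, and the official definition may differ.

```python
import numpy as np

def bad_n_count(pred_mm, gt_mm, gt_valid, n_mm):
    """Count valid ground-truth pixels whose absolute error exceeds n_mm.

    NaN or Inf predictions fail the `<=` test and are therefore counted
    as bad (an assumption about how invalid entries are treated).
    """
    with np.errstate(invalid="ignore"):
        err = np.abs(pred_mm - gt_mm)
        within = err <= n_mm
    return int(np.count_nonzero(~within & gt_valid))
```

Dividing the count by the number of valid ground-truth pixels would give a percentage, which is how Middlebury-style bad-pixel scores are commonly reported.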

References

[1] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), volume 1, pages 195-202, Madison, WI, June 2003.

[2] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition (GCPR 2014), Münster, Germany, September 2014.