Introduction
The VideoCompletion project introduces the first benchmark for video-completion methods. We present results for different methods on a range of diverse test sequences, which can be viewed in a player equipped with a movable zoom region. Additionally, we provide the results of an objective analysis using quality metrics that were carefully selected in our study of video-completion perceptual quality. We believe this work can help rank existing methods and assist developers of new general-purpose video-completion methods.
Data set
Our current data set consists of 7 video sequences with ground-truth completion results. We consider object removal, so the test sequences are constructed by compositing various foreground objects over a set of background videos. Some of the background videos are left-view sequences from the RMIT3dv stereoscopic-video data set [1]. As foreground objects we use those employed in the video-matting benchmark [2], as well as several 3D models. To seamlessly insert a 3D model into a background video we use the motion-tracking tools of Blender [3]. Each video-completion method takes the composited sequence and the corresponding object mask as input.
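This construction is what makes ground truth available: inside the hole, the ground-truth video is simply the unmodified background. Below is a minimal sketch of compositing one frame, assuming per-frame RGB arrays and an alpha matte for the foreground object (the function name and array shapes are illustrative, not our actual pipeline):

```python
import numpy as np

def composite_frame(background: np.ndarray,
 foreground: np.ndarray,
 alpha: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
 """Alpha-composite a foreground object over a background frame.

 background, foreground: float32 RGB frames in [0, 1], shape (H, W, 3).
 alpha: float32 matte in [0, 1], shape (H, W, 1).
 Returns the composited frame and the object mask that a
 video-completion method receives as input.
 """
 composite = alpha * foreground + (1.0 - alpha) * background
 mask = alpha[..., 0] > 0 # True inside the region to be completed
 return composite, mask
```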
Evaluation Methodology
Video-completion results are seldom expected to adhere exactly to ground truth; they are usually judged only by their plausibility, as assessed by a human observer. This makes objective quality assessment of video completion inherently difficult. However, by relaxing the requirement of complete adherence to ground truth, we can increase correlation with perceptual completion quality. This benchmark employs four quality metrics: MS-DSSIM, MS-DSSIMdt, MS-CDSSIM, and MS-CDSSIMdt. A thorough description and comparative analysis of these and other metrics can be found in our paper (to be published soon).
The MS-DSSIM metric measures the adherence of a completion result V to the ground-truth video Vref in a multi-scale fashion, with scale weights determined using perceptual-quality data. It is based on structural similarity index (SSIM) [4] values computed for all 9×9 luminance patches P(x) within the spatio-temporal hole.
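Schematically, the metric has the following form (a sketch only: the per-level weights w_i and the notation Ω^i for the hole at pyramid level i are our shorthand, not the exact definition):

```latex
% Schematic form of MS-DSSIM: a weighted multi-scale average of per-patch
% DSSIM values inside the hole (w_i and \Omega^i are assumed notation).
\mathrm{MS\text{-}DSSIM}(V, V_{\mathrm{ref}}) =
 \sum_{i} w_i \cdot \frac{1}{|\Omega^i|}
 \sum_{x \in \Omega^i}
 \left( 1 - \mathrm{SSIM}\!\left( P^i(x),\, P^i_{\mathrm{ref}}(x) \right) \right)
```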
Here the superscript i denotes the level of the Gaussian pyramid: Vref^0 is the original ground-truth video, and Vref^1 is that video blurred and subsampled by a factor of two in both spatial dimensions.
The MS-DSSIMdt metric captures temporal coherency along ground-truth optical-flow vectors.
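One schematic reading of this (our assumption about the construction, not the exact definition): compare how patches change between consecutive frames in the result and in the ground truth, tracking each patch along the ground-truth flow f:

```latex
% Schematic: DSSIM between temporal patch differences taken along the
% ground-truth optical flow f. All notation here is assumed.
\mathrm{MS\text{-}DSSIM}_{dt}(V, V_{\mathrm{ref}}) =
 \sum_{i} w_i \cdot \frac{1}{|\Omega^i|} \sum_{x \in \Omega^i}
 \Bigl( 1 - \mathrm{SSIM}\bigl(
 P^i_t(x) - P^i_{t-1}(x - f^i(x)),\;
 P^i_{\mathrm{ref},t}(x) - P^i_{\mathrm{ref},t-1}(x - f^i(x))
 \bigr) \Bigr)
```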
MS-CDSSIM relies on the assumption that the completion result should be locally similar to the ground truth: each patch P(x) within the spatio-temporal hole should have a similar ground-truth patch Pref(y).
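Schematically (again with assumed notation), this amounts to taking, for every patch in the hole, the DSSIM to its most similar ground-truth patch:

```latex
% Schematic: per-patch distance to the nearest ground-truth patch.
% The search domain for y and the weights w_i are assumed notation.
\mathrm{MS\text{-}CDSSIM}(V, V_{\mathrm{ref}}) =
 \sum_{i} w_i \cdot \frac{1}{|\Omega^i|} \sum_{x \in \Omega^i}
 \min_{y} \left( 1 - \mathrm{SSIM}\!\left( P^i(x),\, P^i_{\mathrm{ref}}(y) \right) \right)
```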
MS-CDSSIMdt is a temporal-stability metric that uses the same assumptions as MS-CDSSIM. Essentially, it captures changes in patch appearance from frame to frame, as opposed to evaluating consistency with the ground-truth optical flow as MS-DSSIMdt does. To do so, for a given patch we find the most similar patch in the previous frame within a certain window, compute the distances from both of these patches to their most similar ground-truth patches, and then compare the respective distances.
The search window is a square region (with side equal to 1/10th of the frame width) spatially centered at the patch position and located in the previous frame.
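Under one schematic reading of this procedure (all notation below is assumed): let ρ(Q) denote the DSSIM from a patch Q to its most similar ground-truth patch, and let y*(x) be the best previous-frame match within the window W(x); the metric then compares the two distances:

```latex
% Schematic: y* is the best-matching previous-frame patch within W(x);
% \rho(Q) is the DSSIM to the nearest ground-truth patch. Assumed notation.
y^{*}(x) = \arg\min_{y \in W(x)} \bigl( 1 - \mathrm{SSIM}( P_t(x),\, P_{t-1}(y) ) \bigr),
\qquad
\rho(Q) = \min_{z} \bigl( 1 - \mathrm{SSIM}( Q,\, P_{\mathrm{ref}}(z) ) \bigr),

\mathrm{MS\text{-}CDSSIM}_{dt} \;\propto\;
 \sum_{x \in \Omega} \bigl| \rho( P_t(x) ) - \rho( P_{t-1}(y^{*}(x)) ) \bigr|
```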
Exact computation of MS-CDSSIM and MS-CDSSIMdt quickly becomes impractical for larger spatio-temporal holes, so we resort to approximate solutions based on the PatchMatch [5] algorithm.
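To see where the cost comes from, here is a minimal brute-force sketch of the nearest-ground-truth-patch search that these metrics need for every hole patch (the helper and the SSD distance are illustrative; the metrics themselves use an SSIM-based distance, and a randomized PatchMatch-style search replaces the exhaustive scan):

```python
import numpy as np

def nearest_patch_dist(patch: np.ndarray, ref_patches: np.ndarray) -> float:
 """Brute-force distance from one patch to the most similar
 ground-truth patch.

 patch: (9, 9) luminance patch from the completion result.
 ref_patches: (N, 9, 9) stack of candidate ground-truth patches.
 Exhaustive search costs O(N) patch comparisons per hole patch, so the
 total cost grows with (hole size) x (number of candidate patches),
 which is why an approximate PatchMatch-style search is used instead.
 """
 diffs = ref_patches - patch[None, :, :]
 ssd = np.einsum('nij,nij->n', diffs, diffs) # sum of squared diffs
 return float(ssd.min())
```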
Participate
We invite developers of video-completion methods to use our benchmark. We evaluate the submitted data and report the quality scores to the developer; we publish results on our site only when the developer explicitly grants permission. The test sequences with the corresponding completion masks are available for download: Deck, Library, Fountain, Wires, Tower, Skyscrapers, Sign.
For evaluation requests, or if you have any questions or suggestions, please feel free to contact us by email: abokov@graphics.cs.msu.ru.
Evaluation
Objective metric values (lower is better). The superscript next to each value is the method's rank on that sequence; the "Avg. rank" column is the average of these per-sequence ranks.

MS-DSSIM

| Method | Avg. rank | Deck | Library | Fountain | Wires | Tower | Skyscrapers | Sign |
|---|---|---|---|---|---|---|---|---|
| Background Reconstruction⁺ [6] | 1.9 | 0.221¹ | 0.192² | 0.217⁴ | 0.070¹ | 0.090² | 0.108² | 0.083¹ |
| PFClean Remove Rig⁺ [7] | 2.7 | 0.307⁴ | 0.187¹ | 0.077¹ | 0.094² | 0.143⁴ | 0.163⁴ | 0.106³ |
| Planar Structure Guidanceⁱ [8] | 5.4 | 0.318⁵ | 0.603⁵ | 0.682⁶ | 0.240⁶ | 0.177⁵ | 0.302⁵ | 0.438⁶ |
| Nuke F_RigRemoval⁺ [9] | 2.0 | 0.291² | 0.211³ | 0.078² | 0.120³ | 0.068¹ | 0.091¹ | 0.104² |
| Telea Inpaintingⁱ [10] | 4.7 | 0.333⁶ | 0.623⁶ | 0.614⁵ | 0.206⁵ | 0.141³ | 0.133³ | 0.367⁵ |
| Complex Scenesᵐ [11] | 4.3 | 0.307³ | 0.252⁴ | 0.116³ | 0.162⁴ | 0.195⁶ | 0.355⁶ | 0.237⁴ |

MS-DSSIMdt

| Method | Avg. rank | Deck | Library | Fountain | Wires | Tower | Skyscrapers | Sign |
|---|---|---|---|---|---|---|---|---|
| Background Reconstruction⁺ [6] | 1.7 | 0.013¹ | 0.005¹ | 0.036⁴ | 0.007¹ | 0.007² | 0.009² | 0.004¹ |
| PFClean Remove Rig⁺ [7] | 2.9 | 0.018³ | 0.007² | 0.003¹ | 0.011³ | 0.013⁵ | 0.016⁴ | 0.009² |
| Planar Structure Guidanceⁱ [8] | 6.0 | 0.156⁶ | 0.301⁶ | 0.458⁶ | 0.092⁶ | 0.067⁶ | 0.145⁶ | 0.186⁶ |
| Nuke F_RigRemoval⁺ [9] | 2.9 | 0.020⁵ | 0.012⁴ | 0.004² | 0.012⁴ | 0.003¹ | 0.006¹ | 0.009³ |
| Telea Inpaintingⁱ [10] | 4.4 | 0.013² | 0.092⁵ | 0.197⁵ | 0.016⁵ | 0.010⁴ | 0.019⁵ | 0.046⁵ |
| Complex Scenesᵐ [11] | 3.1 | 0.018⁴ | 0.009³ | 0.008³ | 0.011² | 0.009³ | 0.016³ | 0.021⁴ |

MS-CDSSIM

| Method | Avg. rank | Deck | Library | Fountain | Wires | Tower | Skyscrapers | Sign |
|---|---|---|---|---|---|---|---|---|
| Background Reconstruction⁺ [6] | 2.0 | 0.118¹ | 0.056³ | 0.119⁴ | 0.040¹ | 0.055² | 0.081² | 0.043¹ |
| PFClean Remove Rig⁺ [7] | 2.4 | 0.141⁴ | 0.045¹ | 0.020¹ | 0.056² | 0.088³ | 0.122⁴ | 0.046² |
| Planar Structure Guidanceⁱ [8] | 5.9 | 0.200⁶ | 0.293⁶ | 0.409⁶ | 0.158⁶ | 0.118⁶ | 0.227⁵ | 0.288⁶ |
| Nuke F_RigRemoval⁺ [9] | 2.4 | 0.140³ | 0.075⁴ | 0.027² | 0.072³ | 0.041¹ | 0.069¹ | 0.050³ |
| Telea Inpaintingⁱ [10] | 4.6 | 0.183⁵ | 0.286⁵ | 0.313⁵ | 0.130⁵ | 0.092⁴ | 0.102³ | 0.212⁵ |
| Complex Scenesᵐ [11] | 3.7 | 0.119² | 0.053² | 0.028³ | 0.090⁴ | 0.117⁵ | 0.278⁶ | 0.099⁴ |

MS-CDSSIMdt

| Method | Avg. rank | Deck | Library | Fountain | Wires | Tower | Skyscrapers | Sign |
|---|---|---|---|---|---|---|---|---|
| Background Reconstruction⁺ [6] | 1.9 | 0.015¹ | 0.007² | 0.024⁴ | 0.009¹ | 0.009² | 0.010² | 0.007¹ |
| PFClean Remove Rig⁺ [7] | 2.0 | 0.017² | 0.007¹ | 0.004¹ | 0.011² | 0.015³ | 0.013³ | 0.009² |
| Planar Structure Guidanceⁱ [8] | 6.0 | 0.071⁶ | 0.105⁶ | 0.128⁶ | 0.060⁶ | 0.039⁶ | 0.081⁶ | 0.091⁶ |
| Nuke F_RigRemoval⁺ [9] | 2.6 | 0.020⁴ | 0.015⁴ | 0.006² | 0.014³ | 0.008¹ | 0.009¹ | 0.011³ |
| Telea Inpaintingⁱ [10] | 4.9 | 0.023⁵ | 0.065⁵ | 0.096⁵ | 0.027⁵ | 0.019⁵ | 0.016⁴ | 0.049⁵ |
| Complex Scenesᵐ [11] | 3.7 | 0.018³ | 0.011³ | 0.008³ | 0.018⁴ | 0.017⁴ | 0.023⁵ | 0.021⁴ |
⁺ Regions that weren't reconstructed by the algorithm were filled afterwards using Telea image inpainting [10].
ᵐ Owing to prohibitively high memory consumption, the test sequences were downscaled to a height of 720 pixels for this method.
ⁱ Image-inpainting algorithms.