Randomized Clustering Forests for Building Fast and Discriminative Visual Vocabularies
Frank Moosmann, Bill Triggs, Frederic Jurie
Image Retrieval and Classification Using Local Distance Functions
A. Frome, Y. Singer, J. Malik.
Multi-Task Feature Learning
Andreas Argyriou, Theodoros Evgeniou, Massimiliano Pontil
You can also watch Andreas Argyriou present his paper here.
Boosting Structured Prediction for Imitation Learning
Nathan Ratliff, David Bradley, Drew Bagnell, Joel Chestnutt
with the background paper:
Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling
by F. Jiao, S. Wang, C. Lee, R. Greiner and D. Schuurmans
and another background paper:
Semi-Supervised Learning by Entropy Minimization
by Y. Grandvalet and Y. Bengio, from NIPS 2004
Analysis of Contour Motions,
Ce Liu, William T. Freeman, and Edward H. Adelson
In addition, I will likely draw on the following papers, which are also related to motion of boundaries and occlusion events:
This talk comes with a guarantee: once it's done, you'll be able to go back to your office or cube and implement a Dirichlet Process Mixture Model on your own---or your money back!
I will cover topics from some of the following papers---the first is a terrific reference, and the rest can serve as a "seed bibliography" on the subject:
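And to back up that guarantee a little, here is a minimal sketch (my own, not taken from any of the papers) of the Chinese Restaurant Process that sits at the heart of a DPMM: each point joins an existing cluster with probability proportional to the cluster's size, or opens a new one with probability proportional to the concentration alpha. A full DPMM Gibbs sampler would additionally weight each choice by the data likelihood under that cluster.

```python
import numpy as np

def crp_partition(n_points, alpha, seed=0):
    """Sample a partition of n_points items from a CRP with concentration alpha."""
    rng = np.random.default_rng(seed)
    assignments = [0]      # the first customer opens table 0
    counts = [1]           # customers seated at each table
    for _ in range(1, n_points):
        # existing tables are chosen in proportion to their size,
        # a new table in proportion to alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = int(rng.choice(len(probs), p=probs))
        if table == len(counts):
            counts.append(1)          # open a new cluster
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(crp_partition(20, alpha=1.0))   # e.g. [0, 0, 1, 0, 2, ...]
```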
The talk will draw from the following papers (listed in reverse chronological order):
This paper provides constraint-based algorithms for learning Bayesian network structures from data that require only a polynomial number of conditional independence (CI) tests. The usual exponential growth in the number of CI tests is avoided through some nice heuristics.
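For a feel of what "constraint-based" means here, below is a hedged sketch of the skeleton-discovery step used by PC-style algorithms, the family this paper improves on; the ci_test callback is a placeholder for your favorite conditional independence test, and the paper's own heuristics for keeping the test count polynomial are not reproduced.

```python
from itertools import combinations

def pc_skeleton(variables, ci_test, max_cond_size=3):
    """Remove edges between variables judged conditionally independent.
    ci_test(x, y, z) should return True when x is independent of y given set z."""
    adj = {v: set(variables) - {v} for v in variables}   # start fully connected
    for k in range(max_cond_size + 1):                   # grow conditioning sets
        for x in variables:
            for y in list(adj[x]):
                others = adj[x] - {y}
                if len(others) < k:
                    continue
                for z in combinations(sorted(others), k):
                    if ci_test(x, y, set(z)):
                        adj[x].discard(y)                # drop the edge both ways
                        adj[y].discard(x)
                        break
    return adj
```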
Abstract: Belief propagation over pairwise connected Markov Random Fields has become a widely used approach, and has been successfully applied to several important computer vision problems. However, pairwise interactions are often insufficient to capture the full statistics of the problem. Higher-order interactions are sometimes required. Unfortunately, the complexity of belief propagation is exponential in the size of the largest clique. In this paper, we introduce a new technique to compute belief propagation messages in time linear with respect to clique size for a large class of potential functions over real-valued variables.
We demonstrate this technique in two applications. First, we perform efficient inference in graphical models where the spatial prior of natural images is captured by 2x2 cliques. This approach shows significant improvement over the commonly used pairwise-connected models, and may benefit a variety of applications using belief propagation to infer images or range images. Finally, we apply these techniques to shape-from-shading and demonstrate significant improvement over previous methods, both in quality and in flexibility.
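As a point of reference for the abstract above, here is a plain sum-product loopy BP sketch for pairwise MRFs (my own baseline code, not the paper's linear-time higher-order method); note that the message update marginalizes over the sender's labels, which is exactly the step that blows up exponentially for larger cliques.

```python
import numpy as np

def loopy_bp(unary, edges, pairwise, n_iters=30):
    """unary: {node: (L,) array}; edges: [(i, j), ...];
    pairwise: {(i, j): (L_i, L_j) array}. Returns normalised beliefs per node."""
    msgs = {}
    for (a, b) in edges:
        msgs[(a, b)] = np.ones(len(unary[b]))
        msgs[(b, a)] = np.ones(len(unary[a]))
    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            # product of i's unary and all messages into i, except the one from j
            h = unary[i] * np.prod([msgs[(k, t)] for (k, t) in msgs
                                    if t == i and k != j], axis=0)
            psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            m = psi.T @ h                    # marginalise out node i's label
            new[(i, j)] = m / m.sum()        # normalise for numerical stability
        msgs = new
    beliefs = {i: unary[i] * np.prod([msgs[(k, t)] for (k, t) in msgs if t == i],
                                     axis=0) for i in unary}
    return {i: b / b.sum() for i, b in beliefs.items()}
```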
Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
Neil D. Lawrence
WiFi-SLAM Using Gaussian Process Latent Variable Models
Brian Ferris, Dieter Fox, Neil Lawrence
Gaussian Process Dynamical Models
Jack M. Wang, David J. Fleet, Aaron Hertzmann
3D People Tracking with Gaussian Process Dynamical Models
Raquel Urtasun, David J. Fleet, Pascal Fua
Inferring Temporal Order of Images From 3D Structure
Grant Schindler, Sing Bing Kang, Frank Dellaert
Abstract:
In this paper, we describe a technique to temporally sort a collection of photos that span many years. By reasoning about persistence of visible structures, we show how this sorting task can be formulated as a constraint satisfaction problem (CSP). Casting this problem as a CSP allows us to efficiently find a suitable ordering of the images despite the large size of the solution space (factorial in the number of images) and the presence of occlusions. We present experimental results for photographs of a city acquired over a one hundred year period.
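To make the CSP formulation concrete, here is a toy backtracking sketch (my simplification, not the authors' algorithm): assume obs[i][s] records whether structure s is visible in image i, and require that each structure's visible images form one contiguous block in the recovered ordering, a crude stand-in for the paper's persistence reasoning.

```python
def consistent(order, obs, n_structs):
    """Check that every structure's visible images are contiguous in the prefix."""
    for s in range(n_structs):
        seen = [obs[i][s] for i in order]
        visible = [k for k, v in enumerate(seen) if v]
        if visible and not all(seen[visible[0]:visible[-1] + 1]):
            return False                      # a gap inside the visible span
    return True

def order_images(n_images, obs, n_structs, order=()):
    """Depth-first search over orderings, pruning inconsistent prefixes."""
    if len(order) == n_images:
        return list(order)
    for i in range(n_images):
        if i not in order and consistent(order + (i,), obs, n_structs):
            result = order_images(n_images, obs, n_structs, order + (i,))
            if result is not None:
                return result
    return None
```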
The paper will be distributed by email. Please do not redistribute!
Abstract:
Within a computer vision context color naming is the action of assigning linguistic color labels to image pixels. In general, research on color naming applies the following paradigm: a collection of color chips is labelled with color names within a well-defined experimental setup by multiple test subjects. The collected data set is subsequently used to label RGB values in real-world images with a color name. Apart from the fact that this collection process is time consuming, it is unclear to what extent color naming within a controlled setup is representative for color naming in real-world images. In this paper, we propose to learn color names from real-world images. We avoid test subjects by using Google Image to collect a data set. From the data set color names can be learned using a PLSA model tailored to this task. Experimental results show that color names learned from real-world images significantly outperform color names learned from labelled color chips on retrieval and classification.
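Since the abstract leans on PLSA, here is a generic EM trainer for it (the textbook updates, not the paper's tailored variant): treat each image as a "document" and quantised pixel colors as "words", then alternate between computing topic responsibilities and re-estimating the conditionals.

```python
import numpy as np

def plsa(counts, n_topics, n_iters=50, seed=0):
    """counts: (docs x words) co-occurrence matrix. Returns P(w|z), P(z|d)."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_w_z = rng.random((n_topics, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((D, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z | d, w), shape (D, W, Z)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post = joint / joint.sum(axis=2, keepdims=True).clip(1e-12)
        # M-step: re-estimate the conditionals from expected counts
        expected = counts[:, :, None] * post
        p_w_z = expected.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = expected.sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_w_z, p_z_d
```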
Abstract:
In this paper we introduce and experiment with a framework for learning local perceptual distance functions for visual recognition. We learn a distance function for each training image as a combination of elementary distances between patch-based visual features. We apply these combined local distance functions to the tasks of image retrieval and classification of novel images. On the Caltech 101 object recognition benchmark, we achieve 60.3% mean recognition across classes using 15 training images per class, which is better than the best published performance by Zhang et al.
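The core of the approach is easy to sketch: per focal image, learn a non-negative weighting over elementary patch distances from triplet constraints. Below is my own toy version with a hinge-style update; the paper uses a proper max-margin solver.

```python
import numpy as np

def learn_local_weights(elem_dists, triplets, lr=0.1, n_epochs=20):
    """elem_dists[i]: vector of elementary distances from the focal image to
    image i; triplets: (similar, dissimilar) index pairs for the focal image."""
    w = np.zeros(elem_dists.shape[1])
    for _ in range(n_epochs):
        for sim, dis in triplets:
            # want: w . d(focal, dissimilar) >= w . d(focal, similar) + 1
            margin = w @ elem_dists[dis] - w @ elem_dists[sim]
            if margin < 1.0:
                w += lr * (elem_dists[dis] - elem_dists[sim])
        w = np.maximum(w, 0.0)    # non-negative weights keep it distance-like
    return w
```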
Approximate Nearest Subspace Search with Applications to Pattern Recognition
by Ronen Basri, Tal Hassner, and Lihi Zelnik-Manor
Abstract:
Linear and affine subspaces are commonly used to describe the appearance of objects under different lighting, viewpoint, articulation, and identity. A natural problem arising from their use is: given a query image portion represented as a point in some high dimensional space, find a subspace near to the query. This paper presents an efficient solution to the approximate nearest subspace problem for both linear and affine subspaces. Our method is based on a simple reduction to the problem of nearest point search, and can thus employ tree based search or locality sensitive hashing to find a near subspace. Further speedup may be achieved by using random projections to lower the dimensionality of the problem. We provide theoretical proofs of correctness and error bounds of our construction and demonstrate its capabilities on synthetic and real data. Our experiments demonstrate that an approximate nearest subspace can be located significantly faster than the exact nearest subspace, while at the same time it can find better matches compared to a similar search on points, in the presence of variations due to viewpoint, lighting, etc.
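For intuition, the exact problem being approximated is cheap to state in code: the distance from a query point to a linear subspace is the norm of the residual after projecting onto an orthonormal basis. The naive exhaustive search below is the baseline that the paper's reduction to nearest-point search speeds up.

```python
import numpy as np

def dist_to_subspace(q, basis):
    """basis: (d, k) matrix with orthonormal columns spanning the subspace."""
    residual = q - basis @ (basis.T @ q)    # remove the in-span component
    return float(np.linalg.norm(residual))

def nearest_subspace_exact(q, bases):
    """Exhaustive scan over a list of (d, k) orthonormal bases."""
    return min(range(len(bases)), key=lambda i: dist_to_subspace(q, bases[i]))
```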
I'll discuss parts of these three papers (not the hardware, except for the basic theory):
And if you're especially interested, you can also look at:
Confocal Stereo
(Project Page)
by Samuel W. Hasinoff and Kiriakos N. Kutulakos
This paper received an Honorable Mention for the Longuet-Higgins Best Paper Award.
Abstract:
We present confocal stereo, a new method for computing 3D shape by controlling the focus and aperture of a lens. The method is specifically designed for reconstructing scenes with high geometric complexity or fine-scale texture. To achieve this, we introduce the confocal constancy property, which states that as the lens aperture varies, the pixel intensity of a visible in-focus scene point will vary in a scene-independent way that can be predicted by prior radiometric lens calibration. The only requirement is that incoming radiance within the cone subtended by the largest aperture is nearly constant. First, we develop a detailed lens model that factors out the distortions in high resolution SLR cameras (12MP or more) with large-aperture lenses (e.g., f/1.2). This allows us to assemble an AxF aperture-focus image (AFI) for each pixel, which collects the undistorted measurements over all A apertures and F focus settings. In the AFI representation, confocal constancy reduces to color comparisons within regions of the AFI, and leads to focus metrics that can be evaluated separately for each pixel. We propose two such metrics and present initial reconstruction results for complex scenes, as well as for a scene with known ground-truth shape.
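A cartoon of the resulting per-pixel focus metric (my simplification, not the paper's two metrics): given a pixel's radiometrically corrected AFI, the correct focus setting is the one whose column is most nearly constant across apertures.

```python
import numpy as np

def best_focus(afi):
    """afi: (A apertures, F focus settings) corrected intensities for one pixel."""
    spread = afi.var(axis=0)            # variation across apertures, per focus
    return int(np.argmin(spread))       # confocal constancy: pick the flattest
```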
First:
The Identity Management Kalman Filter (IMKF)
B. Schumitsch, S. Thrun, L. Guibas, K. Olukotun
Abstract: Tracking posterior estimates for problems with data association uncertainty is one of the big open problems in the literature on filtering and tracking. This paper presents a new filter for online tracking of many individual objects with data association ambiguities. It tightly integrates the continuous aspects of the problem -- locating the objects -- with the discrete aspects -- the data association ambiguity. The key innovation is a probabilistic information matrix that efficiently performs identity management, that is, it links entities with internal tracks of the filter, enabling it to maintain a full posterior over the system amid data association uncertainties. The filter scales quadratically in complexity, just like a conventional Kalman filter. We derive the algorithm formally and present large-scale results.
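To preview just the discrete half of the idea, here is a loose sketch (mine, omitting the continuous Kalman state entirely) of matrix-based identity management: keep a weight matrix of tracks against identities, blur the rows of tracks that may have swapped, and sharpen a row when an identity is directly observed.

```python
import numpy as np

def mix_tracks(ident, i, j):
    """Tracks i and j came close and may have swapped: average their rows."""
    avg = 0.5 * (ident[i] + ident[j])
    ident[i] = ident[j] = avg
    return ident

def observe_identity(ident, track, identity, strength=5.0):
    """Local evidence that `track` carries `identity`: reweight and renormalise."""
    ident[track, identity] *= strength
    ident[track] /= ident[track].sum()
    return ident
```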
Second, if I get to it:
Multi-object tracking with representations of the symmetric group.
R. Kondor, A. Howard and T. Jebara, AISTATS 2007.
1. A. Hoogs, R. Collins, B. Kaucic and J. Mundy. A Common Set of Perceptual Observables for Grouping, Figure-Ground Discrimination and Texture Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, Special Section on Perceptual Organization in Computer Vision, 25(4), 2003.
2. J. Kaufhold and A. Hoogs. Learning to Segment Images Using Region-Based Perceptual Features. CVPR 2004.
3. S. Sarkar and P. Soundararajan. Supervised Learning of Large Perceptual Organization: Graph Spectral Partitioning and Learning Automata. PAMI, 22(5), 2000.
We propose to use high-level visual information to improve illuminant estimation. Several illuminant estimation approaches are applied to compute a set of possible illuminants. For each of them, an illuminant color corrected image is evaluated on the likelihood of its semantic content: is the grass green, the road grey, and the sky blue, in correspondence with our prior knowledge of the world? The illuminant resulting in the most likely semantic composition of the image is selected as the illuminant color. To evaluate the likelihood of the semantic content, we apply probabilistic latent semantic analysis. The image is modelled as a mixture of semantic classes, such as sky, grass, road, and building. The class description is based on texture, position, and color information. Experiments show that the use of high-level information improves illuminant estimation over a purely bottom-up approach. Furthermore, the proposed method is shown to significantly improve semantic class recognition performance.
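The selection loop itself is simple; here is a schematic (function names are placeholders, not the authors' code): run each estimator, correct the image under that hypothesis, score the corrected image with a pre-trained semantic model, and keep the winner.

```python
def pick_illuminant(image, estimators, correct, semantic_likelihood):
    """Choose the candidate illuminant whose corrected image looks most plausible."""
    best, best_score = None, float("-inf")
    for estimate in (e(image) for e in estimators):
        corrected = correct(image, estimate)       # divide out the color cast
        score = semantic_likelihood(corrected)     # is grass green, sky blue?
        if score > best_score:
            best, best_score = estimate, score
    return best
```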
The highlight will be an ICCV '07 paper [1] that uses a clever and simple math trick to give a non-iterative O(n) solution to the problem, as accurate as or more accurate than state-of-the-art methods that are O(n^5) or more! This should be of particular interest to people doing sensor calibration and model-based pose estimation and tracking. A stripped-down sketch of the trick appears after the references below.
[1] Accurate Non-iterative O(n) Solution to the PnP Problem, F. Moreno-Noguer, V. Lepetit and P. Fua, ICCV '07 preprint
and its closest competitor:
[2]
Fast and Globally Convergent Pose Estimation from Video Images,
C.-P. Lu, G. Hager and E. Mjolsness, PAMI 2000
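Here is the promised stripped-down sketch of the trick in [1], as I understand it (many details omitted: the choice of betas when the null space has several dimensions, scale recovery, the Gauss-Newton polish). Each 3D point is written as barycentric weights over four control points, so the projection equations become linear in the control points' camera-frame coordinates, and one SVD does the work, hence O(n).

```python
import numpy as np

def control_points_camera_frame(alphas, uv, fx, fy, cx, cy):
    """alphas: (n, 4) barycentric weights of each 3D point w.r.t. the four
    control points; uv: (n, 2) pixel coordinates; needs n >= 6."""
    M = np.zeros((2 * len(uv), 12))
    for i, ((u, v), a) in enumerate(zip(uv, alphas)):
        for j in range(4):
            # rows of the linear system M x = 0, x = 12 control-point coords
            M[2 * i, 3 * j:3 * j + 3] = [a[j] * fx, 0.0, a[j] * (cx - u)]
            M[2 * i + 1, 3 * j:3 * j + 3] = [0.0, a[j] * fy, a[j] * (cy - v)]
    _, _, vt = np.linalg.svd(M)
    return vt[-1].reshape(4, 3)     # null vector: control points up to scale
```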
One is an ICCV 2007 paper:
Till Quack, Vittorio Ferrari, Bastian Leibe and Luc Van Gool,
Efficient Mining of Frequent and Distinctive Feature Configurations
(to appear) ICCV 2007, Rio de Janeiro, Brazil
The other is:
Till Quack, Vittorio Ferrari, Luc Van Gool,
Video Mining with Frequent Itemset Configurations.
CIVR 2006, Tempe, AZ, USA, July 2006
Learning Graph Matching, Tiberio Caetano, Li Cheng, Quoc Le, Alex Smola, ICCV 2007.
I'll be giving a practice job talk about my work on using volumetric features for event detection. It's based on the following papers:
Abstract:
Real-world actions often occur in crowded, dynamic environments. This poses a difficult challenge for current approaches to video event detection because it is difficult to segment the actor from the background due to distracting motion from other objects in the scene. We propose a technique for event recognition in crowded videos that reliably identifies actions in the presence of partial occlusion and background clutter. Our approach is based on three key ideas: (1) we efficiently match the volumetric representation of an event against over-segmented spatio-temporal video volumes; (2) we augment our shape-based features using flow; (3) rather than treating an event template as an atomic entity, we separately match by parts (both in space and time), enabling robustness against occlusions and actor variability. Our experiments on human actions, such as picking up a dropped object or waving in a crowd, show reliable detection with few false positives.
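A cartoon of idea (3), matching by parts (my own distillation, not the paper's code): score each part template within a small spatio-temporal search window around its expected position and sum the best responses, so an occluded part only degrades the score locally.

```python
import numpy as np

def part_score(volume, part, center, window=2):
    """Best correlation of one part template near its expected (t, y, x) center."""
    t, y, x = center
    dt, dy, dx = part.shape
    best = -np.inf
    for ot in range(-window, window + 1):
        for oy in range(-window, window + 1):
            for ox in range(-window, window + 1):
                t0, y0, x0 = t + ot, y + oy, x + ox
                if min(t0, y0, x0) < 0:
                    continue
                sub = volume[t0:t0 + dt, y0:y0 + dy, x0:x0 + dx]
                if sub.shape == part.shape:
                    best = max(best, float((sub * part).sum()))
    return best

def event_score(volume, parts_with_centers, window=2):
    """Sum of per-part best matches instead of one monolithic template score."""
    return sum(part_score(volume, p, c, window) for p, c in parts_with_centers)
```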
AppWand: Editing Measured Materials using Appearance-Driven Optimization, by Fabio Pellacini and Jason Lawrence. SIGGRAPH 2007
3D generic object categorization, localization, and pose estimation
S. Savarese and L. Fei-Fei
This paper is about building 3D part-based models of object categories. An object is composed of a collection of 2D "canonical part" images (e.g., a car's bumper) linked to each other through coordinate transforms: imagine arranging polaroids of object parts on a sphere surrounding the object. The technique learns the model of a class in an unsupervised way from still images of instances of the class. I like this model because I think the brain represents objects in a similar way, and if I have time to prepare the information I might say something about that.
A projection model is presented for cameras moving at constant velocity (which we refer to as Galilean cameras). We introduce the concept of spacetime projection and show that perspective imaging and linear pushbroom imaging are specializations of the proposed model. The epipolar geometry between two such cameras is developed, and we derive the Galilean fundamental matrix. We show how six different "fundamental" matrices, including the classic fundamental matrix, the Linear Pushbroom fundamental matrix, and a fundamental matrix relating Epipolar Plane Images, can be recovered directly from the Galilean fundamental matrix. To estimate the parameters of this fundamental matrix, and the mapping between videos in the case of planar scenes, we describe linear algorithms and report their experimental performance.
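The linear estimation the abstract mentions parallels the classic eight-point algorithm; a minimal, unnormalised version for the ordinary fundamental matrix is sketched below (for real data you would add Hartley normalisation, and the Galilean variant would use the paper's spacetime coordinates rather than (x, y, 1)).

```python
import numpy as np

def eight_point(x1, x2):
    """x1, x2: (n, 2) matched points, n >= 8. Returns F with x2h^T F x1h = 0."""
    A = np.array([[u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, 1.0]
                  for (u1, v1), (u2, v2) in zip(x1, x2)])
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)
    # enforce rank 2, the defining property of a fundamental matrix
    u, s, v = np.linalg.svd(F)
    return u @ np.diag([s[0], s[1], 0.0]) @ v
```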