Research

Optimizing Average Precision using Weakly Supervised Data

Aseem Behl, Pritish Mohapatra, C. V. Jawahar and M. Pawan Kumar

Overview

We develop a novel latent AP-SVM that minimizes a carefully designed upper bound on the AP-based loss function over a weakly supervised dataset. Our approach is based on the hypothesis that in the challenging setting of weakly supervised learning, it becomes crucial to optimize the right accuracy measure. Using publicly available datasets, we demonstrate the advantage of our approach over standard loss-based binary classifiers on challenging problems in computer vision.

Optimizing Average Precision

* Given a training dataset consisting of n samples X along with the ground-truth pairwise ranking matrix Y*, we would like to learn the parameters w of a classifier such that the average precision (AP) over the training dataset is maximized.

* In order to maximize the AP, we minimize a loss function defined as Δ(Y;Y*) = 1 - AP(Y;Y*), where AP(Y;Y*) denotes the AP of a predicted ranking matrix Y with respect to the true ranking Y* (a score-based sketch of this loss is given after this list).

* The additional annotations HP and HN for the positive and negative samples respectively are unknown. Since the best value of the additional annotation for each positive sample needs to be imputed automatically, we treat these as latent variables. However, since a negative sample remains negative regardless of the value of its additional annotation, we need to consider all possible values of HN during parameter estimation.
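To make the AP loss concrete, here is a minimal Python sketch that computes Δ = 1 - AP from raw classifier scores and binary labels; this score-based view is equivalent to the ranking-matrix notation above, and the function name and interface are illustrative rather than taken from the released code.

```python
import numpy as np

def ap_loss(scores, labels):
    """Return Delta = 1 - AP for the ranking induced by the scores.

    scores: classifier scores, one per sample (higher means ranked earlier).
    labels: ground-truth {0, 1} labels (1 = positive sample).
    """
    order = np.argsort(-scores)            # sort samples by decreasing score
    ranked = labels[order]
    hits = np.cumsum(ranked)               # positives retrieved up to each rank
    ranks = np.arange(1, len(ranked) + 1)
    pos = ranked == 1
    ap = np.mean(hits[pos] / ranks[pos])   # mean precision at the positive ranks
    return 1.0 - ap
```

For example, with scores (0.9, 0.8, 0.3) and labels (1, 0, 1), a negative image is interleaved between the two positives, giving AP = (1 + 2/3)/2 ≈ 0.83 and a loss of roughly 0.17.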

Our Method: Latent AP-SVM

The significant difference between the optimization problem of latent AP-SVM and that of the standard latent structured SVM (latent SSVM) is that the constraints of latent AP-SVM are specified using the same values of the additional annotations for both the correct and the incorrect ranking. In contrast, in the constraints of the latent SSVM problem, the values of the additional annotations for a correct and an incorrect ranking are independent of each other. This modification provides an intuitive prediction rule, yields a tighter upper bound on the AP loss, and lends itself to efficient optimization (a sketch of the two constraint structures is given below).
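The following LaTeX sketch writes out the two slack constraints using assumed notation (joint feature map Ψ, ground-truth ranking Y*, latent annotations HP and HN); it is a paraphrase of the description above rather than the exact formulation from the paper.

```latex
% Latent SSVM: the latent variables \hat{H} of the incorrect ranking and
% H of the correct ranking are maximised over independently of each other.
\xi \;\ge\; \max_{Y, \hat{H}} \Big[ \Delta(Y^{*}, Y)
        + w^{\top} \Psi(X, Y, \hat{H}) \Big]
        \;-\; \max_{H} \, w^{\top} \Psi(X, Y^{*}, H)

% Latent AP-SVM: the same annotation H_P of the positive samples scores
% both rankings; only the annotation H_N of the negatives is maximised over.
\xi \;\ge\; \min_{H_P} \, \max_{Y, H_N} \Big[ \Delta(Y^{*}, Y)
        + w^{\top} \Psi(X, Y, \{H_P, H_N\})
        - w^{\top} \Psi(X, Y^{*}, \{H_P, H_N\}) \Big]
```

Keeping HP shared between the two score terms in each constraint is what the text above credits with the tighter upper bound on the AP loss and the intuitive prediction rule.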

A local minimum or saddle point of the latent AP-SVM problem can be obtained using the concave-convex procedure (CCCP). The algorithm alternates between two steps. In the first step, it imputes the best additional annotations HP of the positive samples given the current estimate of the parameters. In the second step, given the imputed values of HP, it updates the parameters by solving the resulting convex optimization problem (see the sketch below).
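A minimal Python sketch of this alternation, assuming two hypothetical helpers, impute_positive_annotations (step one) and solve_convex_ap_svm (step two), which stand in for the sub-problems and do not correspond to functions in the released code:

```python
import numpy as np

def cccp_train(X, Y_star, w_init, max_iters=50, tol=1e-4):
    """Alternate between imputing HP and solving the convex sub-problem."""
    w = w_init
    prev_obj = np.inf
    for _ in range(max_iters):
        # Step 1: impute the best additional annotations HP of the
        # positive samples under the current parameter estimate w.
        H_P = impute_positive_annotations(X, Y_star, w)
        # Step 2: with HP fixed, the learning problem is convex in w;
        # update the parameters (e.g. with a cutting-plane solver).
        w, obj = solve_convex_ap_svm(X, Y_star, H_P, w)
        if prev_obj - obj < tol:   # stop once the objective stops improving
            break
        prev_obj = obj
    return w
```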

Results

We tested latent AP-SVM on two problems in computer vision that can be formulated as weakly supervised learning problems. Traditionally, the parameters for such problems are learned by optimizing a surrogate of the 0-1 loss (latent SVM) or the standard latent structured SVM upper bound (latent SSVM). We therefore compare latent AP-SVM with latent SVM and latent SSVM. The details of all the experiments are provided in our CVPR 2014 paper.

Action Classification

The input is a set of person images X along with the category of the action being performed in each image. The hidden variables H model the bounding box of the person in the image. We use the standard poselet-based activations to specify the features.

We use the PASCAL VOC 2011 action classification dataset, which consists of 4846 images depicting 10 action classes. The dataset is divided into two subsets: 2424 'trainval' images for which we are provided the bounding boxes of the persons in the image together with their action class; and 2422 'test' images for which we are only provided with the person bounding boxes. The putative values of each latent variable are restricted to boxes obtained by a standard person detector.

The figure below illustrates the importance of using the right accuracy measure. It shows a sub-sequence of images from two rankings predicted for the 'ridingbike' action, where the top and bottom rows come from latent SVM and latent AP-SVM respectively. Both rankings contain two correct and two incorrect images, but latent AP-SVM places the incorrect images at the lower ranks. In terms of 0-1 loss the two predictions are equivalent; however, the bottom ranking is preferred by a classifier that optimizes the AP loss.

The next figure shows the best mean average precision over all 10 action classes obtained during 5-fold cross-validation. The x-axis corresponds to the amount of supervision provided and the y-axis to the mean average precision. As the amount of supervision decreases, the gap in performance between latent AP-SVM and the baseline methods increases, illustrating the importance of using the correct loss function and the correct learning formulation for weakly supervised learning.

Character Recognition in Natural Images

The input is a set of word images X along with the characters present in each image. The hidden variables H model the bounding boxes of the characters in the image. We use histogram of oriented gradients (HOG) features.

We use the IIIT 5K-WORD scene text dataset, which consists of 5000 cropped word images from scene texts and born-digital images, divided into 2000 'trainval' images and 3000 'test' images. Each image is annotated with the corresponding word. We treat the bounding boxes of the characters as latent variables whose putative values are restricted to boxes obtained by a standard character detector. Using this dataset, we perform binary classification for the 22 classes that contain at least 150 samples in the 'trainval' set.

The figure below shows the best AP values for the 5 classes where the performance of latent AP-SVM is statistically different from that of latent SVM (using a paired t-test with p-value less than 0.05). Latent AP-SVM provides statistically significant improvements over latent SVM for 4 of these 5 classes. The next figure shows the AP values for the same 5 classes on the 'test' set. Consistent with the cross-validation results, latent AP-SVM outperforms latent SVM for 4 of the 5 classes.

The figure below shows the best AP values for the 4 classes where the performance of latent AP-SVM is statistically different from that of latent SSVM (using a paired t-test with p-value less than 0.05). Latent AP-SVM provides statistically significant improvements over latent SSVM for 3 of these 4 classes. The next figure shows the AP values for the same 4 classes on the 'test' set. Consistent with the cross-validation results, latent AP-SVM outperforms latent SSVM for 3 of the 4 classes.

Resources

All the code used in our experiments is available here.

Poselet features and meta files for PASCAL VOC 2011 action classification used in the experiments are available here and here respectively.

Please feel free to email me at aseem dot behl at gmail dot com in case of any difficulties while running the experiment.

Publications

Aseem Behl, C. V. Jawahar and M. Pawan Kumar
Optimizing Average Precision using Weakly Supervised Data
In Computer Vision and Pattern Recognition (CVPR), 2014

Aseem Behl, C. V. Jawahar and M. Pawan Kumar
Optimizing Average Precision using Weakly Supervised Data (supplementary material)
In Computer Vision and Pattern Recognition (CVPR), 2014

Aseem Behl, Pritish Mohapatra, C. V. Jawahar and M. Pawan Kumar
Optimizing Average Precision using Weakly Supervised Data
In Pattern Analysis and Machine Intelligence (PAMI), 2015
