Generalized Low-Rank Transfer Subspace Learning

Ming Shao, Dmitry Kit, and Yun Fu

On this page, we present our work published in [1] and [2], which aims at building a generalized framework for transfer learning in a learned subspace.

 

Why Knowledge Transfer

Visual classification tasks often suffer from insufficient labeled data because such data are either too costly to obtain or too expensive to hand-label. For that reason, researchers use labeled, yet relevant, data from different databases to facilitate the learning process. A common assumption is that consistency exists between the training and test data, meaning that they should have similar distributions or shared subspaces. This assumption often fails, especially in complex applications. Below are a few examples:

  • In image annotation, due to high labor cost, we expect to reuse images that have already been annotated. However, test images from the target domain are either obtained under different conditions (e.g., different capture devices) or contain novel objects unseen in the training dataset. Even in this situation, we would like to be able to leverage all the annotated images.
  • In sentiment analysis, analysts manually label large numbers of documents, but that labeled set is still tiny compared to the set of documents that need to be classified. These test documents can use different vocabularies and contain different topics. Despite the syntactic differences, we would still like to leverage all the manually labeled documents in classifying the new documents.
  • In face recognition, the task is to infer the identity of a person, but often only a few face images are available in the reference set. We are interested in using a large number of available face images from other datasets for training, and transferring the learned knowledge to the target dataset.

 

How to Transfer

In this project, we discuss a method for projecting both source and target data into a generalized subspace where each target sample can be represented by some combination of source samples. By employing a low-rank constraint during this transfer, the structures of the source and target domains are preserved. This approach has three benefits. First, good alignment between the domains is ensured, since only relevant data in some subspace of the source domain are used to reconstruct the data in the target domain. Second, the discriminative power of the source domain is naturally passed on to the target domain. Third, noisy information is filtered out during knowledge transfer. The whole framework is illustrated in Figure 1:

[Figure 1: Overview of the proposed low-rank transfer subspace learning framework.]

The basic assumption behind our idea is that if each datum in a specific neighborhood of the target domain can be reconstructed by data from the same neighborhood in the source domain, then the source and target data are likely to have similar distributions. In other words, the reconstruction of each target datum is no longer independent; rather, reconstructions of target data within a neighborhood should draw on data from the corresponding neighborhood in the source domain. We show this in Figure 3. This locality-aware reconstruction, which has been widely explored in manifold learning (e.g., LLE), guarantees that the source and target data have similar geometric properties.

[Figure 3: Locality-aware reconstruction: target data in a neighborhood are reconstructed by source data from the same neighborhood.]

 

Proposed Models

The proposed model can be briefly stated as follows: Given test data from a union of multiple subspaces in the target domain and training data from a union of multiple subspaces in the source domain, the goal is to find a discriminative subspace, spanned by a projection matrix $P$, where each test datum can be linearly represented by the data from some subspace in the source domain. That is:

$$\min_{P,\,Z}\; F(P, X_S) + \lambda\,\operatorname{rank}(Z) \quad \text{s.t.}\quad P^{\top}X_T = P^{\top}X_S Z,$$

where $F(P, X_S)$ is a general subspace learning function, $P$ is the projection matrix, and $Z$ is the coefficient matrix for the reconstruction. However, there might be outliers that cannot be appropriately represented by the source data, as shown in the figure above. Therefore, we add an error term $E$ to account for the data samples that are far from the majority of the source data, namely,

$$\min_{P,\,Z,\,E}\; F(P, X_S) + \lambda_1\,\operatorname{rank}(Z) + \lambda_2\,\|E\|_{2,1} \quad \text{s.t.}\quad P^{\top}X_T = P^{\top}X_S Z + E.$$

The discrete nature of the rank function makes this problem hard to solve. As suggested by [3], we can solve its convex surrogate instead, replacing the rank with the nuclear norm, and reformulate the original problem as:

$$\min_{P,\,Z,\,E}\; F(P, X_S) + \lambda_1\,\|Z\|_* + \lambda_2\,\|E\|_{2,1} \quad \text{s.t.}\quad P^{\top}X_T = P^{\top}X_S Z + E.$$

To ensure that the target data are reconstructed in the affine space of the source data, that the model fits general subspace learning methods, and that the solution set is closed, we add three more constraints in addition to the original one:

$$\mathbf{1}^{\top}Z = \mathbf{1}^{\top}, \qquad P^{\top}U_2 P = I, \qquad P^{\top}P = I.$$

Note that the term $U_2$ is specifically useful in different subspace learning methods, e.g., PCA, LDA, LPP, NPE, MFA, and DLA, all of which can be unified under the graph embedding framework [4]. By changing $F(P, X_S)$, i.e., the pair $(U_1, U_2)$, we can implement different transfer subspace learning methods. The formulations of $U_1$ and $U_2$ can be found in Table 1:

[Table 1: Formulations of $U_1$ and $U_2$ for PCA, LDA, LPP, NPE, MFA, and DLA under the graph embedding framework.]
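To make the graph-embedding plug-in concrete, here is a minimal NumPy/SciPy sketch, under our own naming, that builds one common $(U_1, U_2)$ instantiation (LPP: $U_1 = XLX^{\top}$, $U_2 = XDX^{\top}$) and solves the resulting generalized eigenvalue problem for $P$. It illustrates the framework of [4], not our exact implementation:

```python
import numpy as np
from scipy.linalg import eigh

def lpp_matrices(X, n_neighbors=5, sigma=1.0):
    """Build the LPP pair (U1, U2) under the graph embedding view:
    U1 = X L X^T with graph Laplacian L, and U2 = X D X^T with the
    degree matrix D.  Samples are the columns of X."""
    n = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise dists
    W = np.exp(-d2 / (2.0 * sigma ** 2))                     # heat-kernel weights
    far = np.argsort(d2, axis=1)[:, n_neighbors + 1:]        # beyond k-NN (incl. self)
    for i in range(n):
        W[i, far[i]] = 0.0
    W = np.maximum(W, W.T)                                   # symmetrize the graph
    D = np.diag(W.sum(axis=1))
    L = D - W
    return X @ L @ X.T, X @ D @ X.T

def solve_subspace(U1, U2, dim):
    """min_P tr(P^T U1 P) s.t. P^T U2 P = I, solved as a generalized
    eigenvalue problem; the eigenvectors of the smallest eigenvalues
    form the columns of P."""
    evals, evecs = eigh(U1, U2 + 1e-6 * np.eye(len(U2)))     # small ridge for stability
    return evecs[:, :dim]
```

For example, `P = solve_subspace(*lpp_matrices(Xs), dim=30)` would give a 30-dimensional LPP subspace of the source data; swapping in another $(U_1, U_2)$ pair from Table 1 changes the subspace learning method.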

The Augmented Lagrange Multiplier (ALM) method can solve the above problem efficiently. For details, readers can refer to [3].
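To give a feel for the solver, below is a minimal sketch of the inexact ALM inner loop for the convex problem in $(Z, E)$ with $P$ held fixed, i.e., with $A = P^{\top}X_T$ and $B = P^{\top}X_S$, following the LRR-style updates in [3]. The alternation over $P$ and the extra constraints above belong to the full algorithm and are omitted here; all function and parameter names are ours:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def l21_prox(M, tau):
    """Column-wise shrinkage: prox of tau * ||.||_{2,1}."""
    norms = np.linalg.norm(M, axis=0)
    return M * np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)

def low_rank_transfer(A, B, lam=0.1, rho=1.1, mu=1e-2, mu_max=1e6,
                      tol=1e-6, max_iter=500):
    """Solve  min ||Z||_* + lam * ||E||_{2,1}  s.t.  A = B Z + E
    by inexact ALM with an auxiliary variable J for Z, as in [3]."""
    d, nt = A.shape
    ns = B.shape[1]
    Z = np.zeros((ns, nt)); J = Z.copy(); E = np.zeros((d, nt))
    Y1 = np.zeros_like(A); Y2 = np.zeros_like(Z)
    BtB_inv = np.linalg.inv(np.eye(ns) + B.T @ B)   # precomputed, mu-independent
    for _ in range(max_iter):
        J = svt(Z + Y2 / mu, 1.0 / mu)              # nuclear-norm step
        Z = BtB_inv @ (B.T @ (A - E) + J + (B.T @ Y1 - Y2) / mu)
        E = l21_prox(A - B @ Z + Y1 / mu, lam / mu) # column-sparse error step
        R1 = A - B @ Z - E; R2 = Z - J              # constraint residuals
        Y1 += mu * R1; Y2 += mu * R2                # dual updates
        mu = min(rho * mu, mu_max)
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break
    return Z, E
```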

 

Experimental Results

We conduct two sets of experiments to visually and quantitatively demonstrate the effectiveness of the proposed framework, namely:

  • Face recognition application
  • Visual domain adaptation for object recognition

In face recognition, three experiments are conducted: kinship verification; cross-database face recognition between Yale B and CMU PIE; and heterogeneous knowledge transfer. Sample faces from these databases can be found in Figure 7.

[Figure 7: Sample faces from the databases used in our experiments.]

Face recognition application

UB KinFace Ver2.0 is adopted for our kinship verification experiment; it consists of 600 images of 400 people, which can be separated into 200 groups. Each group is composed of child, young-parent, and old-parent images, and the image size is 127×100. Kinship verification can be described as follows: given a pair of images, determine whether the people in the images have a kin relationship. Features in this experiment are extracted by Gabor filters in 8 directions and 5 scales. Five-fold cross-validation is used. In each test, features of 160 young-parent and child images are selected for the source domain (Xs) and 160 old-parent and child images are used as the target domain (Xt). Note that we use the absolute value of the feature difference between the child and the parent as the input to kinship verification. The difference between two people from the same family is used as a positive sample, while that from different families is used as a negative sample. Naturally, there can be many more negative samples than positive ones, and in our configuration we set the ratio to 160:640. We use the remaining 40 old-parent and child images as test samples, with a positive-to-negative ratio of 40:40. Experimental results are shown in Figure 9 and Table 2.
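The pair construction above can be sketched as follows; the helper name is hypothetical and the Gabor feature extraction step is omitted:

```python
import numpy as np

def kinship_pairs(child_feats, parent_feats, neg_per_pos=4, rng=None):
    """Build verification samples as |f_child - f_parent|, where
    child_feats[i] and parent_feats[i] come from the same family.
    Same-family pairs are positives; children paired with parents
    from other families are negatives, 4 per positive to match the
    160:640 ratio above."""
    rng = np.random.default_rng(rng)
    n = len(child_feats)
    X, y = [], []
    for i in range(n):
        X.append(np.abs(child_feats[i] - parent_feats[i])); y.append(1)
        # sample mismatched parents from other families as negatives
        others = rng.choice([j for j in range(n) if j != i],
                            size=neg_per_pos, replace=False)
        for j in others:
            X.append(np.abs(child_feats[i] - parent_feats[j])); y.append(0)
    return np.stack(X), np.array(y)
```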

[Figure 9: Kinship verification results.]

[Table 2: Kinship verification results.]

The Yale B database contains face images from 38 subjects, each captured under 64 different illuminations and several camera views. For our experiments, we only use frontal views with natural expressions taken under different illumination conditions; in total there are 2432 images. Similarly, CMU PIE includes 68 subjects, each under 21 different lighting conditions with environmental lights on or off. In our experiments, only frontal facial images with neutral expressions under the 21 illuminations with environmental lights off are used. We crop the images to 30×30 and use only the raw images as input. There are two experiments in this sub-section. First, we use the Yale B database as the source domain and CMU PIE as the target domain, which we call Y2P. In the second experiment, we use CMU PIE and Yale B as the source and target domains, respectively, and denote it as P2Y. Note that only one frontal face of each subject from the target domain is used as the reference in these experiments. Specifically, in Y2P, all 38 subjects' images in Yale B are Xs, while all 68 subjects' images in CMU PIE are Xt. In P2Y, we switch Xs and Xt. For either Y2P or P2Y, we randomly select one reference image per subject for each experiment, and repeat this five times. Average performances are reported in Tables 3 and 4.

[Table 3: Average recognition performance for Y2P.]

[Table 4: Average recognition performance for P2Y.]
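The classifier used with a single reference image per subject is not spelled out on this page; a minimal sketch of a natural choice, nearest neighbor in the learned subspace, is given below under our own naming:

```python
import numpy as np

def one_nn_accuracy(P, X_ref, y_ref, X_test, y_test):
    """Project faces with the learned P (samples as columns) and label
    each test face by its nearest reference in the subspace; one
    reference image per subject, as in the Y2P/P2Y protocol."""
    R, T = P.T @ X_ref, P.T @ X_test
    d2 = ((T[:, :, None] - R[:, None, :]) ** 2).sum(axis=0)  # (test, ref) distances
    pred = np.asarray(y_ref)[np.argmin(d2, axis=1)]
    return float((pred == np.asarray(y_test)).mean())
```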

Heterogeneous images are ubiquitous in real-world applications where a single object is recorded through different devices. One case is when people use near-infrared cameras to capture facial images in surveillance tasks, to address the illumination problems of visible light. However, an unbalanced training and testing situation occurs: we often have a large amount of visible-light (VIS) face images while only a few labeled near-infrared (NIR) face images are available. We show that the proposed LTSL can be applied to this scenario by using VIS images as the source and NIR images as the target, and call this problem VIS2NIR. We use the BUAA-VISNIR database (Figure 7 (d)) for this heterogeneous knowledge transfer experiment and select images from it as our source and target images. We crop the images to a size of 30×30, randomly select 75 subjects and their corresponding VIS images as source data, and use the remaining 75 subjects and their corresponding NIR images as target data. Since each subject has 9 images with different poses and expressions, there are a total of 675 facial images in the source domain and 675 in the target domain, with no identity overlap. We use both unlabeled/labeled source data and unlabeled target data for training, and only one image per subject in the target domain as the reference. We repeat the test five times, choosing different reference images each time. Average performance is shown in Table 5.

[Table 5: Average recognition performance for VIS2NIR.]

 

Visual Domain Adaptation

In this section, we demonstrate that the proposed LTSL framework can be applied to visual domain adaptation for object recognition. We run our experiments on a dataset comprising images from four different databases, namely Amazon, Caltech-256, DSLR, and Webcam, known as the 4DA domain adaptation dataset. In 4DA, 10 common categories (rather than 31) are selected from these 4 databases, with 8 to 151 images per category per domain, for a total of 2533 images. We strictly follow the configuration of [5]: 20 images per category from Amazon, Caltech-256, and Webcam, and 8 images per category from DSLR, are randomly selected when they serve as source domains, while 3 images per category are randomly selected when they serve as target domains. Example laptop images from the four domains are illustrated in Figure 12.
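A minimal sketch of this sampling protocol, encoding only the per-category counts quoted above (data layout and names are ours):

```python
import numpy as np

# Per-category counts from the 4DA protocol of [5].
SOURCE_PER_CLASS = {"amazon": 20, "caltech256": 20, "webcam": 20, "dslr": 8}
TARGET_PER_CLASS = 3

def split_indices(labels, domain_name, role, rng=None):
    """Randomly pick per-category sample indices: 20 source images per
    category (8 for DSLR) or 3 labeled target images per category."""
    rng = np.random.default_rng(rng)
    k = SOURCE_PER_CLASS[domain_name] if role == "source" else TARGET_PER_CLASS
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        picked.extend(rng.choice(idx, size=min(k, len(idx)), replace=False))
    return np.asarray(picked)
```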

[Figure 12: Example laptop images from the four domains (Amazon, Caltech-256, DSLR, Webcam).]

We evaluate the effectiveness of our method in both unsupervised and semi-supervised modes. In unsupervised mode, only labels from the source domain are used for reference, while in semi-supervised mode, labels from both the source and target domains are used. Results for 4DA are shown in Tables 8 and 9.

[Table 8: 4DA results in unsupervised mode.]

[Table 9: 4DA results in semi-supervised mode.]

References

[1] Ming Shao, Dmitry Kit, and Yun Fu, Generalized Transfer Subspace Learning through Low-Rank Constraint, International Journal of Computer Vision (IJCV), 2013. (in press)

[2] Ming Shao, Carlos Castillo, Zhenghong Gu, and Yun Fu, Low-Rank Transfer Subspace Learning, IEEE International Conference on Data Mining (ICDM), 2012.

[3] Guangcan Liu, Zhouchen Lin, and Yong Yu, Robust Subspace Segmentation by Low-Rank Representation, International Conference on Machine Learning (ICML), 2010.

[4] Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin, Graph Embedding and Extensions: A General Framework for Dimensionality Reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2007.

[5] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman, Geodesic Flow Kernel for Unsupervised Domain Adaptation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.