1
IEEE 2017 Conference on Computer Vision and Pattern Recognition Online Asymmetric Similarity Learning for Cross-Modal Retrieval Yiling Wu 1;2 , Shuhui Wang 1 , Qingming Huang 1;2 1 Key Lab of Intell. Info. Process., Inst. of Comput. Tech, Chinese Academy of Sciences, China 2 University of Chinese Academy of Sciences, China Motivations : The critical problem in cross-modal retrieval task is how to measure the similarity between data from different modalities. The relations between images and texts are highly asymmetric. There are two kinds of relative similarities that can be used. CNN features are state-of-the-art features, but there are many CNN layers. Choosing which layer to use is a difficult problem. Contributions : We propose an online learning method to learn the similarity function between heterogeneous modalities by preserving the bi- directional relative similarity in the training data. We extend it to an online multiple kernel learning method to address the problem of combining different layers of CNN features for cross-modal retrieval. Learning Bi - direction Relative Similarity : Consider learning a bilinear similarity function Bi-directional relative similarity constraints are indispensable for modeling the cross-modal relation. We expect the similarity function to satisfy the following two conditions simultaneously: The model is trained by the Passive-aggressive (PA) algorithm. We call this method Cross-Modal Online Similarity function learning (CMOS). Framework of Training: Online Multiple Kernel Learning : To select multiple CNN layers, we derive its multiple kernel extension which called Cross-Modal Online Multiple Kernel Similarity function learning (CMOMKS). We extend the primal model to its dual from. A pair of kernel is used for each similarity function: We want to learn the coefficients of linear combinations of the M pairs of kernels, while at the same time we learn each similarity function. Let , we consider: At every iteration, for each of the M pairs of kernels, e.g., ( ; ), we apply the PA algorithm to find the optimal coefficient for the kernelized similarity function, and then apply the Hedging algorithm to update the combination weight by where equals to 1 when ; + ; ≤0 or + ; ; ≤0, and 0 otherwise. Mistake bound can be easily obtained according to the mistake bounds of the PA and the Hedge algorithm. Experiment Setup: Datasets: Pascal VOC 2007 consists of annotated consumer photographs collected from Flickr. After removing images without tags, we obtain a training set with 5000 images and a test set with 4919 images. Features: Images are represented by ‘conv2’, ‘conv5’, ‘fc6’, ‘fc7’,‘prob’ CNN features. We concatenate CNN features for primal methods, and average kernel matrices for KCCA and KGMLDA. Texts are represented by tag occurrence features. Performance: Conclusions: We have proposed CMOS and its multiple kernel extension CMOMKS to learn a similarity function between heterogeneous data modalities by preserving relative similarity constraints from two directions. The CMOS online model is learned by the Passive-Aggressive algorithm. Multiple kernelized similarity function is further combined in CMOMKS by the Hedging algorithm. Experimental results on three public cross-modal datasets have demonstrated that the proposed methods outperform state-of-the-art approaches. S 1 θ 1 S S 2 θ 2 S M θ M sample two directional triplets The Icelandic is a "five-gaited" breed, known for its sure-footedness and ability to cross rough terrain. As well as the typical gaits of walk, trot, and canter/gallop, the breed Stanford Memorial Church is located at the end of the mile-long axis of Stanford University, visible from a distance; the main vista begins at the main entrance, continues to Palm Drive, traverses "the Oval" After the defeat at Aspern-Essling, Napoleon took more than six weeks in planning and preparing for contingencies ''Southern Cross'' returned to Cape Adare from Australia on 28 January 1900. Borchgrevink began the process of dismantling the camp and transferring its supplies to the ship eparting from the main central administrative system generally known as the Three Departments and Six Ministries system, which was instituted by various dynasties since late Han (202 BCE - 220 CE), the Ming administration had onlyVrba was criticized in 2001 in a series of Leadership under Duress: The Working Group in Slovakia, 19421944‘’edited by a group of leading Israeli historians with ties to the Slovak communityDespite their size, ocean sunfish are docile, and pose no threat to human divers. Injuries from sunfish are rare, although there is a slight danger from large sunfish leaping out of the water onto boats; in one instance The Icelandic is a "five-gaited" breed, known for its sure-footedness and ability to cross rough terrain. As well as the typical gaits of walk, trot, and canter/gallop, the breed Vrba was criticized in 2001 in a series of articles‘’Leadership under Duress: The Working Group in Slovakia, 19421944‘’edited by a group of leading Israeli historians with ties to the Slovak communityimage features text features extract text features extract image features online update C1: 1 , 1 > ( 2 , 1 ) C3: 2 , 1 < 2 , 2 C2: 1 , 2 < ( 2 , 2 ) C4: 1 , 1 < ( 1 , 2 ) 2 1 1 2 C1: 1 , 1 > ( 2 , 1 ) C3: 2 , 1 < 2 , 2 C2: 1 , 2 < ( 2 , 2 ) 2 1 2 C4: 1 , 1 > ( 1 , 2 ) 1 method img2txt txt2img average CCA 0.346 0.395 0.371 PLS 0.532 0.490 0.511 GMLDA 0.550 0.539 0.545 ml-CCA 0.584 0.572 0.578 LCFS 0.551 0.528 0.540 Bi-CMSRM 0.541 0.516 0.529 SSI 0.576 0.530 0.553 CMOS 0.586 0.600 0.593 KCCA 0.647 0.655 0.651 KGMLDA 0.675 0.676 0.676 CMOMKS 0.709 0.707 0.708 img2txt precision-recall curve txt2img precision-recall curve

Online Asymmetric Similarity Learning for Cross-Modal ...openaccess.thecvf.com/content_cvpr_2017/poster/1727_POSTER.pdf · Online Asymmetric Similarity Learning for Cross-Modal Retrieval

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Online Asymmetric Similarity Learning for Cross-Modal ...openaccess.thecvf.com/content_cvpr_2017/poster/1727_POSTER.pdf · Online Asymmetric Similarity Learning for Cross-Modal Retrieval

IEEE 2017 Conference on

Computer Vision and Pattern

Recognition

Online Asymmetric Similarity Learning for Cross-Modal RetrievalYiling Wu1;2, Shuhui Wang1, Qingming Huang1;2

1Key Lab of Intell. Info. Process., Inst. of Comput. Tech, Chinese Academy of Sciences, China2University of Chinese Academy of Sciences, China

Motivations: The critical problem in cross-modal retrieval task is how to measure

the similarity between data from different modalities.

The relations between images and texts are highly asymmetric. There are two kinds of relative similarities that can be used.

CNN features are state-of-the-art features, but there are many CNN layers. Choosing which layer to use is a difficult problem.

Contributions: We propose an online learning method to learn the similarity

function between heterogeneous modalities by preserving the bi-directional relative similarity in the training data.

We extend it to an online multiple kernel learning method to address the problem of combining different layers of CNN features for cross-modal retrieval.

Learning Bi-direction Relative Similarity: Consider learning a bilinear similarity function

Bi-directional relative similarity constraints are indispensable for modeling the cross-modal relation.

We expect the similarity function to satisfy the following two conditions simultaneously:

The model is trained by the Passive-aggressive (PA) algorithm. We call this method Cross-Modal Online Similarity function learning (CMOS).

Framework of Training:

Online Multiple Kernel Learning:

To select multiple CNN layers, we derive its multiple kernel extension

which called Cross-Modal Online Multiple Kernel Similarity

function learning (CMOMKS).

We extend the primal model to its dual from. A pair of kernel is used

for each similarity function:

We want to learn the coefficients of linear combinations of the M pairs of kernels, while at the same time we learn each similarity function. Let , we consider:

At every iteration, for each of the M pairs of kernels, e.g., (𝑘𝑗𝑣; 𝑘𝑗

𝑡), we apply the PA algorithm to find the optimal coefficient for the kernelized similarity function, and then apply the Hedging algorithm to update the combination weight by

where 𝑧𝑗 𝑖 equals to 1 when 𝑠 𝑣𝑖; 𝑡𝑖+ − 𝑠 𝑣𝑖; 𝑡𝑖

− ≤ 0 or 𝑠 𝑣𝑖+; 𝑡𝑖 −

𝑠 𝑣𝑖−; 𝑡𝑖 ≤ 0, and 0 otherwise.

Mistake bound can be easily obtained according to the mistake bounds of the PA and the Hedge algorithm.

Experiment Setup: Datasets: Pascal VOC 2007 consists of annotated consumer

photographs collected from Flickr. After removing images without tags, we obtain a training set with 5000 images and a test set with 4919 images.

Features: Images are represented by ‘conv2’, ‘conv5’, ‘fc6’, ‘fc7’,‘prob’ CNN features. We concatenate CNN features for primal methods, and average kernel matrices for KCCA and KGMLDA. Texts are represented by tag occurrence features.

Performance:

Conclusions: We have proposed CMOS and its multiple kernel extension

CMOMKS to learn a similarity function between heterogeneous data modalities by preserving relative similarity constraints from two directions.

The CMOS online model is learned by the Passive-Aggressive algorithm. Multiple kernelized similarity function is further combined in CMOMKS by the Hedging algorithm.

Experimental results on three public cross-modal datasets have demonstrated that the proposed methods outperform state-of-the-art approaches.

S1θ1

S

S2θ2

SMθM

sample two

directional

triplets

The Icelandic is a "five-gaited" breed,

known for its sure-footedness and

ability to cross rough terrain. As well

as the typical gaits of walk, trot, and

canter/gallop, the breed …

Stanford Memorial Church is located at

the end of the mile-long axis of Stanford

University, visible from a distance; the

main vista begins at the main entrance,

continues to Palm Drive, traverses "the

Oval"After the defeat at Aspern-Essling,

Napoleon took more than six weeks in

planning and preparing for contingencies

before he made another attempt at

crossing the Danube.David G. Chandler,

''The Campaigns of Napoleon.

''Southern Cross'' returned to Cape Adare

from Australia on 28 January 1900.

Borchgrevink began the process of

dismantling the camp and transferring its

supplies to the ship

eparting from the main central

administrative system generally known as

the Three Departments and Six Ministries

system, which was instituted by various

dynasties since late Han (202 BCE - 220

CE), the Ming administration had only…

Vrba was criticized in 2001 in a series of

articles—‘’Leadership under Duress: The

Working Group in Slovakia, 1942–

1944‘’—edited by a group of leading

Israeli historians with ties to the Slovak

community…

Despite their size, ocean sunfish are

docile, and pose no threat to human

divers. Injuries from sunfish are rare,

although there is a slight danger from

large sunfish leaping out of the water onto

boats; in one instance …

The Icelandic is a "five-gaited" breed,

known for its sure-footedness and

ability to cross rough terrain. As well

as the typical gaits of walk, trot, and

canter/gallop, the breed …

Vrba was criticized in 2001 in a series of

articles—‘’Leadership under Duress: The

Working Group in Slovakia, 1942–

1944‘’—edited by a group of leading

Israeli historians with ties to the Slovak

community…

image features

text features

extract

text

features

extract

image

featuresonline update

C1:

𝑟 𝑣1, 𝑡1 > 𝑟(𝑣2, 𝑡1)

C3: 𝑟 𝑣2, 𝑡1 < 𝑟 𝑣2, 𝑡2

C2:

𝑟 𝑣1, 𝑡2 < 𝑟(𝑣2, 𝑡2)

C4: 𝑟 𝑣1, 𝑡1 < 𝑟(𝑣1, 𝑡2)

𝑡2

𝑡1

𝑣1

𝑣2

C1:

𝑠 𝑣1, 𝑡1 > 𝑠(𝑣2, 𝑡1)

C3: 𝑠 𝑣2, 𝑡1 < 𝑠 𝑣2, 𝑡2

C2:

𝑠 𝑣1, 𝑡2 < 𝑠(𝑣2, 𝑡2)

𝑡2

𝑡1

𝑣2

C4: 𝑠 𝑣1, 𝑡1 > 𝑠(𝑣1, 𝑡2)

𝑣1

method img2txt txt2img average

CCA 0.346 0.395 0.371

PLS 0.532 0.490 0.511

GMLDA 0.550 0.539 0.545

ml-CCA 0.584 0.572 0.578

LCFS 0.551 0.528 0.540

Bi-CMSRM 0.541 0.516 0.529

SSI 0.576 0.530 0.553

CMOS 0.586 0.600 0.593

KCCA 0.647 0.655 0.651

KGMLDA 0.675 0.676 0.676

CMOMKS 0.709 0.707 0.708

img2txt precision-recall curve

txt2img precision-recall curve