Learning from Descriptive Text Tamara L Berg Stony Brook University


Vision

Humans

Language

Tags: canon, eos, macro, japan, vacation, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.

It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long.

Visually Descriptive Text

Visually descriptive language provides:
• information about how people construct natural language for imagery.
• information about the world, especially the visual world.
• guidance for computational visual recognition.

How do people describe the world?

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Gone with the Wind

How does the world work?

What should we recognize?






from Gone with the Wind by Margaret Mitchell



What’s in a description?

What do people describe?

“A bearded man is holding a child in a sling.”
“A bearded man stands while holding a small child in a green sheet.”
“A bearded man with a baby in a sling poses.”
“Man standing in kitchen with little girl in green sack.”
“Man with beard and baby”

What’s in this image?

man, baby, sling, shirt, glasses, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, …
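The gap between what is in the image and what people describe can be made concrete by counting how many of the captions mention each object. The following is only a minimal sketch: the captions are copied from the slide, the object list is a subset of the slide's list, and the function name `mention_counts` and the whole-word matching rule are assumptions made for illustration.

```python
import re
from collections import Counter

# The five captions from the slide.
captions = [
    "A bearded man is holding a child in a sling.",
    "A bearded man stands while holding a small child in a green sheet.",
    "A bearded man with a baby in a sling poses.",
    "Man standing in kitchen with little girl in green sack.",
    "Man with beard and baby",
]

# A subset of the objects listed as present in the image.
objects = ["man", "baby", "sling", "fridge", "watermelon", "beard"]

def mention_counts(captions, objects):
    """Count, per object, how many captions mention it as a whole word."""
    counts = Counter()
    for caption in captions:
        words = set(re.findall(r"[a-z]+", caption.lower()))
        for obj in objects:
            if obj in words:
                counts[obj] += 1
    # Objects never mentioned get an explicit zero.
    return {obj: counts[obj] for obj in objects}

counts = mention_counts(captions, objects)
for obj, n in counts.items():
    print(f"{obj}: mentioned in {n} of {len(captions)} captions")
```

Exact word matching is deliberately crude: "bearded" does not count as a mention of "beard", and "child" does not count as "baby". A more careful version would need stemming or synonym handling.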

What’s in a description?

1) Given an image, predict what people will describe
2) Given a caption, predict what’s in the image

“looking for castles in the clouds out my car window”
clouds ✔, car ✖, window ✖, castle ?

“two women sitting brunette blonde on bench reading magazine”
women ✔, bench ✔, magazine ✔, grass ✖, skirt ✖, …

e.g. Spain & Perona, 2010

President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

Who’s in the picture?
T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth

Model                          Accuracy of labeling
Vision model, no Lang model    67%
Vision model + Lang model      78%









Vision is hard

World knowledge (from descriptive text) can be used to smooth noisy vision predictions!

Green sheep
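A toy illustration of how a text-derived prior might correct a prediction such as "green sheep". This is a sketch under stated assumptions, not the model used in the talk: all numbers are hypothetical, and the weighted geometric mean in `smooth` is just one simple way to combine the two sources.

```python
def smooth(vision_scores, text_prior, weight=0.5):
    """Combine vision scores with a text prior (weighted geometric mean), then renormalize."""
    combined = {
        label: (vision_scores[label] ** (1 - weight)) * (text_prior[label] ** weight)
        for label in vision_scores
    }
    total = sum(combined.values())
    return {label: value / total for label, value in combined.items()}

# Hypothetical classifier scores for the color of a detected sheep
# (for example, confused by the surrounding grass).
vision = {"green": 0.55, "white": 0.45}

# Hypothetical prior from descriptive text: how often each color word
# modifies "sheep".
prior = {"green": 0.02, "white": 0.98}

smoothed = smooth(vision, prior)
best = max(smoothed, key=smoothed.get)
print(best)
```

With these made-up numbers the vision scores alone favor "green", while the smoothed scores favor "white".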

Learning World Knowledge

Attributes

Relationships

green green grass by the lake

a very shiny car in the car museum in my hometown of upstate NY.

Our cat Tusik sleeping on the sofa near a hot radiator.

very little person in a big rocking chair

BabyTalk: Understanding and Generating Simple Image Descriptions
Kulkarni, Premraj, Dhar, Li, Choi, AC Berg, TL Berg, CVPR 2011

System Flow

Input Image

Extract Objects/stuff

a) dog
b) person
c) sofa

Predict attributes

a) dog: brown 0.01, striped 0.16, furry 0.26, wooden 0.2, feathered 0.06, ...
b) person: brown 0.32, striped 0.09, furry 0.04, wooden 0.2, feathered 0.04, ...
c) sofa: brown 0.94, striped 0.10, furry 0.06, wooden 0.8, feathered 0.08, ...

Predict prepositions

near(a,b) 1, near(b,a) 1, against(a,b) 0.11, against(b,a) 0.04, beside(a,b) 0.24, beside(b,a) 0.17, ...
near(a,c) 1, near(c,a) 1, against(a,c) 0.3, against(c,a) 0.05, beside(a,c) 0.5, beside(c,a) 0.45, ...
near(b,c) 1, near(c,b) 1, against(b,c) 0.67, against(c,b) 0.33, beside(b,c) 0.0, beside(c,b) 0.19, ...

Predict labeling – vision potentials smoothed with text potentials
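One way to picture this smoothing step is the sketch below. The vision potentials are the slide's preposition scores for three ordered object pairs; the text potentials are hypothetical values invented for illustration. Each pair is also scored independently by a simple product, which is a simplification: it is not the joint inference a full system would run over objects, attributes, and prepositions together.

```python
# Vision potentials for prepositions, taken from the slide
# (a = dog, b = person, c = sofa).
vision = {
    ("person", "sofa"): {"near": 1.0, "against": 0.67, "beside": 0.0},
    ("dog", "person"): {"near": 1.0, "against": 0.11, "beside": 0.24},
    ("dog", "sofa"): {"near": 1.0, "against": 0.3, "beside": 0.5},
}

# Hypothetical text potentials: how plausible each preposition is for the
# pair according to descriptive text. These numbers are made up.
text = {
    ("person", "sofa"): {"near": 0.2, "against": 0.5, "beside": 0.3},
    ("dog", "person"): {"near": 0.5, "against": 0.1, "beside": 0.4},
    ("dog", "sofa"): {"near": 0.2, "against": 0.2, "beside": 0.6},
}

def label_pairs(vision, text):
    """For each object pair, pick the preposition maximizing vision * text."""
    labels = {}
    for pair, scores in vision.items():
        labels[pair] = max(scores, key=lambda prep: scores[prep] * text[pair][prep])
    return labels

labels = label_pairs(vision, text)
for (first, second), prep in labels.items():
    print(f"{first} {prep} {second}")
```

Vision alone would choose "near" for every pair, since its score is always 1. With these made-up text potentials the sketch selects "against" for person/sofa, "near" for dog/person, and "beside" for dog/sofa.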

<<null,person_b>,against,<brown,sofa_c>> <<null,dog_a>,near,<null,person_b>> <<null,dog_a>,beside,<brown,sofa_c>>

Generate natural language description

This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.
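The generation step can be approximated with fixed templates over the predicted triples. The sketch below is a deliberately simplified stand-in for the system's generator: it emits one sentence per triple and does not reproduce the opening sentence or the merged clauses of the description above. The function names `noun_phrase` and `describe` are hypothetical.

```python
def noun_phrase(attribute, obj):
    """Render an (attribute, object) pair; a missing attribute is skipped."""
    return f"the {attribute} {obj}" if attribute else f"the {obj}"

def describe(triples):
    """Turn ((attr, obj), preposition, (attr, obj)) triples into sentences."""
    sentences = []
    for (attr1, obj1), prep, (attr2, obj2) in triples:
        subject = noun_phrase(attr1, obj1)
        # Capitalize only the first letter of the subject phrase.
        subject = subject[0].upper() + subject[1:]
        sentences.append(f"{subject} is {prep} {noun_phrase(attr2, obj2)}.")
    return " ".join(sentences)

# The predicted labeling from the slide, with None standing in for "null".
triples = [
    ((None, "person"), "against", ("brown", "sofa")),
    ((None, "dog"), "near", (None, "person")),
    ((None, "dog"), "beside", ("brown", "sofa")),
]

description = describe(triples)
print(description)
```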

This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.

Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.

BabyTalk results

This is a picture of two dogs. The first dog is near the second furry dog.

Objects, Attributes, Prepositions










• Recognition is beginning to work

• Open question – what should we recognize?

• Maybe objects aren’t (always) the right base level entities

Object Recognition

Parts, Poselets, Attributes
For example: [Fergus, Perona, Zisserman 2003], [Bourdev, Malik 2009], …

Slide Credit: Ali Farhadi

Learn which terms in descriptions are depictable

Fully beaded with megawatt crystals, this Christian Louboutin suede pump matches the gleam in your eye.

Pump's linear heel plays up the alluring curves of its dipped sides.

Round toe frames low-cut vamp.

Tonally topstitched collar.

4" straight, covered heel shows off signature red sole.

Creamy leather lining with padded insole.

"Fifi" is made in Italy.

attributes

Automatically Discovering Attributes from Noisy Web Data
T.L. Berg, A.C. Berg, J. Shih, ECCV 2010

Given Web Images + Noisy Text Descriptions:
1) Discover visual attribute terms in text descriptions - likely domain dependent
2) Learn appearance models for attributes without labeled data
3) Characterize attributes by: type, localizability

Object Recognition

For example: [Oliva, Torralba 2001],[SUN 2010], …

Slide Credit: Ali Farhadi

Scenes

What are the right quanta of Recognition?

Farhadi & Sadeghi, Recognition using Visual Phrases, CVPR 2011

Participating in phrases profoundly affects the appearance of objects

Farhadi & Sadeghi, Recognition using Visual Phrases, CVPR 2011

Maybe descriptive text can inform entity hypotheses!

“the dog is sleeping”

“sleeping dog in delhi”

“A dog is sleeping in”

“a sleeping dog in NTHU”


“cat in a bag”

“cat in the bag”

“cat in bag”

“the cat is in the bag”
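A rough sketch of how recurring phrasings in captions could surface candidate entities such as "cat in bag". It assumes a tiny hand-picked stopword list and plain whitespace tokenization, both chosen only for illustration. The captions are from the slides.

```python
from collections import Counter

# Assumed stopword list: determiners and copulas only.
STOPWORDS = {"a", "an", "the", "is", "are"}

def normalize(caption):
    """Lowercase a caption and drop stopwords so variants collapse together."""
    words = caption.lower().split()
    return " ".join(word for word in words if word not in STOPWORDS)

captions = [
    "cat in a bag",
    "cat in the bag",
    "cat in bag",
    "the cat is in the bag",
    "the dog is sleeping",
    "sleeping dog in delhi",
]

phrase_counts = Counter(normalize(caption) for caption in captions)
top_phrase, top_count = phrase_counts.most_common(1)[0]
print(top_phrase, top_count)
```

All four cat captions collapse to the same normalized phrase. The two dog captions do not collapse, because word order and extra words differ, so a realistic version would need a looser notion of phrase match.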



Conclusion

Use large pools of descriptive text to:

Learn how people describe the visual world

Learn how the world works

Guide future efforts in recognition

Apply this knowledge to multi-modal collections & applications

Acknowledgements

• Collaborators: Alex Berg, David Forsyth, Jaety Edwards, Jonathan Shih, Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Vicente Ordonez, Siming Li, Yejin Choi, Kota Yamaguchi

• Funded by NSF Faculty Early Career Development (CAREER) Program: Award #1054133

http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=1054133&WT.z_pims_id=503214
