Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill


Page 1: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Beyond Attributes -> Describing Images

Tamara L. Berg, UNC Chapel Hill

Page 2: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Descriptive Text

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns”

Scarlett O’Hara described in Gone with the Wind.

Berg, Attributes Tutorial CVPR13

Page 3: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

More Nuance than Traditional Recognition…

car

shoe

person

Berg, Attributes Tutorial CVPR13

Page 4: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Toward Complex Structured Outputs

car

Berg, Attributes Tutorial CVPR13

Page 5: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Toward Complex Structured Outputs

pink car

Attributes of objects

Berg, Attributes Tutorial CVPR13

Page 6: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Toward Complex Structured Outputs

car on road

Relationships between objects

Berg, Attributes Tutorial CVPR13

Page 7: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Toward Complex Structured Outputs

Telling the “story of an image”

Little pink smart car parked on the side of a road in a London shopping district.

… Complex structured recognition outputs

Berg, Attributes Tutorial CVPR13

Page 8: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Learning from Descriptive Text

Visually descriptive language provides:
• Information about the world, especially the visual world.
• Information about how people construct natural language for imagery.
• Guidance for visual recognition.

How do people describe the world?

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns”

Scarlett O’Hara described in Gone with the Wind.

How does the world work?

What should we recognize?

Berg, Attributes Tutorial CVPR13

Page 9: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Methodology

Generation Methods:
1) Compose descriptions directly from recognized content
2) Retrieve relevant existing text given recognized content

Natural language description

A random Pink Smart Car seen driving around Lambeth Roundabout and onto Lambeth Bridge.

Smart Car. It was so adorable and cute in the parking lot of the post office, I had to stop and take a picture.

Pink Car, Sign, Door, Motorcycle, Tree, Brick building, Dirty Road, Sidewalk, London, Shopping district

Berg, Attributes Tutorial CVPR13

Page 10: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Related Work

• Compose descriptions given recognized content: Yao et al. (2010), Yang et al. (2011), Li et al. (2011), Kulkarni et al. (2011)

• Generation as retrieval: Farhadi et al. (2010), Ordonez et al. (2011), Gupta et al. (2012), Kuznetsova et al. (2012)

• Generation using pre-associated relevant text: Leong et al. (2010), Aker and Gaizauskas (2010), Feng and Lapata (2010a)

• Other (image annotation, video description, etc.): Barnard et al. (2003), Pastra et al. (2003), Gupta et al. (2008), Gupta et al. (2009), Feng and Lapata (2010b), del Pero et al. (2011), Krishnamoorthy et al. (2012), Barbu et al. (2012), Das et al. (2013)

Berg, Attributes Tutorial CVPR13

Page 11: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Method 1: Recognize & Generate

Berg, Attributes Tutorial CVPR13

Page 12: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Baby Talk: Understanding and Generating Simple Image Descriptions

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, Tamara L Berg

CVPR 2011

Page 13: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.”

Kulkarni et al, CVPR11


Page 23: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Methodology
• Vision -- detection and classification
• Text inputs -- statistics from parsing lots of descriptive text
• Graphical model (CRF) to predict best image labeling given vision and text inputs
• Generation algorithms to generate natural language

Kulkarni et al, CVPR11

Page 24: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Vision is hard!

World knowledge (from descriptive text) can be used to smooth noisy vision predictions!

Green sheep

Kulkarni et al, CVPR11
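
The smoothing idea on this slide can be illustrated with a minimal sketch. It assumes made-up detector scores and made-up text statistics; none of the numbers, names, or the scoring rule below come from Kulkarni et al. A noisy vision score for each attribute is combined with a text-derived prior on how often that attribute modifies the object, and the highest combined score wins.

```python
import math

# Hypothetical vision scores: the attribute classifier slightly prefers
# "green" for a sheep (for example, because of the surrounding grass).
vision_scores = {"green": 0.55, "white": 0.40, "black": 0.05}

# Hypothetical text statistics: how often each attribute modifies "sheep"
# in a large collection of descriptive text.
text_prior = {"green": 0.01, "white": 0.70, "black": 0.29}

def smooth(vision, prior, text_weight=1.0):
    """Pick the attribute maximizing log vision score + weighted log text prior."""
    def score(attr):
        return math.log(vision[attr]) + text_weight * math.log(prior[attr])
    return max(vision, key=score)

vision_only = max(vision_scores, key=vision_scores.get)
smoothed = smooth(vision_scores, text_prior)
print(vision_only, "->", smoothed)
```

With these toy numbers the vision-only choice is "green", while the text prior pulls the decision to "white". Setting `text_weight` to 0 recovers the vision-only answer.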

Page 25: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Methodology
• Vision -- detection and classification
• Text -- statistics from parsing lots of descriptive text
• Graphical model (CRF) to predict best image labeling given vision and text inputs
• Generation algorithms to generate natural language

Kulkarni et al, CVPR11

Page 26: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Learning from Descriptive Text

Attributes

Relationships

green green grass by the lake

a very shiny car in the car museum in my hometown of upstate NY.

Our cat Tusik sleeping on the sofa near a hot radiator.

very little person in a big rocking chair

Kulkarni et al, CVPR11

Page 27: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Methodology
• Vision -- detection and classification
• Text -- statistics from parsing lots of descriptive text
• Model (CRF) to predict best image labeling given vision and text based potentials
• Generation algorithms to compose natural language

Kulkarni et al, CVPR11

Page 28: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

System Flow

Input Image

Extract Objects/stuff
a) dog
b) person
c) sofa

Predict attributes
brown 0.32, striped 0.09, furry 0.04, wooden 0.2, feathered 0.04, ...
brown 0.94, striped 0.10, furry 0.06, wooden 0.8, feathered 0.08, ...
brown 0.01, striped 0.16, furry 0.26, wooden 0.2, feathered 0.06, ...

Predict prepositions
near(a,b) 1, near(b,a) 1, against(a,b) 0.11, against(b,a) 0.04, beside(a,b) 0.24, beside(b,a) 0.17, ...
near(a,c) 1, near(c,a) 1, against(a,c) 0.3, against(c,a) 0.05, beside(a,c) 0.5, beside(c,a) 0.45, ...
near(b,c) 1, near(c,b) 1, against(b,c) 0.67, against(c,b) 0.33, beside(b,c) 0.0, beside(c,b) 0.19, ...

Predict labeling – vision potentials smoothed with text potentials
<<null,person_b>,against,<brown,sofa_c>>
<<null,dog_a>,near,<null,person_b>>
<<null,dog_a>,beside,<brown,sofa_c>>

Generate natural language description
This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.

Kulkarni et al, CVPR11
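
The last step of the flow, turning predicted <<attribute, object>, preposition, <attribute, object>> triples into text, can be sketched with a simple template. This is a hypothetical illustration only: the function names and the template are assumptions, and the wording is simpler than what the slide's system produces.

```python
def noun(attr, obj):
    """Join an optional attribute and an object name, e.g. 'brown sofa'."""
    return f"{attr} {obj}" if attr else obj

def describe(triples):
    """Fill a fixed template from ((attr, obj), preposition, (attr, obj)) triples."""
    objects = []
    for first, _prep, second in triples:
        for item in (first, second):
            if item not in objects:  # list each object once, in order seen
                objects.append(item)
    listing = ", ".join(f"one {noun(*item)}" for item in objects)
    sentences = [f"This is a photograph of {listing}."]
    for first, prep, second in triples:
        sentences.append(f"The {noun(*first)} is {prep} the {noun(*second)}.")
    return " ".join(sentences)

# The labeling shown on the slide, with None standing in for a null attribute.
labeling = [
    ((None, "person"), "against", ("brown", "sofa")),
    ((None, "dog"), "near", (None, "person")),
    ((None, "dog"), "beside", ("brown", "sofa")),
]
description = describe(labeling)
print(description)
```

A fixed template like this one explains why the output is accurate but stilted: every sentence has the same shape regardless of what matters in the image.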

Page 29: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Some good results

This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.

Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.

This is a picture of two dogs. The first dog is near the second furry dog.

Kulkarni et al, CVPR11

Page 30: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Some bad results

Here we see one potted plant.

Missed detections:

This is a picture of one dog.

False detections:

There are one road and one cat. The furry road is in the furry cat.

This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road.

This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass.

Incorrect attributes:

This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within the green grass.

Kulkarni et al, CVPR11

Page 31: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Algorithm vs Humans

“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.”

H1: A Lemonaide stand is manned by a blonde child with a cookie.
H2: A small child at a lemonade and cookie stand on a city corner.
H3: Young child behind lemonade stand eating a cookie.

Sounds unnatural!

Kulkarni et al, CVPR11

Page 32: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Method 2: Retrieval based generation

Berg, Attributes Tutorial CVPR13

Page 33: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Every picture tells a story, describing images with meaningful sentences

Ali Farhadi, Mohsen Hejrati, Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth

ECCV 2010

Slides provided by Ali Farhadi

Page 34: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

A Simplified Problem

Represent image/text content as subject-verb-scene triple

Good triples:
• (ship, sail, sea)
• (boat, sail, river)
• (ship, float, water)

Bad triples:
• (boat, smiling, sea) – bad relations
• (train, moving, rail) – bad words
• (dog, speaking, office) – both

Farhadi et al, ECCV10

Page 35: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

The Expanded Model

• Map from Image Space to Meaning Space

• Map from Sentence Space to Meaning Space

• Retrieve Sentences for Images via Meaning Space

Farhadi et al, ECCV10

Page 36: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Retrieval through meaning space

• Map from Image Space to Meaning Space

• Map from Sentence Space to Meaning Space

• Retrieve Sentences for Images via Meaning Space

Farhadi et al, ECCV10

Page 37: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Image Space → Meaning Space

Predict Image Content using trained classifiers

Farhadi et al, ECCV10


Page 39: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Sentence Space → Meaning Space

• Extract subject, verb and scene from sentences in the training data

black cat over pink chair
A black color cat sitting on chair in a room.
cat sitting on a chair looking in a mirror.

Subject: Cat
Verb: Sitting
Scene: room

• Use taxonomy trees

Object
  Vehicle: Car, Train, Bike
  Human
  Animal: Cat, Horse, Dog

Farhadi et al, ECCV10
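
Retrieval through a shared meaning space can be sketched as follows. This is a toy illustration under stated assumptions, not the scoring used by Farhadi et al.: the parent map, the 1.0 / 0.5 / 0.0 slot scores, the predicted image triple, and the third sentence are all hypothetical. Both the image and each sentence are reduced to a (subject, verb, scene) triple, and a small taxonomy gives partial credit to related words.

```python
# Hypothetical parent map for a tiny taxonomy like the one on the slide.
PARENT = {
    "car": "vehicle", "train": "vehicle", "bike": "vehicle",
    "cat": "animal", "horse": "animal", "dog": "animal",
    "vehicle": "object", "animal": "object", "human": "object",
}

def slot_similarity(a, b):
    """1.0 for identical words, 0.5 if they share a taxonomy parent, else 0.0."""
    if a == b:
        return 1.0
    parent_a, parent_b = PARENT.get(a), PARENT.get(b)
    if parent_a is not None and parent_a == parent_b:
        return 0.5
    return 0.0

def triple_similarity(t1, t2):
    """Sum slot similarities over (subject, verb, scene)."""
    return sum(slot_similarity(a, b) for a, b in zip(t1, t2))

def retrieve(image_triple, sentence_triples):
    """Return the sentence whose triple is closest to the image's triple."""
    return max(sentence_triples,
               key=lambda s: triple_similarity(image_triple, sentence_triples[s]))

# Sentence -> triple. The first triple is the slide's example; the rest are made up.
sentences = {
    "A black color cat sitting on chair in a room.": ("cat", "sitting", "room"),
    "A ship sails on the sea.": ("ship", "sail", "sea"),
    "A dog runs across a field.": ("dog", "running", "field"),
}

# Hypothetical noisy prediction from image classifiers (subject is wrong).
image_triple = ("dog", "sitting", "room")
best = retrieve(image_triple, sentences)
print(best)
```

Here the taxonomy lets a wrong subject prediction (dog instead of cat) still match the right sentence, because the verb and scene agree and the subjects are siblings in the tree.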


Page 44: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Data

Rashtchian et al 2010, Farhadi et al 2010: 1,000 images, 5 descriptions per image, 20 object categories

Image-Clef challenge: 20,000 images, 2 descriptions per image, select image categories

Large amounts of paired data can help us study the image-language relationship

More data needed?

Berg, Attributes Tutorial CVPR13

Page 45: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Data exists, but buried in junk!

Through the smoke

Duna

Portrait #5

Mirror and gold

the cat lounging in the sink

Berg, Attributes Tutorial CVPR13

Page 46: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

SBU Captioned Photo Dataset
http://tamaraberg.com/sbucaptions

Our dog Zoe in her bed

Interior design of modern white and brown living room furniture against white wall with a lamp hanging.

The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon

Man sits in a rusted car buried in the sand on Waitarere beach

Emma in her hat looking super cute

Little girl and her dog in northern Thailand. They both seemed interested in what we were doing

1 million captioned photos!

Berg, Attributes Tutorial CVPR13

Page 47: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

“Im2Text: Describing Images Using 1 Million Captioned Photographs”

Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

NIPS 2011

Page 48: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Big Data Driven Generation

One of the many stone bridges in town that carry the gravel carriage roads.

An old bridge over dirty green water.

A stone bridge over a peaceful river.

Generate natural sounding descriptions using existing captions

Ordonez et al, NIPS11

Page 49: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Harness the Web!

Smallest house in paris between red (on right) and beige (on left).

Bridge to temple in Hoan Kiem lake.

The water is clear enough to see fish swimming around in it.

A walk around the lake near our house with Abby.

Hangzhou bridge in West lake.

The Daintree river by boat.

…

SBU Captioned Photo Dataset

Transfer Caption(s)

Global Matching (GIST + Color)

e.g. “The water is clear enough to see fish swimming around in it.”

1 million captioned images!

Ordonez et al, NIPS11
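
Caption transfer by global matching amounts to a nearest-neighbor lookup. The sketch below assumes tiny three-number vectors as stand-ins for the real GIST + color descriptors; the vectors, the function name, and the choice of Euclidean distance are illustrative assumptions rather than details from Ordonez et al.

```python
import math

# Hypothetical low-dimensional stand-ins for global image descriptors,
# each paired with the caption of its database photo.
database = [
    ([0.9, 0.1, 0.3], "Bridge to temple in Hoan Kiem lake."),
    ([0.2, 0.8, 0.5], "The water is clear enough to see fish swimming around in it."),
    ([0.4, 0.4, 0.9], "A walk around the lake near our house with Abby."),
]

def transfer_captions(query, db, k=1):
    """Return the captions of the k database images closest to the query."""
    ranked = sorted(db, key=lambda entry: math.dist(query, entry[0]))
    return [caption for _descriptor, caption in ranked[:k]]

query_descriptor = [0.25, 0.75, 0.45]
top = transfer_captions(query_descriptor, database)
print(top)
```

With a million captioned photos, a linear scan like this would be replaced by an approximate nearest-neighbor index, but the transfer step itself stays the same.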

Page 50: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions)

The bridge over the lake on Suzhou Street.

The Daintree river by boat.

Bridge over Cacapon river.

Iron bridge over the Duck river.

. . .

Transfer Caption(s)

e.g. “The bridge over the lake on Suzhou Street.”

Ordonez et al, NIPS11
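
Reranking by high-level content can be sketched as scoring each globally matched caption against what the detectors report for the query image. The word-overlap score and the detector outputs below are hypothetical simplifications, not the reranking model from the paper.

```python
def rerank(candidates, detected_content):
    """Order candidate captions by how many detected content words they mention."""
    def overlap(caption):
        words = {w.strip(".,").lower() for w in caption.split()}
        return len(words & detected_content)
    return sorted(candidates, key=overlap, reverse=True)

# Candidates returned by global matching (captions taken from the slide).
candidates = [
    "The Daintree river by boat.",
    "The bridge over the lake on Suzhou Street.",
    "Iron bridge over the Duck river.",
]

# Hypothetical detector outputs for the query image.
detected = {"bridge", "lake", "street"}

reranked = rerank(candidates, detected)
print(reranked[0])
```

Because `sorted` is stable, captions with equal overlap keep their global-matching order.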

Page 51: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Results

Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.

Fresh fruit and vegetables at the market in Port Louis Mauritius.

A female Mallard duck in the lake at Luukki Espoo.

Cat in sink.

Good

The cat in the window.

The boat ended up a kilometre from the water in the middle of the airstrip.

Bad

Ordonez et al, NIPS11

Page 52: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Next… Composing novel captions from pieces of existing ones

Berg, Attributes Tutorial CVPR13

Page 53: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Composing captions guessing game

a) monkey playing in the tree canopy, Monte Verde in the rain forest

b) capuchin monkey in front of my window

c) monkey spotted in Apenheul Netherlands under the tree

d) a white-faced or capuchin in the tree in the garden

e) the monkey sitting in a tree, posing for his picture

Berg, Attributes Tutorial CVPR13


Page 55: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

“Collective Generation of Natural Image Descriptions”

Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg and Yejin Choi

ACL 2012

Page 56: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Composing Descriptions

Example Composed Description:

the dirty sheep meandered along a desolate road in the highlands of Scotland through frozen grass

NP: the dirty sheep (Object appearance)
VP: meandered along a desolate road (Object pose)
PP: in the highlands of Scotland (Scene appearance)
PP: through frozen grass (Region appearance & relationship)

Kuznetsova et al, ACL12


Page 58: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Data Processing

1,000,000 images:
o Run object detectors
o Run region based stuff detectors (grass, sky, etc.)
o Run global scene classifiers
o Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene).

Kuznetsova et al, ACL12

Page 59: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Image Description Generation

Computer Vision → Objects, Actions, Stuff, Scenes → Phrase Retrieval → Generation → Description

Kuznetsova et al, ACL12


Page 61: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Retrieving VPs

Detect: dog

Find matching detections by pose similarity

this dog was laying in the middle of the road on a back street in jaco

Closeup of my dog sleeping under my desk.

Peruvian dog sleeping on city street in the city of Cusco, (Peru)

Contented dog just laying on the edge of the road in front of a house..

Kuznetsova et al, ACL12

Page 62: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Retrieving NPs

Detect: fruit

Find matching detections by appearance similarity

Tray of glace fruit in the market at Nice, France

Fresh fruit in the market

A box of oranges was just catching the sun, bringing out detail in the skin.

The street market in Santanyi, Mallorca is a must for the oranges and local crafts.

An orange tree in the backyard of the house.

mandarin oranges in glass bowl

Kuznetsova et al, ACL12

Page 63: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Retrieving PPstuff

Detect: stuff

Find matching regions by appearance + arrangement similarity

Mini Nike soccer ball all alone in the grass

Comfy chair under a tree.

I positioned the chairs around the lemon tree -- it's like a shrine

Cordoba - lonely elephant under an orange tree...

Kuznetsova et al, ACL12

Page 64: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Retrieving PPscene

View from our B&B in this photo

Extract scene descriptor

Find matching images by global scene similarity

Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere

I'm about to blow the building across the street over with my massive lung power.

Only in Paris will you find a bottle of wine on a table outside a bookstore

Kuznetsova et al, ACL12


Page 66: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Candidate phrases by type:
Object NPs: birds, the bird
Actions VPs: are standing, looking for food
Stuff PPs: in water, over water
Scene PPs: in the ocean, near Salt Pond

Composed description (Position 1 to Position 4): birds | over water | are standing | in the ocean

Kuznetsova et al, ACL12

Page 67: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Possible Assignments

[Figure: grid in which each of Position 1 to Position 4 can be filled by any candidate phrase (birds, the bird, are standing, in the ocean)]

Kuznetsova et al, ACL12


Page 70: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Phrases of the Same Type

[Figure: assignment grid of candidate phrases (birds, the bird, are standing, in the ocean) over Position 1 to Position 4]

Kuznetsova et al, ACL12

Page 71: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Singular/Plural Relationships

[Figure: assignment grid of candidate phrases (birds, the bird, are standing, in the ocean) over Position 1 to Position 4]

Kuznetsova et al, ACL12

Page 72: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

ILP Optimization

Optimize for:

Vision scores
o Visual detection/classification scores

Phrase cohesion
o n-gram statistics between phrases
o Co-occurrence statistics between phrase head words

Subject to:

Linguistic constraints
o Allow at most one phrase of each type
o Enforce plural/singular agreement between NP and VP

Discourse constraints
o Prevent inclusion of repeated phrasing

Kuznetsova et al, ACL12
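
The objective and constraints on this slide can be illustrated on a toy scale. The sketch below is not an ILP: it uses exhaustive search over phrase orderings as a stand-in for an ILP solver, which is only practical for a handful of phrases. All scores, the phrase tuple layout, and the function names are hypothetical.

```python
from itertools import permutations

# Hypothetical candidate phrases: (text, type, grammatical number, vision score).
PHRASES = [
    ("birds", "NP", "plural", 0.8),
    ("the bird", "NP", "singular", 0.7),
    ("are standing", "VP", "plural", 0.6),
    ("over water", "PPstuff", None, 0.5),
    ("in the ocean", "PPscene", None, 0.4),
]

# Hypothetical cohesion scores between adjacent phrases (e.g. from n-gram statistics).
COHESION = {
    ("birds", "are standing"): 0.9,
    ("the bird", "are standing"): 0.1,
    ("birds", "over water"): 0.6,
    ("over water", "are standing"): 0.5,
    ("are standing", "in the ocean"): 0.7,
    ("are standing", "over water"): 0.4,
    ("over water", "in the ocean"): 0.3,
}

def feasible(seq):
    """Linguistic constraints: one phrase per type, NP and VP agree in number."""
    types = [p[1] for p in seq]
    if len(set(types)) != len(types):
        return False
    numbers = {p[2] for p in seq if p[1] in ("NP", "VP")}
    return len(numbers) <= 1

def score(seq):
    """Objective: vision scores plus cohesion between neighboring phrases."""
    vision = sum(p[3] for p in seq)
    cohesion = sum(COHESION.get((a[0], b[0]), 0.0) for a, b in zip(seq, seq[1:]))
    return vision + cohesion

def best_description(phrases, positions=4):
    """Try every ordering of distinct phrases and keep the best feasible one."""
    # permutations never reuses a phrase, which covers the no-repetition constraint.
    candidates = (seq for seq in permutations(phrases, positions) if feasible(seq))
    return " ".join(p[0] for p in max(candidates, key=score))

result = best_description(PHRASES)
print(result)
```

With these toy scores the search rejects "the bird" because it disagrees in number with "are standing", and the cohesion terms settle the phrase order.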

Page 73: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Good Examples

This is a sporty little red convertible made for a great day in Key West FL. This car was in the 4th parade of the apartment buildings.

This is a brass viking boat moored on beach in Tobago by the ocean.

The clock made in Korea.

Kuznetsova et al, ACL12

Page 74: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Visual Turing Test

In some cases (16%), ILP generated captions were preferred over human written ones!

Us vs Original Human Written Caption

Kuznetsova et al, ACL12

Page 75: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Bad Results

Grammatically Incorrect
Cognitive Absurdity
Not Relevant
Computer Vision Error

This is a shoulder bag with a blended rainbow effect.

Here you can see a cross by the frog in the sky.

One of the most shirt in the wall of the house.

Kuznetsova et al, ACL12

Page 76: Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

Questions?