FROM IMAGES TO SENTENCES: SCENE DESCRIPTION (KHUSHALI ACHARYA, 1517MECE30008)


Page 1: scene description

FROM IMAGES TO SENTENCES

SCENE DESCRIPTION

KHUSHALI ACHARYA 1517MECE30008

Page 2: scene description

AGENDA

1 Introduction

2 Motivation

3 Related research work

4 Our Approach

5 Conclusion

Page 3: scene description

INTRODUCTION TO SCENE DESCRIPTION

1

Page 4: scene description

WHAT IS IT?

“Interpreting images and generating sentences.”

“Scene interpretation means understanding every-day occurrences or recognizing rare events.”

“Scene interpretations are Controlled Hallucinations.”

Page 5: scene description

WHY DO WE NEED SUCH SYSTEMS?

Isn't a picture enough to depict things clearly?

Page 6: scene description

Ever imagined a cricket match without commentary? Or a movie without dialogues?

Page 7: scene description

It has been estimated that more than 80% of the activities we do online are text-based.

Page 8: scene description

A layman can’t understand medical reports unless a doctor explains them or they are provided in written form.

Medical reports

Page 9: scene description

A tagged location becomes clear when the place name is described.

Page 10: scene description

Thus we saw that the description of an image adds an interestingness measure to it.

Page 11: scene description

MOTIVATION FOR THE PROBLEM

2

Page 12: scene description

WHAT ARE THE REAL-TIME APPLICATIONS?

2.1

Page 13: scene description

• Self-aware cognitive robots
• Assisting visually impaired people
• Soccer game analysis
• Image search/retrieval systems
• Street traffic observation
• Criminal act recognition
• Agricultural sector

Page 14: scene description

Some Applications of scene description

Page 15: scene description

HOW IS IT HELPFUL TO SOCIETY?

2.2

Page 16: scene description

Assists Visually Impaired People

Screen Reader

Screen readers are software programs that convert text into synthesized speech, so that blind people can listen to web content.

LIMITATIONS:

• Screen readers cannot describe images.
• Screen readers cannot survey the entirety of a web page as a sighted user might; they cannot always intelligently skip extraneous content, such as advertisements or navigation bars.

Page 17: scene description

Scene description can be helpful to blind people in this manner.

Page 18: scene description

Images are captured and unusual activities are recorded.

The features of the images are extracted, which then help in crime investigation.

Criminal Act Recognition

Page 19: scene description

Efficient and consistent scene interpretation is a prerequisite for self-aware cognitive robots to work.

Human Computer Interaction

Object Recognition and scene interpretation

Spatial Relation Extraction

Page 20: scene description

RELATED EXISTING RESEARCH WORK

3

Page 21: scene description

RESEARCH PAPERS

3.1

Page 22: scene description

1. “Midge: Generating image descriptions from computer vision detection”
U. of Aberdeen, Oregon Health and Science University, Stony Brook University, U. of Maryland, Columbia University, U. of Washington, MIT.
Proposed: A novel generation system that composes human-like descriptions of images from computer vision detections.
Dataset: For training, 700,000 (Flickr, 2011) images with associated descriptions from the dataset in Ordonez et al. (2011); for evaluation, 840 PASCAL images.
Conclusion: Midge generates a well-formed description of an image by filtering attribute detections that are unlikely and placing objects into an ordered syntactic structure.

2. “Every picture tells a story: Generating sentences from images”
Farhadi, A., Hejrati, S. M. M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. A. (2010). Springer.
Proposed: Attempts to “generate” sentences by first learning from a set of human-annotated examples, and producing the same sentence if both image and sentence share common properties in terms of their triplets: (Nouns-Verbs-Scenes).
Dataset: PASCAL 2008 images with human annotation.
Conclusion: Sentences are rich, compact and subtle representations of information. Even so, we can predict good sentences for images that people like. The intermediate meaning representation is one key component of the model, as it allows benefiting from distributional semantics.

Page 23: scene description

3. “Babytalk: Understanding and generating simple image descriptions”
G. Kulkarni, V. Premraj, V. Ordonez. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 12, December 2013.
Proposed: Uses detectors for object and scene detection and forms quadruplets: (Nouns-Verbs-Scenes-Prepositions).
Dataset: PASCAL 2008 images.
Conclusion: Human-forced-choice experiments demonstrate the quality of the generated sentences over previous approaches. One key to the success of the system was automatically mining and parsing large text collections to obtain statistical models for visually descriptive language.

4. “Choosing Linguistics over Vision to Describe Images”
Ankush Gupta, Yashaswi Verma, C. V. Jawahar. International Institute of Information Technology, Hyderabad, India - 500032.
Proposed: The problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions.
Dataset: PASCAL dataset.
Conclusion: They proposed a novel approach for generating relevant, fluent and human-like descriptions for images without relying on any object detectors, classifiers, hand-written rules or heuristics.

Page 24: scene description

THEIR APPROACH

3.2

Page 25: scene description

1.) Choosing Linguistics over Vision to Describe Images

i. Given an unseen image,

ii. find the K images most similar to it among the training images, and, using the phrases extracted from their descriptions,

iii. generate a ranked list of triples, which is then used to compose a description for the new image.
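The retrieval step above can be sketched as a nearest-neighbour search in image feature space. This is a minimal illustration, assuming images are already represented as fixed-length feature vectors; the feature values below are hypothetical toy data, not from any real system.

```python
import numpy as np

def k_nearest_images(query_feat, train_feats, k=3):
    """Return indices of the k training images closest to the query
    in feature space (Euclidean distance)."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]

# Hypothetical 4-dimensional feature vectors for 5 training images
train_feats = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.1],
    [0.0, 0.9, 0.8, 0.1],
    [0.1, 0.8, 0.9, 0.0],
    [0.5, 0.5, 0.5, 0.5],
])
query = np.array([0.85, 0.15, 0.05, 0.15])
idx = k_nearest_images(query, train_feats, k=2)  # the two closest images
```

The phrases attached to the images at `idx` would then feed the triple-ranking step.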

Page 26: scene description

i.) Input image
ii.) Neighboring images with extracted phrases
iii.) Triple selection and sentence generation

Page 27: scene description

FAILURE SCENARIO

A motor racer is speeding through a splash mud.

A water cow is grazing along a roadside.

An orange fixture is hanging in a messy kitchen.

Page 28: scene description

OUR APPROACH

4

Page 29: scene description

USING

OPENCV, NLP & SVM

4.1

Page 30: scene description

OpenCV

• Open source computer vision and machine learning software library.

• More than 2500 optimized algorithms.

• C++, C, Python, Java and MATLAB interfaces

• Supports Windows, Linux, Android and Mac OS

Page 31: scene description

NLP (Natural Language Processing)

It is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
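On the NLP side, once the vision stage yields a triplet, a sentence must be realized from it. The sketch below is a toy template-based surface realizer, not the deck's actual method; the function name and inputs are hypothetical, and real systems use grammars or language models.

```python
def describe(subject, verb, obj, scene=None):
    """Compose a simple English sentence from a detected triplet.
    (A toy surface realizer; assumes singular, countable nouns.)"""
    sentence = f"A {subject} is {verb} a {obj}"
    if scene:
        sentence += f" in the {scene}"
    return sentence + "."

caption = describe("dog", "chasing", "ball", scene="park")
# caption == "A dog is chasing a ball in the park."
```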

Page 32: scene description

SVM (Support Vector Machines)

A discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
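To make the hyperplane idea concrete, here is a minimal linear SVM trained by sub-gradient descent on the hinge loss. This is an illustrative sketch with made-up toy data, not the system's classifier; production code would use an optimized library implementation.

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Sub-gradient descent on the regularized hinge loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:           # inside the margin: push the hyperplane
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                    # correctly classified: only regularize
                w -= lr * lam * w
    return w, b

# Toy linearly separable data (hypothetical, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)  # new points are categorized by the hyperplane's side
```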

Page 33: scene description

SYSTEM FLOW

4.2

Page 34: scene description

1. Take the query image as input.

2. Detect objects in the query image.

3. Search corpus data and extract the shortest sentences.

4. RDF (Resource Description Framework) parser: produce triples of the form <object1, predicate1, object2>.

5. Google Image API: retrieve the top 10 images.

6. Match the query image against each retrieved image and compute a score.

7. The highest-scoring images give our matching triplet.
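The triple-extraction and scoring steps of the flow above can be sketched as follows. This is a deliberately naive placeholder (a real RDF parser and image matcher would replace both functions), and the sentences and object labels are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

def sentence_to_triple(sentence):
    """Naive subject-verb-object split, standing in for a real RDF parser."""
    words = sentence.lower().strip(".").split()
    if len(words) < 3:
        return None
    return Triple(words[0], words[1], words[-1])

def score_match(query_objects, triple):
    """Score a triple by how many detected objects it mentions."""
    fields = {triple.subject, triple.obj}
    return len(set(query_objects) & fields)

triples = [sentence_to_triple(s) for s in ["dog chases ball", "cat sits mat"]]
best = max(triples, key=lambda t: score_match(["dog", "ball"], t))
```

The highest-scoring triple (here, the one mentioning both detected objects) would then be verbalized as the final description.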

Page 35: scene description

EXPERIMENTAL SETUP

4.3

Page 36: scene description

DATA SET

PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning)

It provides standardized image data sets for object class recognition

Technology

JAVA/PYTHON

Page 37: scene description

5. CONCLUSION

Thus we saw the fundamentals of scene description, its applications, previous work in this field, and our approach for designing this system.

Page 38: scene description

THANK YOU!