FROM IMAGES TO SENTENCES: SCENE DESCRIPTION (KHUSHALI ACHARYA, 1517MECE30008)


Page 1: scene description

FROM IMAGES TO SENTENCES

SCENE DESCRIPTION

KHUSHALI ACHARYA 1517MECE30008

Page 2: scene description

AGENDA

1 Introduction

2 Motivation

3 Related research work

4 Our Approach

5 Conclusion

Page 3: scene description

INTRODUCTION TO SCENE DESCRIPTION

1

Page 4: scene description

WHAT IS IT?

“Interpreting images and generating sentences.”

“Scene interpretation means understanding every-day occurrences or recognizing rare events.”

“Scene interpretations are Controlled Hallucinations.”

Page 5: scene description

WHY DO WE NEED SUCH SYSTEMS?

Isn't a picture enough to depict things clearly?

Page 6: scene description

Ever imagined a cricket match without commentary? Or a movie without dialogues?

Page 7: scene description

It has been estimated that more than 80% of the activities we do online are text-based.

Page 8: scene description

A layman can’t understand medical reports unless a doctor explains them or they are provided in written form.

Medical reports

Page 9: scene description

A tagged location becomes clear when the place name is described.

Page 10: scene description

Thus we saw that the description of an image adds an interestingness measure to it.

Page 11: scene description

MOTIVATION FOR THE PROBLEM

2

Page 12: scene description

WHAT ARE THE REAL-TIME APPLICATIONS?

2.1

Page 13: scene description

• Self-aware cognitive robots
• Assisting visually impaired people
• Soccer game analysis
• Image search/retrieval systems
• Street traffic observation
• Criminal act recognition
• Agricultural sector

Page 14: scene description

Some Applications of scene description

Page 15: scene description

HOW IS IT HELPFUL TO SOCIETY?

2.2

Page 16: scene description

Assists Visually Impaired People

Screen Reader

Screen readers are software programs that convert text into synthesized speech, so that blind people can listen to web content.

LIMITATIONS:

• Screen readers cannot describe images.
• Screen readers cannot survey the entirety of a web page as a sighted user might; they cannot always intelligently skip extraneous content, such as advertisements or navigation bars.

Page 17: scene description

Scene description can be helpful to blind people in this manner.

Page 18: scene description

Images are captured and unusual activities are recorded.

The features of the images are extracted, which then help in crime investigation.

Criminal Act Recognition

Page 19: scene description

Efficient and consistent scene interpretation is a prerequisite for self-aware cognitive robots to work.

Human Computer Interaction

Object Recognition and scene interpretation

Spatial Relation Extraction

Page 20: scene description

RELATED EXISTING RESEARCH WORK

3

Page 21: scene description

RESEARCH PAPERS

3.1

Page 22: scene description

1. “Midge: Generating image descriptions from computer vision detection”
U. of Aberdeen, Oregon Health and Science University, Stony Brook University, U. of Maryland, Columbia University, U. of Washington, MIT.
Proposed: A novel generation system that composes human-like descriptions of images from computer vision detections.
Dataset: For training, 700,000 (Flickr, 2011) images with associated descriptions from the dataset in Ordonez et al. (2011); for evaluation, 840 PASCAL images.
Conclusion: Midge generates a well-formed description of an image by filtering attribute detections that are unlikely and placing objects into an ordered syntactic structure.

2. “Every picture tells a story: Generating sentences from images”
Farhadi, A., Hejrati, S. M. M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. A. (2010). Springer.
Proposed: Attempts to “generate” sentences by first learning from a set of human-annotated examples, and producing the same sentence if both image and sentence share common properties in terms of their triplets: (Nouns-Verbs-Scenes).
Dataset: PASCAL 2008 images with human annotation.
Conclusion: Sentences are rich, compact and subtle representations of information. Even so, we can predict good sentences for images that people like. The intermediate meaning representation is one key component of the model, as it allows benefiting from distributional semantics.

Page 23: scene description

3. “Babytalk: Understanding and generating simple image descriptions”
G. Kulkarni, V. Premraj, V. Ordonez. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 12, December 2013.
Proposed: Uses detectors for object and scene detection and forms quadruplets: (Nouns-Verbs-Scenes-Prepositions).
Dataset: PASCAL 2008 images.
Conclusion: Human-forced-choice experiments demonstrate the quality of the generated sentences over previous approaches. One key to the success of the system was automatically mining and parsing large text collections to obtain statistical models for visually descriptive language.

4. “Choosing Linguistics over Vision to Describe Images”
Ankush Gupta, Yashaswi Verma, C. V. Jawahar. International Institute of Information Technology, Hyderabad, India - 500032.
Proposed: The problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions.
Dataset: PASCAL dataset.
Conclusion: They proposed a novel approach for generating relevant, fluent and human-like descriptions for images without relying on any object detectors, classifiers, hand-written rules or heuristics.

Page 24: scene description

THEIR APPROACH

3.2

Page 25: scene description

1.) Choosing Linguistics over Vision to Describe Images

i. Given an unseen image,

ii. find the K images most similar to it among the training images, and, using the phrases extracted from their descriptions,

iii. generate a ranked list of triples, which is then used to compose a description for the new image.
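The retrieval step above can be sketched as a nearest-neighbour search in image feature space. This is a minimal illustration, assuming images are already represented as fixed-length feature vectors; the feature values below are hypothetical toy data, not from any real system.

```python
import numpy as np

def k_nearest_images(query_feat, train_feats, k=3):
    """Return indices of the k training images closest to the query
    in feature space (Euclidean distance)."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]

# Hypothetical 4-dimensional feature vectors for 5 training images
train_feats = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.1],
    [0.0, 0.9, 0.8, 0.1],
    [0.1, 0.8, 0.9, 0.0],
    [0.5, 0.5, 0.5, 0.5],
])
query = np.array([0.85, 0.15, 0.05, 0.15])
idx = k_nearest_images(query, train_feats, k=2)  # the two closest images
```

The phrases attached to the images at `idx` would then feed the triple-ranking step.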

Page 26: scene description

i.) Input image
ii.) Neighboring images with extracted phrases
iii.) Triple selection and sentence generation

Page 27: scene description

FAILURE SCENARIO

A motor racer is speeding through a splash mud.

A water cow is grazing along a roadside.

An orange fixture is hanging in a messy kitchen.

Page 28: scene description

OUR APPROACH

4

Page 29: scene description

USING

OPENCV, NLP & SVM

4.1

Page 30: scene description

OpenCV

• Open source computer vision and machine learning software library.

• More than 2500 optimized algorithms.

• C++, C, Python, Java and MATLAB interfaces

• Supports Windows, Linux, Android and Mac OS

Page 31: scene description

NLP (Natural Language Processing)

It is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
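On the NLP side, once the vision stage yields a triplet, a sentence must be realized from it. The sketch below is a toy template-based surface realizer, not the deck's actual method; the function name and inputs are hypothetical, and real systems use grammars or language models.

```python
def describe(subject, verb, obj, scene=None):
    """Compose a simple English sentence from a detected triplet.
    (A toy surface realizer; assumes singular, countable nouns.)"""
    sentence = f"A {subject} is {verb} a {obj}"
    if scene:
        sentence += f" in the {scene}"
    return sentence + "."

caption = describe("dog", "chasing", "ball", scene="park")
# caption == "A dog is chasing a ball in the park."
```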

Page 32: scene description

SVM (Support Vector Machines)

A discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
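To make the hyperplane idea concrete, here is a minimal linear SVM trained by sub-gradient descent on the hinge loss. This is an illustrative sketch with made-up toy data, not the system's classifier; production code would use an optimized library implementation.

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Sub-gradient descent on the regularized hinge loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:           # inside the margin: push the hyperplane
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                    # correctly classified: only regularize
                w -= lr * lam * w
    return w, b

# Toy linearly separable data (hypothetical, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)  # new points are categorized by the hyperplane's side
```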

Page 33: scene description

SYSTEM FLOW

4.2

Page 34: scene description

1. Take the query image as input.

2. Detect objects in the query image.

3. Search corpus data and extract the shortest sentences.

4. RDF (Resource Description Framework) parser: produce triples of the form <object1, predicate1, object2>.

5. Google Image API: retrieve the top 10 images.

6. Match the query image against each retrieved image and compute a score.

7. The highest-scoring images give our matching triplet.
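The triple-extraction and scoring steps of the flow above can be sketched as follows. This is a deliberately naive placeholder (a real RDF parser and image matcher would replace both functions), and the sentences and object labels are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

def sentence_to_triple(sentence):
    """Naive subject-verb-object split, standing in for a real RDF parser."""
    words = sentence.lower().strip(".").split()
    if len(words) < 3:
        return None
    return Triple(words[0], words[1], words[-1])

def score_match(query_objects, triple):
    """Score a triple by how many detected objects it mentions."""
    fields = {triple.subject, triple.obj}
    return len(set(query_objects) & fields)

triples = [sentence_to_triple(s) for s in ["dog chases ball", "cat sits mat"]]
best = max(triples, key=lambda t: score_match(["dog", "ball"], t))
```

The highest-scoring triple (here, the one mentioning both detected objects) would then be verbalized as the final description.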

Page 35: scene description

EXPERIMENTAL SETUP

4.3

Page 36: scene description

DATA SET

PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning)

It provides standardized image data sets for object class recognition

Technology

JAVA/PYTHON

Page 37: scene description

5. CONCLUSION

Thus we saw the fundamentals of scene description, its applications, previous work in this field, and our approach for designing this system.

Page 38: scene description

THANK YOU!