FROM IMAGES TO SENTENCES
SCENE DESCRIPTION
KHUSHALI ACHARYA (1517MECE30008)
AGENDA
1 Introduction
2 Motivation
3 Related research work
4 Our Approach
5 Conclusion
1. INTRODUCTION TO SCENE DESCRIPTION
WHAT IS IT?
"Interpreting images and generating sentences."
"Scene interpretation means understanding everyday occurrences or recognizing rare events."
"Scene interpretations are controlled hallucinations."
WHY DO WE NEED SUCH SYSTEMS?
Isn't a picture enough to depict things clearly?
Ever imagined a cricket match without commentary, or a movie without dialogue?
It has been estimated that more than 80% of the activities we do online are text-based.
A layman cannot understand a medical report unless the doctor explains it or it is provided in written form.
A tagged location becomes clear when the place name is described.
Thus the description of an image adds an interestingness measure to it.
2. MOTIVATION TO THE PROBLEM
2.1 WHAT ARE THE REAL-TIME APPLICATIONS?
Some applications of scene description:
• Self-aware cognitive robots
• Assisting visually impaired people
• Soccer game analysis
• Image search/retrieval systems
• Street traffic observation
• Criminal act recognition
• The agricultural sector
2.2 HOW IS IT HELPFUL TO SOCIETY?
Assists Visually Impaired People
Screen Reader
Screen readers are software programs that convert text into synthesized speech, allowing blind people to listen to web content.
LIMITATIONS:
• Screen readers cannot describe images.
• Screen readers cannot survey the entirety of a web page as a visual user might; they cannot always intelligently skip extraneous content such as advertisements or navigation bars.
Scene description can help blind people overcome these limitations.
Criminal Act Recognition
Images are captured and unusual activities are recorded. The extracted image features then help in crime investigation.
Self-Aware Cognitive Robots
Efficient and consistent scene interpretation is a prerequisite for self-aware cognitive robots to work, involving:
• Human-computer interaction
• Object recognition and scene interpretation
• Spatial relation extraction
3. RELATED EXISTING RESEARCH WORK
3.1 RESEARCH PAPERS
Sr. No. 1
Paper: "Midge: Generating image descriptions from computer vision detections." U. of Aberdeen, Oregon Health and Science University, Stony Brook University, U. of Maryland, Columbia University, U. of Washington, MIT.
Proposed: A novel generation system that composes human-like descriptions of images from computer vision detections.
Dataset: For training: 700,000 (Flickr, 2011) images with associated descriptions from the dataset in Ordonez et al. (2011). For evaluation: 840 PASCAL images.
Conclusion: Midge generates a well-formed description of an image by filtering attribute detections that are unlikely and placing objects into an ordered syntactic structure.

Sr. No. 2
Paper: "Every picture tells a story: Generating sentences from images." Farhadi, A., Hejrati, S. M. M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. A. (2010). Springer.
Proposed: Attempts to generate sentences by first learning from a set of human-annotated examples, producing the same sentence if both images and sentences share common properties in terms of their triplets: (Noun, Verb, Scene).
Dataset: PASCAL 2008 images with human annotations.
Conclusion: Sentences are rich, compact and subtle representations of information. Even so, good sentences can be predicted for images that people like. The intermediate meaning representation is one key component of the model, as it allows benefiting from distributional semantics.

Sr. No. 3
Paper: "Babytalk: Understanding and generating simple image descriptions." G. Kulkarni, V. Premraj, V. Ordonez. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 12, December 2013.
Proposed: Uses detectors for object and scene detection and builds quadruplets: (Noun, Verb, Scene, Preposition).
Dataset: PASCAL 2008 images.
Conclusion: Human forced-choice experiments demonstrate the quality of the generated sentences over previous approaches. One key to the success of the system was automatically mining and parsing large text collections to obtain statistical models for visually descriptive language.

Sr. No. 4
Paper: "Choosing Linguistics over Vision to Describe Images." Ankush Gupta, Yashaswi Verma, C. V. Jawahar. International Institute of Information Technology, Hyderabad, India – 500032.
Proposed: The problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions.
Dataset: PASCAL dataset.
Conclusion: A novel approach for generating relevant, fluent and human-like descriptions for images without relying on any object detectors, classifiers, hand-written rules or heuristics.
3.2 THEIR APPROACH
1.) Choosing Linguistics over Vision to Describe Images
i. Given an unseen image,
ii. find the K images most similar to it among the training images, and, using the phrases extracted from their descriptions,
iii. generate a ranked list of triples, which is then used to compose a description for the new image.
Figure: (i) input image; (ii) neighbouring images with extracted phrases; (iii) triple selection and sentence generation.
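The retrieve-and-rank idea in steps i–iii can be sketched in a few lines of Python. This is a toy stand-in, not the paper's implementation: the feature vectors, the tiny training set, and the frequency-based vote below are illustrative assumptions replacing the real image features and phrase extraction.

```python
from collections import Counter
import math

def cosine(a, b):
    # cosine similarity between two feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_triples(query_feat, training_set, k=2):
    """Find the k training images nearest to the query and rank the
    (noun, verb, scene) triples harvested from their descriptions by
    how often they occur among the neighbours."""
    neighbours = sorted(training_set,
                        key=lambda item: cosine(query_feat, item["feat"]),
                        reverse=True)[:k]
    votes = Counter()
    for item in neighbours:
        votes.update(item["triples"])
    return [t for t, _ in votes.most_common()]

# Toy training set: "feat" holds hypothetical image descriptors.
training = [
    {"feat": [1.0, 0.0], "triples": [("dog", "run", "park")]},
    {"feat": [0.9, 0.1], "triples": [("dog", "play", "park")]},
    {"feat": [0.0, 1.0], "triples": [("boat", "sail", "sea")]},
]
print(rank_triples([1.0, 0.05], training, k=2))
```

A query near the two "dog" images yields dog triples first; the real system composes the final sentence from the top-ranked triple.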
FAILURE SCENARIO
A motor racer is speeding through a splash mud.
A water cow is grazing along a roadside.
An orange fixture is hanging in a messy kitchen.
4. OUR APPROACH
4.1 USING OPENCV, NLP & SVM
OpenCV
• Open-source computer vision and machine learning software library.
• More than 2,500 optimized algorithms.
• C++, C, Python, Java and MATLAB interfaces.
• Supports Windows, Linux, Android and macOS.
NLP (Natural Language Processing)
A field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
SVM (Support Vector Machines)
A discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
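To make the separating-hyperplane idea concrete, here is a minimal linear SVM trained with a Pegasos-style sub-gradient method on toy 2-D data. The data, regularization constant and epoch count are illustrative choices, not part of the slides; a practical system would use a tuned library implementation.

```python
def train_linear_svm(data, labels, lam=0.1, epochs=200):
    """Minimal linear SVM: hinge loss plus L2 regularisation,
    optimised with the Pegasos sub-gradient update."""
    dim = len(data[0])
    w = [0.0] * dim
    b = 0.0
    t = 0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point violates the margin: move toward it
                w = [(1 - eta * lam) * wi + eta * y * xi
                     for wi, xi in zip(w, x)]
                b += eta * y
            else:           # only shrink w (regularisation step)
                w = [(1 - eta * lam) * wi for wi in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: two clusters in 2-D.
X = [[2.0, 2.0], [1.5, 2.5], [2.5, 1.5],
     [-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.5]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

The learned hyperplane separates the two clusters, and `predict` classifies new points by which side they fall on.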
4.2 SYSTEM FLOW
1. Take the query image as input.
2. Detect objects in the query image.
3. Search the corpus data and extract the shortest sentences describing those objects.
4. Run the sentences through an RDF (Resource Description Framework) parser to obtain triples of the form <object1, predicate1, object2>.
5. For each triple, retrieve the top 10 images through the Google Image API.
6. Match the query image against each retrieved image and compute a score.
7. The triple whose retrieved images score highest is the matching triplet.
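The flow above can be sketched end to end with hypothetical stand-ins: detection, corpus search and image matching are stubbed out (a dict of labels, a hand-made triple list, and a simple overlap score) where the real system would use OpenCV detectors, a text corpus, an RDF parser and the Google Image API.

```python
def detect_objects(image):
    # Stand-in detector: the "image" is just a dict carrying its labels.
    return image["objects"]

def extract_triples(objects, corpus):
    # Keep corpus triples whose subject and object were both detected,
    # shortest first -- mimicking "extract shortest sentences" + RDF parse.
    candidates = [t for t in corpus if t[0] in objects and t[2] in objects]
    return sorted(candidates, key=lambda t: len(" ".join(t)))

def score_triple(triple, query_objects):
    # Stand-in for matching retrieved images against the query image:
    # score = how many triple arguments appear among the detected objects.
    return sum(arg in query_objects for arg in (triple[0], triple[2]))

# Hand-made <object1, predicate1, object2> corpus triples.
corpus = [
    ("dog", "chases", "ball"),
    ("dog", "sits on", "grass"),
    ("cat", "sleeps on", "sofa"),
]
query = {"objects": {"dog", "ball"}}
objs = detect_objects(query)
ranked = extract_triples(objs, corpus)
best = max(ranked, key=lambda t: score_triple(t, objs))
print(f"A {best[0]} {best[1]} a {best[2]}.")
```

The highest-scoring triple is then rendered as the output sentence, just as step 7 selects the matching triplet.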
4.3 EXPERIMENTAL SETUP
DATASET
PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning): provides standardized image datasets for object-class recognition.
TECHNOLOGY
Java/Python
5. CONCLUSION
Thus we saw the fundamentals of scene description, its applications, previous work in this field, and our approach for designing such a system.
THANK YOU!