
© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 1

Multimedia Information Retrieval

Stephane Marchand-Maillet Viper group

University of Geneva Switzerland

http://viper.unige.ch

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 2

• What is your background?

– Computer Science

– Scientific

– Humanities

• This lecture introduces intuition from examples

– More technical details in recommended reading

Quick get-to-know

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 3

• To introduce (recall, rehearse) some key principles and techniques of IR

• To show how these are adapted to Multimedia

• To look at the Multimedia context

• To see how best to face it

Objectives of the lecture

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 4

Outline

• Motivation and context

• Preliminary material on IR

• Audio-visual IR

• Complements

• Evaluation


Note: Several illustrations from within these slides have been borrowed from the Web, including Wikipedia or teaching material. Please do not reproduce without permission from the respective authors. When in doubt, don't.

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 5

Data Production

• Growth of data

– 1,250 billion GB (≈1.2 ZB) of data generated in 2010

– Data generation growing at a rate of 58% per year

• Baraniuk, R., "More is Less: Signal Processing and the Data Deluge", Science, V331, 2011.

1 exabyte (EB) = 1,073,741,824 gigabytes

[Chart: Data Generation Growth — data size (EB) per year, 2010–2014]

http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html

http://www.ritholtz.com/blog/2011/12/financial-industry-interconnectedness/

[Figure labels: Internet, Scientific, Industry data. By Sverre Jarp, Felix Schürmann]

© Copyright attached

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 6

A digital world

[From http://1000memories.com/blog/94-number-of-photos-ever-taken-digital-and-analog-in-shoebox]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 7

[Picture from: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html]

Data communication

© Copyright attached

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 8

User “productivity”

[Picture from: http://www.go-gulf.com/wp-content/themes/go-gulf/blog/online-time.jpg - Feb 2012]

© Copyright attached

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 9

Motivation

• Decision making requires informed choices

• Data production ever easier

• The information is often overwhelming – « Big Data » trend

• The information is often not easy to manage and access

We need to bring structure to the raw data

• Document (data) representation

• Similarity measurements

• Further analysis: data mining, information retrieval, learning

© Copyright attached

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 10

Components of Data Quality

• Validity

• Integrity

• Completeness

• Consistency

• Relevance

• Accuracy

and…

• Timeliness

• Interpretability

• Reliability

• …

Applies to all media

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 11

• Searching for multimedia – Is it like searching text?

• Information retrieval model

• Representing documents – BoW and TF*IDF

• Indexing model – Inverted files

– hashing

Preliminary

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 12

Searching for multimedia

Basic solution:

• Attach text to multimedia and search for text

– Tags, keywords

– Description

“Multimedia Annotation”

– Domain of “Knowledge Management”

– Requires the definition of Ontologies, Taxonomies,…

Human in the loop

– Not scalable, not feasible

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 13

[Picture from: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html]

Per minute rate (some years ago!)

© Copyright attached

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 14

• From the sensing device – Device type, model – Sensing conditions (flash/no flash) – Author (device owner?) – …

• From the environment – Time, date, location – Environment conditions (inside/outside, weather) – Spoken languages – …

• More? From the content? – Later in this lecture

Note: Can never be exhaustive

Auto-annotation

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 15

• "Wasp on a yellow flower"
• Specific (Latin) name of the flower? Of the insect?
• What is the insect exactly doing?

• Weather looks nice: "countryside scene"?
• The picture may be arty:

"2015 1st prize college photo competition winner"

Inference cannot resolve everything
• "The wasp who scared my daughter during our holiday in Gruyère"

Annotation

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 16

• Exif: standard picture annotation container
• ID3 tag: standard audio information
• NTFS: Windows-based metadata exchange
• JPSearch, MPEG-7: search-oriented description standards

• Every format has a text header into which "comments" can be inserted (PNG, GIF, JPG, TIFF,…)

Complementarity of text-based / content-based
• The other category is recommender-based

– "social" annotation

Annotating multimedia

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 17

MPEG-7 Still Image Description

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 18

IBM MPEG-7 Annotation Tool (video/visual stream) [http://www.research.ibm.com/VideoAnnEx/]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 19

• Given a query (information need)
• Given a repository (information source)
Infer the most relevant documents from the repository as an answer to the query

The most used query type is Query-by-Example: "I want something like this"

• Descriptive query
• Free-text (Google-like) query
• Example document(s)

• Relevance is similarity
• Similarity is transferred to "closeness"

More like this…

Information Retrieval paradigm

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 20

• IR tells us that we do not (really) need a complete, absolute understanding of documents to respond to queries; we (just) need to be able to compare two documents

• We merely need appropriate distance measurements – to infer similarity – based on document features (a space) – and a notion of vicinity (a distance)

Everything is about inspecting neighborhoods – provided the distance is semantically relevant

Information Retrieval
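To make the neighborhood idea concrete, here is a minimal sketch (the toy vectors, names and the Euclidean-distance choice are illustrative assumptions, not from the slides): retrieval reduces to sorting the repository by distance to the query in feature space.

```python
import math

def euclidean(a, b):
    # Distance in feature space stands in for (dis)similarity.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_vec, repository, k=3):
    # Rank all documents by closeness to the query vector.
    ranked = sorted(repository.items(), key=lambda kv: euclidean(query_vec, kv[1]))
    return ranked[:k]

# Toy repository: document id -> feature vector (illustrative values).
repo = {"d1": [1.0, 0.0], "d2": [0.9, 0.2], "d3": [0.0, 1.0]}
print(retrieve([1.0, 0.1], repo, k=2))  # d1 and d2 come out as the neighbours
```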

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 21

Information management process

Raw documents

Representation space (visualisation)

Document features

User interaction

Feature extraction

“Appropriate” mapping

“Decision” process

Query

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 22

Distance-based search

Representation space (visualisation)

Query

Relevance list (sorted by distance

to the query)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 23

Example: text

Text documents

Feature extraction

“Appropriate” mapping

User interaction

“Decision” process

“Word” occurrences

Query

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 24

Also...

• Any type of media: webpage, audio, video, data,...

• Objects, based on their characteristics

• People in social networks

• Concepts: processes, states, etc.

Anything for which “characteristics” may be measured

This is what this lecture is about

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 25

• Raw data (the documents) carries information
+ Computers essentially perform additions
We need to represent the data in a way that gives the computer information that is as faithful as possible

• The representation is an opportunity for us to

transfer some prior (domain) knowledge as design assumptions

If this (data modelling) step is flawed, the computer will work with random information

Representation spaces (intuition)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 26

From documents to features

• Features help characterize one document (summary)

• Features help compare two documents (similarity)

• Features help structure a collection (visualization)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 27

Comparing two documents (occurrences)

"Documents" (repository) × "Terms" (vocabulary):

D1: 2 1 1 2 1 0 0
D2: 1 2 0 1 1 1 2
D3: 1 2 0 1 1 1 2
D4: 1 1 0 1 1 0 0

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 28

Comparing two documents (presence)

D1: Y Y Y Y Y N N
D2: Y Y N N Y Y Y
D3: Y Y N N Y Y Y
D4: Y Y N Y Y N N

Similarity: Y+Y = 1; Distance: Y+N = 1

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 29

Comparing two documents (presence)

D1:    Y Y Y Y Y N N
D2:    Y Y N Y Y Y Y
match: 1 1 0 1 1 0 0

Similarity = 4 (Distance = 3)
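A minimal sketch of this presence-based comparison, reusing the two Y/N rows above (the helper names are ours):

```python
D1 = "YYYYYNN"
D2 = "YYNYYYY"

def presence_similarity(a, b):
    # Count positions where both documents contain the term (Y,Y pairs).
    return sum(1 for x, y in zip(a, b) if x == y == "Y")

def presence_distance(a, b):
    # Count positions where exactly one of the two documents contains the term.
    return sum(1 for x, y in zip(a, b) if x != y)

print(presence_similarity(D1, D2))  # 4
print(presence_distance(D1, D2))    # 3
```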

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 30

Comparing two documents

[Figure: pairwise similarity S (and distance D) between the four documents, shown as S(D): 6 (0), 4 (1), 3 (2), 3 (2), 3 (4), 3 (4)]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 31

Comparing two documents, S(D)

D1: Y Y Y Y Y N N → 1 (6)
D2: Y Y N N Y Y Y → 2 (4)
D3: Y Y N N Y Y Y → 2 (4)
D4: Y Y N Y Y N N → 0 (7)
Q:  N N Y N N Y Y

Nota: the query is seen as a document (Query-by-Example)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 32

Query-based ranking

• Since similarity is computed only on (Y,Y) pairs, only the terms shared with the query count towards the similarity

Every document having no term in common with the query scores 0
(i.e. terms not in the query are ignored)

Resulting ranking w.r.t. Q: D2: 2 (4), D3: 2 (4), D1: 1 (6), D4: 0 (7)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 33

Computing the ranks

Naïve solution (fill the array):

• For every document, for every term, score 1 if the term is both in the query and the document

→ N (documents) × M (terms) comparisons

Smarter solution:

• For every term of the query (few), find the documents containing this term and work from there (all the rest score 0)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 34

Information indexing

Goal:

to organize the data so as to

facilitate its access (read or write)

• Fast access for computation

• Fast access with selection criteria

• …

Baseline: exhaustive search, sequential scan (unordered list)

Strategy: enable “direct access”

Simple example:

• Address book: sorted list with first-letter shortcut

– Smith → "S" → search the 19th sublist (ignore the 25 other sublists)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 35

Indexing structures

… M-trees, tries, suffix arrays, suffix trees, inverted files, LSH, …

Illustration: Wikipedia

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 36

Inverted file

Term → documents containing it (terms t1…t7 of the earlier example):

t1 → D1, D2, D3, D4
t2 → D1, D2, D3, D4
t3 → D1
t4 → D1, D4
t5 → D1, D2, D3, D4
t6 → D2, D3
t7 → D2, D3

"Find documents having this term": arrange documents by the terms they contain

Number of comparisons ~ number of query terms << N*M

Query terms (t3, t6, t7) → (D1), (D2, D3), (D2, D3) → D1, D2, D2, D3, D3

D4 is never considered! (we already know that its score is 0)
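A minimal sketch of this lookup, with the postings of the example above (the term labels t1…t7 are the inferred ones used in the table):

```python
from collections import Counter

# Postings: term -> documents containing it.
inverted = {
    "t1": ["D1", "D2", "D3", "D4"],
    "t2": ["D1", "D2", "D3", "D4"],
    "t3": ["D1"],
    "t4": ["D1", "D4"],
    "t5": ["D1", "D2", "D3", "D4"],
    "t6": ["D2", "D3"],
    "t7": ["D2", "D3"],
}

def score(query_terms):
    # Only documents sharing at least one query term are ever touched;
    # work is proportional to the query's posting lists, not to N*M.
    hits = Counter()
    for t in query_terms:
        for doc in inverted.get(t, []):
            hits[doc] += 1
    return hits.most_common()

print(score(["t3", "t6", "t7"]))  # [('D2', 2), ('D3', 2), ('D1', 1)] — D4 never considered
```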

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 37

Principles and lessons

• We need to be able to compare two documents

• Comparison is based on document features

• Similarity may be based on feature occurrence

• Only features that occur in both documents count when building similarity

Feature occurrence allows the use of inverted files, making search cost largely independent of the size of the repository

Feature occurrence loses a lot of the structure in the document

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 38

In practice (text IR)

• Document features are normalised word (stem) occurrences (Bag of Words model)

– Vocabulary ~ 10’000 stems

– Billions of documents

• Term weighting scheme (TF*IDF) accounts for feature statistics (better than Y/N occurrence)

• Similarity is based on cosine distance

• Inverted files include term statistics

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 39

BoW: Assumptions

• A text is reduced to the base vocabulary it uses
• Example:

"Signals from a particle collider near Geneva suggested in September that scientists might have sighted a long-sought particle thought to be the source of mass itself."

• Becomes:

a (×2), be, collide, from, geneva, have, in, itself, long-sought, mass, may, near, of, particle (×2), scientist, see, september, signal, source, suggest, that, the, think, to

The ultimate goal is to define similarity as:

Two documents are similar if they share the same vocabulary
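A minimal sketch of the reduction step (naive lowercasing and regex splitting stand in for the tokenisation and stemming a real system would use):

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase, strip punctuation, split: a crude stand-in for
    # tokenisation + stemming (note it also splits "long-sought").
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

text = ("Signals from a particle collider near Geneva suggested in September "
        "that scientists might have sighted a long-sought particle ...")
print(bag_of_words(text))  # Counter({'a': 2, 'particle': 2, 'signals': 1, ...})
```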

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 40

BoW: Limitations

• The structure is lost
– Title, sections, paragraphs, sentences, negations

• The representation is incomplete…
– if all words appear only once and there is no extra information such as
• grammatical type
• priors on terms (recent news, unused terms,…)
– and writing well is often about using a rich vocabulary
• use of synonyms

• …or ambiguous
– if a word appears many times, but because of polysemy
– E.g. "mouse":
• a part on mice as pets
• a part on computer mice
• a part on Mickey Mouse
– this does not make the text about "mouse", but maybe just about "kid amusements"

… still works very well in practice

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 41

A document is represented by the occurrences of the terms of the vocabulary

• Earlier example: Y/N occurrence → the number of occurrences does not matter

• Counting term occurrences → favors longer documents (more occurrences) → document-length normalization (frequency)

• Not all frequent terms are relevant → balance between representation (TF) and discriminative power (IDF): how good is my query term for this document?

A bit more details

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 42

• Term frequency (TF)
– Percentage of space taken by each term in the document: "how much this term represents the document"
High TF → frequent term in the document → potential representative term (keyword)

• Inverse document frequency (IDF)
– Inverse of the percentage of space taken by each term in the collection: "how much this term discriminates between documents"
High DF → low IDF → term frequent in all documents → not discriminant (all documents would be relevant)
E.g.: the, at, and, she, he, her… plus thematic terms (patient, disease,… in a medical collection)

Good terms are terms with high TF and high IDF

TF*IDF model: intuition

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 43

TF*IDF term weight

                               | High IDF (infrequent in collection)    | Low IDF (frequent in collection)
High TF (frequent in document) | "Keyword": good representative         | "Stop word": carries no meaning,
                               | (verb, noun,…)                         | "structural" (the, at, and, she,…) — removed from vocabulary / ignored
Low TF (infrequent in document)| Rare term (automatically ignored)      | Low representation (automatically ignored)

When computing the difference between documents, terms are weighted by their TF*IDF factor
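A minimal sketch of this weighting, with the cosine comparison mentioned on the "In practice" slide (one simple TF = term share, IDF = log(N/df) variant among several in use; the toy corpus and function names are illustrative assumptions):

```python
import math
from collections import Counter

docs = {"d1": "the cat sat on the mat".split(),
        "d2": "the dog sat on the log".split(),
        "d3": "cats and dogs".split()}

N = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(t for words in docs.values() for t in set(words))

def tfidf(words):
    tf = Counter(words)
    # TF = share of the document taken by the term; IDF = log(N / df).
    return {t: (c / len(words)) * math.log(N / df[t]) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(tfidf(docs["d1"]), tfidf(docs["d2"])))
```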

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 44

Inverted file construction

Docs → extraction → sorting → compaction → factorisation → misc. operations (e.g. posting-file compression)

• Extraction: scan the documents and emit one (term, doc ID) pair per term occurrence, e.g. ("the", 1), ("new", 1), ("tool", 1), …, ("blah", N)

• Sorting: order the pairs by term (then by doc ID), e.g. ("blah", 1), ("blah", N), ("new", 1), …, ("tool", 1), ("tool", N-1)

• Compaction: merge duplicate pairs into (term, doc ID, frequency) entries

• Factorisation: split the result into a dictionary (term, total frequency — the collection information, "idf") and postings (doc ID, frequency — the document information, "tf")

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 45

Hashing / fingerprinting (intuition only)

• Associate a key (hash, fingerprint) to each value, via a hash function
– Toy example: each component of a value is coded 1–7 and the key is their sum, e.g. 1+1+2+3+4+4+5 = 20 and 1+2+2+4+5+6+7+7 = 33

• Insert the (key, value) pair into a hash table (here with keys 1, 2, 20, 33, …)

Query → hash function → key → immediate access: immediate retrieval of the query

Difficult constraint (perceptual hashing): the key must not vary if the value changes a bit (e.g. due to noise)
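A minimal sketch of this intuition (the component-sum "hash" mirrors the slide's toy example; it collides easily and is not perceptual):

```python
def toy_hash(components):
    # The slide's toy key: sum the coded components of the value.
    return sum(components)

table = {}
for value in [(1, 1, 2, 3, 4, 4, 5), (1, 2, 2, 4, 5, 6, 7, 7)]:
    # Store the value in the hash table under its key (20 and 33 here).
    table.setdefault(toy_hash(value), []).append(value)

query = (1, 2, 2, 4, 5, 6, 7, 7)
print(table.get(toy_hash(query)))  # immediate access via key 33
```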

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 46

What is multimedia?

Basically our digital life

• Text – Plain text (any language) – Structured text (XML-like, code,…)

• Visual – Images (Photo) – Sketches (drawing, map,..)

• Audio – Music – Speech – Sound

• Misc – 3D object – Video – … (software, playlist,…)

+ Any combination – PDF, Web pages and alike

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 47

Multimedia IR: Use Cases

• Image – Find look-alike pictures, recognize landmark

• Video – Find specific actions

• Music – Find inspirational music, recognize music extract

• Medical – Find similar cases

• Patent – Find similar proposals, plagiarism

Not always doable with text


© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 48

Volume of multimedia data

• 2 billion persons connected to the internet

– Mostly via mobile devices (phones)

• 1.8 billion active mobile phones

• 180 million active web servers

Sources: R. Baeza-Yates – RuSSIR 2010; http://news.netcraft.com/

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 49

Volume of multimedia data

• Web scale:
– Google:

• Early 2004: 4.3 billion indexed pages
• Early 2005: 8 billion indexed pages
• Today: XX billion indexed pages?

– http://www.archive.org
• Growth: 20 TB/month

• Multimedia collections:

– Facebook: 140 billion photos (≈180 years of viewing at 25 fps)

– YouTube (2008): 83.4 million videos

• Curated collections:
– AOL Video library

• 410 million views in October 2011

– Institut National de l'Audiovisuel (INA, France):
• 700,000 h audio (radio)
• 400,000 h video (television)
• 2 million documents

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 50

The new wave

[From http://1000memories.com/blog/94-number-of-photos-ever-taken-digital-and-analog-in-shoebox]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 51

Example: information « loss »

• Annotation: "Wasp on a yellow flower"

• Text search: « insect feeding » → this document will be missed

• The user must know the limitations of the annotation

• The goal is to represent the document
– Exhaustively: no specific interpretation context
– Automatically: no human intervention

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 52

• Image formation model

• Image features

• Indexing visual content

Visual Information Retrieval

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 53

Visual content has specific types

• Photo

• Medical image, satellite image

• Document: fax, scanned text…

• Drawing: sketch, blueprint, cartoon,…

• 3D, shape

Visual content

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 54

Example: Images

Images

Feature extraction

“Appropriate” mapping

“Decision” process

Search Photo collage Filtering

Color histogram

User interaction

Query

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 55

Pixels

RGB Sensor RGB Color model

Any color = combination of RGB values Image = 3 Arrays (RGB) of values between 0 and 255 Pixel = RGB values (color) at a position in the image
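A minimal sketch of a global color histogram computed from such RGB arrays (NumPy assumed; the bin count is an arbitrary choice):

```python
import numpy as np

def global_rgb_histogram(image, bins=8):
    # image: H x W x 3 array of RGB values in 0..255.
    # Concatenate one normalized histogram per channel into a feature vector.
    feats = []
    for c in range(3):
        h, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        feats.append(h / h.sum())
    return np.concatenate(feats)

img = np.random.randint(0, 256, size=(64, 64, 3))  # stand-in for a real photo
print(global_rgb_histogram(img).shape)  # (24,) = 3 channels x 8 bins
```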

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 56

Image processing / Analysis

RGB Histograms

Gray scale (intensity) Edge image

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 57

Low-level image representation

Global Color Histogram data in color spaces:

Global Texture data from Gabor filter banks:

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 58

Corner detection / keypoints

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 59

Local features: SIFT [Lowe 1999] and SURF [Bay et al. 2006]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 60

From data to information

Interpreting the content

• Data is physically measured

• Information is interpreted semantically Semantic gap

• Fusion is a way to reduce the semantic gap

• Examples – Looking at a movie w/o audio

– Listening to a story w/o visual

– Voice conversation vs video conversation

– Disambiguation (eg „jaguar“)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 61

From data to information

Semantic gap: "Discrepancy between the level of analysis of a computer and the level of perception of the same data by a human user."

Level                 | Text      | Visual
----------------------|-----------|--------------------
Interpretation        | Story     | "Action"
Semantic segmentation | Paragraph | Scene
Group                 | Sentence  | Object
Physical segmentation | Word      | Homogeneous region
Physical inspection   | Character | Color, texture

[Vertical axes from the slide: level of interpretation and relevance for indexing grow towards the top; computer precision grows towards the bottom]

Semantic Gap

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 62

Semantic gap

• Exploited by Captchas

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 63

Region-based image understanding

Images are compared w.r.t. the regions (hopefully objects/concepts) they contain

Object detection eg, Face detection

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 64

Object detection / recognition

• Train the machine to recognize specific groups of patterns

Faces [Viola, Jones, 2001]

Objects

Automated image tagging Text search

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 65

Naming faces : Image to text

Specific characteristics of a face are extracted (eg eigenfaces)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 66

Deep Learning

[http://www.slideshare.net/NVIDIA/visual-computing-the-road-ahead-an-nvidia-ces-2015-presentation-deck]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 67

Bag of Visual Words

• Once basic visual features are extracted
– e.g. the description of the surroundings of each corner point (e.g. SIFT)

• basic features are grouped
– e.g. the corners of an object

• and "quantized"
– many examples are gathered into one instance

• to create a "visual dictionary" of visual words

• An image is then a composition (occurrences) of these visual words ("terms")

We can reuse the machinery developed for text – inverted files, etc. (a minimal sketch follows)
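A minimal sketch of the quantization step (random vectors stand in for SIFT descriptors, and a fixed random codebook stands in for a properly trained, e.g. k-means, visual dictionary):

```python
import numpy as np

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(50, 128))    # 50 "visual words" (stand-in for trained centroids)
descriptors = rng.normal(size=(300, 128))  # stand-in for one image's SIFT descriptors

# Assign each descriptor to its nearest visual word ...
dists = np.linalg.norm(descriptors[:, None, :] - dictionary[None, :, :], axis=2)
words = dists.argmin(axis=1)

# ... and the image becomes a histogram of visual-word occurrences,
# directly usable with the text machinery (inverted files, TF*IDF).
bovw = np.bincount(words, minlength=len(dictionary))
print(bovw.shape, bovw.sum())  # (50,) 300
```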

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 68

Comparing two documents (occurrences)

"Documents" (repository) × "Terms" (vocabulary):

D1: 2 1 1 2 1 0 0
D2: 1 2 0 1 1 1 2
D3: 1 2 0 1 1 1 2
D4: 1 1 0 1 1 0 0

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 69

Hyperbrowser [S. Craver, 1999]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 70

Hyperbrowser [S. Craver, 1999]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 71

Digital shoebox

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 72

Digital shoebox

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 73

Collection mining

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 74

• Modeling sound

• Audio information

– Speech, music, sound

• Searching for audio

Audio IR

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 75

Example: Audio

Audio

Feature extraction

“Appropriate” mapping

“Decision” process

Playlist Filtering

User interaction

Query

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 76

Sound characteristics

Sound corresponds to waves perturbing the environment

Wave characteristics:

– Frequency/period: 1 Hertz (Hz) = 1 vibration/second; related notion of pitch

– Intensity: power received per unit area,
  I = Power / Area = Energy / (Time × Area)   [W/m^2]

• The human ear has a logarithmic scale → decibel (dB) scale:
0 dB = 10^-12 W/m^2 (threshold of hearing)
10 dB = 10^-11 W/m^2
…
x × 10 dB = 10^-(12-x) W/m^2
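A minimal sketch of the dB conversion above:

```python
import math

I0 = 1e-12  # W/m^2, threshold of hearing (0 dB)

def intensity_to_db(intensity):
    # Logarithmic perception: +10 dB per factor of 10 in intensity.
    return 10 * math.log10(intensity / I0)

print(intensity_to_db(1e-12))  # 0.0 dB
print(intensity_to_db(1e-11))  # 10.0 dB
print(intensity_to_db(1e-4))   # 80.0 dB
```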

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 77

Sound perception

Airwave collection → mapping to mechanical waves → mapping to compressional waves → mapping to electrical signal

0 dB = 10^-12 W/m^2: threshold of hearing
160 dB = 10^4 W/m^2: instant perforation of the eardrum

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 78

Sound perception

[Figure: hearing ranges on a frequency axis (Hz), from infrasound through audible sound to ultrasound: Human 20–20k, Dog 50–45k, Cat 50–85k, Bats up to ~120k, Dolphins up to 200k, Elephant from 5 Hz]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 79

Audio and Sound

[Figure: audible sounds plotted by frequency (Hz, roughly 20–20k) against intensity (dB, 0–120); speech occupies a central sub-region; below the curve sounds are non-audible, around 120 dB they become painful. Partly adapted from [Schäuble, 1997]]

Typical intensity levels: quiet library, soft whisper; quiet office, living room; light traffic at a distance, gentle breeze; conversation, sewing machine; busy traffic, noisy restaurant; subway, heavy city traffic, alarm clock at 2 feet, factory noise; truck traffic, noisy home appliances, shop tools, lawnmower; chain saw, pneumatic drill; rock concert in front of speakers, thunderclap; (140) gunshot blast, jet plane; (180) rocket launching pad

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 80

Audio information

• Speech
– Male/female speech → signal-based speech characterisation
– Automatic speech recognition (ASR) → ASR (text) retrieval

• Music
– Genre recognition (jazz, classical,…), excerpt retrieval, query by humming, …

• Sound
– Cars, water, people, animals,…
– Often useful in relation to the visual stream (video)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 81

Audio information

[Figure: sound sources arranged by information content (low → high) and origin (natural / man-made): wind and water (natural, low information); animal sounds (natural, high); urban noise, machines and engines (man-made, low); speech and music (man-made, high)]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 82

Audio features

• Low-level-audio representation – Spectrogram

– MFCC

• Mid-level representations – MIDI strings

– LPC

• High-level representation – Automatic speech recognition (ASR)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 83

Audio information indexing

[Diagram: audio splits into sound, music and speech, all processed signal-based. Speech → ASR / wordspotting → transcripts, speaker info → translation. Music → piece/genre recognition, classification → score, MIDI. Combined with the visual stream in A/V processing]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 84

Speech

Speech production model

Speech perception model

Note: In vision, mostly only perception models

Speech

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 85

Speech production

Human apparatus Speech production scenarios

1. Voiced sound (vocal tract) 2. Fricative sound (‘S’) 3. Plosive sound (‘P’)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 86

Speech production modeling

Mechanical model Human apparatus

Multistage process 1. Build up

pressure 2. Release 3. Return to first

state

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 87

Speech signal modeling

Given a speech signal w(t), let S(ω) be the STFT of w. The model assumes H(ω) is the vocal-tract transfer function and E(ω) the excitation:

|S(ω)|² = |H(ω)|² · |E(ω)|²

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 88

Note: MP3 coding

Advanced coding model

MP3 = MPEG 1 Audio layer 3

See previous slides

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 89

Sony music browser

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 90

Islands of music [http://www.oefai.at/~elias/music]

Organisation of music archives

• Based on Kohonen self-organising maps (SOM)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 91

ThemeFinder [http://www.themefinder.org/]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 92

• Design a nice perceptual hash function for audio (based on perceptual features)

• Slice an audio document (music piece) into small chunks (large enough to be "unique")

• For each chunk, compute a key through the hash function and store the music piece ID (title) as the value

→ Music query by example

→ Retrieval by a noisy chunk

Audio fingerprinting (a minimal sketch follows)
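A minimal sketch of the chunk-hash lookup (the coarse quantization below is a crude stand-in for a real perceptual hash over spectral features):

```python
def perceptual_hash(chunk):
    # Crude stand-in: coarsely quantize the chunk so that small noise
    # maps to the same key (the "robust to small changes" constraint).
    return tuple(round(sample, 1) for sample in chunk)

index = {}  # key -> music piece ID

def add_piece(piece_id, signal, chunk_size=4):
    # Slice the signal into chunks and index each chunk's key.
    for i in range(0, len(signal) - chunk_size + 1, chunk_size):
        index[perceptual_hash(signal[i:i + chunk_size])] = piece_id

add_piece("song-42", [0.10, 0.52, 0.33, 0.91, 0.12, 0.07, 0.88, 0.45])

noisy_query = [0.11, 0.53, 0.31, 0.93]          # first chunk plus a bit of noise
print(index.get(perceptual_hash(noisy_query)))  # 'song-42'
```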

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 93

• Video type

– Movies

– Amateur souvenir film

– Lectures

– Youtube (short, static)

• Combining individual media (fusion)

– each brings some information…

• Going further: social multimedia…

Multimedia IR

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 94

From data to information

Interpreting the content

• Data is physically measured

• Information is interpreted semantically Semantic gap

• Fusion is a way to reduce the semantic gap

• Examples – Looking at a movie w/o audio

– Listening to a story w/o visual

– Voice conversation vs video conversation

– Disambiguation (eg „jaguar“)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 95

Multimodal video features

Over temporal modalities:

• Segmentation (boundary) cues

• Categorization (region) cues

[Diagram: the audio, visual and text streams feed multimodal boundary-based cues (e.g. visual syntactic homogeneity) and semantic region-based cues (e.g. a word relatedness measure)]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 96

Adding context to interpretation

Video Story Segmentation:

Using fusion between the visual and audio modalities, high-level segmentation may be achieved

“shot” keyframe

Audio transcript

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 97

Main goals of fusion

Multimodal information fusion aims at jointly interpreting multiple sources of information representing the same underlying "concept"

The main goal is the extraction of information

By fusing information, one aims at:

• Being more accurate in the discovery of the "concept"
– Each individual stream may be incomplete

• Being more robust in the discovery of the "concept"
– Each individual stream may be distorted (e.g., noisy)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 98

Multimodal fusion

• From many sources of information and context, how do we best "interpret" the data?

[Figure: from data to interpretation via representations: color, texture, shape, face, local features, vector space model; example annotation: Port;;Naval and River;;For Transportation;;Structure;;Architecture;;Vegetable Garden;;Landscape (Farmland - Countryside);;Landscape;;Landscape (Seascape);;Landscape;;Panorama;;View of the City;;View]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 99

Levels of fusion

How do we organise fusion, starting from classical "unimodal" interpretation?

Unique source: raw data → feature extraction → recognizer (with "knowledge") → output

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 100

Levels of fusion

Multiple sources: several streams of raw data → … → one unique output (recognizer → decision)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 101

Early fusion strategy

• Acts over features

• All modalities are "concatenated into one"

• Only one decision is taken, over the concatenated input

[Diagram: multiple sources of raw data → feature fusion → recognizer → decision → one unique output]

Example: multimodal medical image analysis
• Sources: modalities (X-ray, PET,…)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 102

Intermediate fusion strategy

• Each source acts as an input for a specific decision

• All recognizers are coupled in some way to form a meta-recognizer

• One decision is taken

[Diagram: multiple sources of raw data → one recognizer per source → meta-recognizer (fusion) → decision → one unique output]

Example: video analysis
• Source: A/V streams

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 103

Late fusion strategy

• Each source is processed individually by a specific recognizer

• Multiple independent initial decisions are taken, possibly associated with confidence scores

• A final decision is taken based on this output

[Diagram: multiple sources of raw data → one recognizer and one decision per source → fusion → final decision → one unique output]

Example: object recognition
• Source: image regions

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 104

Prototype-based fusion

[Figure: a document x and the query are each mapped to a prototype representation π(x) concatenating several feature channels (…, SIFT, …), and compared in that common space]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 105

Concept “Mona Lisa”

Multimedia is complex data

1503-1506 by L. Da Vinci EN: Mona Lisa IT: La Gioconda FR: La Joconde

Visible in Le Louvre, Paris 48° 51′ 34″ N 2° 19′ 54″ E

Type: Oil on poplar Dimensions: 77 cm × 53 cm (30 in × 21 in)

Photo by Mr. X Date, context Reason of trip

Variations from artists

Embedding into prints

Chocolate piece

URLs about Mona Lisa

Books about Mona Lisa

Videos about Mona Lisa

Audio about Mona Lisa

“official picture”

Texts about Mona Lisa

“Factual” data

“Faithful” data “Related” data

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 106

Networked multimodal data

Data

Tags

Users

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 107

Features

Networked multimodal data

Data

Tags

Users

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 108

• Curse of dimensionality

– Sparsity

– Dimension reduction/visualisation

• Big data

– Indexing

– Visual Analytics

Complements

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 109

Statistics tells us that the more data we have, the better our estimates get

[Simulation: estimating the mean and standard deviation of an exponential distribution for sample sizes n = 10, 100, 1'000, 3'000, 10'000]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 110

Imagine I want to know the age spread of a population
• I create 10 intervals (0-10, 10-20,…) and I want to know the proportion of people in each interval
• To have a reliable estimate I want 100 persons per interval on average
→ With 10*100 = 1,000 persons, I get my estimates

Now imagine I want to estimate height against age (I add one dimension to my measure)
• I create 10 height intervals and ask "how many people with height X are aged Y?"
• I have 10*10 = 100 such (X,Y) pairs
• To get 100 persons per pair on average I need 100*100 = 10,000 persons

And so on… The size of the data needed for a reliable estimate grows exponentially with the dimension (huge!!!) — a short computation below illustrates this growth
In practice our data size is limited
Our only choice to preserve reliability is to reduce the dimension

High dimensionality
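A short computation of the growth above (interval and per-cell counts as in the example):

```python
def persons_needed(dimensions, intervals=10, per_cell=100):
    # Cells grow as intervals**dimensions; we want per_cell persons
    # per cell on average for a reliable estimate.
    return per_cell * intervals ** dimensions

for d in range(1, 5):
    print(d, persons_needed(d))  # 1 -> 1,000; 2 -> 10,000; 3 -> 100,000; ...
```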

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 111

Dimension reduction: principle

• Given a set of data in a M-dimensional space, we seek an equivalent representation of lower dimension – For better statistics – For visualization (2D, 3D,..)

• Dimension reduction induces a loss. What to sacrifice? What to preserve? – Preserve local: neighbourhood, distances – Preserve global: distribution of data, variance – Sacrifice local: noise – Sacrifice global: large distances – Map linearly – Unfold data

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 112

Some example techniques:

• SFC: preserve neighbourhoods

• PCA: preserve global linear structures

• MDS: preserve linear neighbourhoods

• IsoMAP: Unfold neighbourhoods

• SNE family: unfold statistically

Dimension reduction

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 113

Non-recursive Space Filling Curves

Z-Scan Curve Snake Scan Curve

[Illustrations from the lecture “SFC in Info Viz”, Jiwen Huo, Uni Waterloo, CA]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 114

Recursive Space Filling Curves

Hilbert Curve

Gray Code Curve

Peano Curve

[Illustrations from the lecture “SFC in Info Viz”, Jiwen Huo, Uni Waterloo, CA]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 115

3D SFC

Z Curve Peano Curve Hilbert Curve

N-dimensional algorithm: A.R. Butz (April 1971). "Alternative algorithm for Hilbert's space filling curve". IEEE Transactions on Computers, C-20(4): 424–426.

[Illustrations from the lecture “SFC in Info Viz”, Jiwen Huo, Uni Waterloo, CA]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 116

PCA

[Illustration Wikipedia]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 117

• A local neighbourhood graph (eg 5-NN graph) is built to create a topology and ensure continuity

• Distances are replaced by geodesics (paths on the neighbourhood graph)

• MDS is applied on this interdistance matrix (eg with m=2)

IsoMap (non Euclidean)

[Illustration from http://isomap.stanford.edu]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 118

• MNIST dataset

t-SNE example

[Illustration from L. van der Maaten’s website]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 119

Traces of our everyday activities can be:

• Captured, exchanged (production, communication)

• Aggregated, Stored

• Filtered, Mined (Processing)

The “V”’s of Big Data:

• Volume, Variety, Velocity (technical)

• and hopefully... Value

Raw data is almost worthless, the added value is in juicing the data into information (and knowledge)

Big Data

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 120

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 121

Large scale: massive data volume, too large even for a sequential scan of a small portion
Aim: from a query, "filter" out by (fast) approximate search a tiny (fixed-size) portion of the data as candidate answers, then perform (slow) exact search on that tiny-scale sample
Method: assuming the features group the documents coherently, approximately identify the vicinity of the query and search within that neighborhood

Tools for large-scale indexing

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 122

Approximate search

[Figure: the fast approximate filtering border (exact-search zone) approximates the actual, slow border; the final results are computed inside it, the rest of the collection is never considered]

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 123

Idea: reference-based indexing — "If I am close to you, I see the same thing as you"

I am close to Geneva, Montreux and Evian; anything also close to these places is close to me

• Fix a few reference documents (landmarks)
• For each document, measure the distance to each reference document
– E.g.: define the set of the 5 closest landmarks

• For a query:

– Find its 5 closest landmarks
– Find all documents whose 5 closest landmarks are the same
– Actually (slow distance) compare to these

Fast approximate filtering

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 124

• Fix a few reference documents (landmarks) → can be done offline

• For each document, measure the distance to each reference document
– E.g.: define the set of the 5 closest landmarks → also done offline

• For a query:
– Find its 5 closest landmarks → fast, since there are few landmarks

– Find all documents whose 5 closest landmarks are the same → can be made fast (e.g. with an inverted file)

– Actually (slow distance) compare to these → fast, since the sample is tiny

Fast approximate filtering

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 125

Permutation-based Indexing

D = {x1, …, xN}: N objects; R = {r1, …, rn} ⊂ D: n references

Each xi is identified by an ordered list L(xi, R) = (ri1, …, rin) such that d(xi, rij) ≤ d(xi, rij+1) for all j = 1, …, n − 1

Example with n = 5:
L(x1, R) = (r1, r2, r3, r4, r5)
L(x2, R) = (r1, r2, r3, r4, r5)
L(x3, R) = (r5, r3, r2, r4, r1)

The true distance d(q, xi) is then approximated by the Spearman Footrule Distance (SFD) between the two permutations:

d(q, xi) ≈ d_SFD(q, xi) = Σj | rank(rj, L(q, R)) − rank(rj, L(xi, R)) |
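A minimal sketch of this scheme (NumPy assumed; random vectors stand in for document features):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 16))  # N objects (stand-in features)
refs = data[rng.choice(len(data), size=8, replace=False)]  # n references

def ordering(x):
    # Ordered list L(x, R): reference indices sorted by distance to x.
    return np.argsort(np.linalg.norm(refs - x, axis=1))

def footrule(order_a, order_b):
    # Spearman footrule: sum over references of |rank in L(a) - rank in L(b)|.
    rank_a = np.argsort(order_a)  # inverse permutation = rank of each reference
    rank_b = np.argsort(order_b)
    return np.abs(rank_a - rank_b).sum()

orders = np.array([ordering(x) for x in data])  # computed offline

query = rng.normal(size=16)
q_order = ordering(query)
candidates = np.argsort([footrule(q_order, o) for o in orders])[:20]
# 'candidates' is the tiny sample on which the slow exact distance is then computed.
print(candidates)
```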

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 126

• Assess system performance on standard challenges

– Data collection

– Annotation / ground-truth collection

– Query formulation

– Performance measures (precision, recall,…)

Yearly event, related to a scientific conference

In general, participants are academic teams

Evaluation

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 127

ImageCLEF

http://www.imageclef.org/

CLEF= Cross Lingual Evaluation Forum

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 128

ImageCLEF Wikipedia

Images + text (overall: 237,434)

• English only: 70,127
• German only: 50,291
• French only: 28,461
• English and German: 26,880
• English and French: 20,747
• German and French: 9,646
• English, German and French: 22,899
• Language undetermined: 8,144
• No textual annotation: 239

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 129

TRECVid

The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results.

http://trecvid.nist.gov/

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 130

MIREX

http://www.music-ir.org

• The Music Information Retrieval Evaluation eXchange (MIREX) is an annual evaluation campaign for Music Information Retrieval (MIR) algorithms, coupled to the International Society (and Conference) for Music Information Retrieval (ISMIR)

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 131

Other

• 3D Retrieval – SHREC: Shape Retrieval Contest

– http://www.aimatshape.net/event/SHREC

• XML retrieval – INitiative for the Evaluation of XML retrieval

– https://inex.mmci.uni-saarland.de/

• Many dedicated corpora for specific tasks – Generic (Image-Net)

– Medical (ImageCLEF)

– Satellite imagery

– …

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 132

Conclusions

• Multimedia Information Retrieval extends classical IR
– The word-occurrence model is preserved

• Text IR is not always sufficient / feasible

– Completeness
– Scalability

• Multimedia IR requires accurate interpretation of multimodal data

• Fusion is required to reach such an accurate level

• Multimedia IR still faces many challenges
– Accuracy, interactivity, scalability

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 133

Challenges

• Speed and accuracy of response are the major challenges

• IR tends to be mobile

– Low computational power, connection, memory,…

• Large scale

– Big data

• Partial search

– Searching for a part of a document

• Semantic search

– Searching with high-level description

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 134

• Baeza-Yates R. and Ribeiro-Neto B. (1999): Modern Information Retrieval. Addison-Wesley. http://people.ischool.berkeley.edu/~hearst/irbook

• Baeza-Yates R. and Ribeiro-Neto B. (2011): Modern Information Retrieval. Addison-Wesley, Second Edition http://www.mir2ed.org/

Recommended reading

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 135

Thank you!

Questions?

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 136

Big Data and Large-scale data

– Mohammed, H., & Marchand-Maillet, S. (2015). Scalable Indexing for Big Data Processing. Chapman & Hall.
– Marchand-Maillet, S., & Hofreiter, B. (2014). Big Data Management and Analysis for Business Informatics. Enterprise Modelling and Information Systems Architectures (EMISA), 9.
– von Wyl, M., Mohamed, H., Bruno, E., & Marchand-Maillet, S. (2011). A parallel cross-modal search engine over large-scale multimedia collections with interactive relevance feedback. In ICMR 2011 - ACM International Conference on Multimedia Retrieval.
– Mohamed, H., von Wyl, M., Bruno, E., & Marchand-Maillet, S. (2011). Learning-based interactive retrieval in large-scale multimedia collections. In AMR 2011 - 9th International Workshop on Adaptive Multimedia Retrieval.
– von Wyl, M., Hofreiter, B., & Marchand-Maillet, S. (2012). Serendipitous Exploration of Large-scale Product Catalogs. In 14th IEEE International Conference on Commerce and Enterprise Computing (CEC 2012), Hangzhou, CN.

More at http://viper.unige.ch/publications

References

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 137

References: Large-scale Indexing

– Mohamed, H., & Marchand-Maillet, S. (2015). Quantized Ranking for Permutation-Based Indexing. Information Systems.
– Mohamed, H., Osipyan, H., & Marchand-Maillet, S. (2014). Multi-Core (CPU and GPU) for Permutation-Based Indexing. In Proceedings of the 7th International Conference on Similarity Search and Applications (SISAP 2014), Los Cabos, Mexico.
– Mohamed, H., & Marchand-Maillet, S. (2012). Parallel Approaches to Permutation-Based Indexing using Inverted Files. In SISAP 2012 - 5th International Conference on Similarity Search and Applications.
– Mohamed, H., & Marchand-Maillet, S. (2012). Distributed Media Indexing based on MPI and MapReduce. In CBMI 2012 - 10th Workshop on Content-Based Multimedia Indexing.
– Mohamed, H., & Marchand-Maillet, S. (2012). Enhancing MapReduce using MPI and an optimized data exchange policy. In P2S2 2012 - Fifth International Workshop on Parallel Programming Models and Systems Software for High-End Computing.
– Mohamed, H., & Marchand-Maillet, S. (2014). Distributed media indexing based on MPI and MapReduce. Multimedia Tools and Applications, 69(2).
– Mohamed, H., & Marchand-Maillet, S. (2013). Permutation-Based Pruning for Approximate K-NN Search. In DEXA, Prague, CZ.

More at http://viper.unige.ch/publications

© [email protected] – University of Geneva – Multimedia Retrieval – September 2015 - 138

References: Large data analysis – Manifold learning

– Sun, K., Morrison, D., Bruno, E., & Marchand-Maillet, S. (2013). Learning Representative Nodes in Social Networks. In 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Gold Coast, AU.
– Sun, K., Bruno, E., & Marchand-Maillet, S. (2012). Unsupervised Skeleton Learning for Manifold Denoising and Outlier Detection. In International Conference on Pattern Recognition (ICPR 2012), Tsukuba, JP.
– Sun, K., & Marchand-Maillet, S. (2014). An Information Geometry of Statistical Manifold Learning. In Proceedings of the International Conference on Machine Learning (ICML 2014), Beijing, China.
– Wang, J., Sun, K., Sha, F., Marchand-Maillet, S., & Kalousis, A. (2014). Two-Stage Metric Learning. In Proceedings of the International Conference on Machine Learning (ICML 2014), Beijing, China.
– Sun, K., Bruno, E., & Marchand-Maillet, S. (2012). Stochastic Unfolding. In IEEE Machine Learning for Signal Processing Workshop (MLSP 2012), Santander, Spain.

More at http://viper.unige.ch/publications