Multimedia Retrieval. Outline Audio Retrieval Spoken information Music Document Image Analysis and Retrieval Video Retrieval


  • Slide 1
  • Multimedia Retrieval
  • Slide 2
  • Outline: Audio Retrieval (Spoken information, Music); Document Image Analysis and Retrieval; Video Retrieval
  • Slide 3
  • A Taxonomy of Audio. Sound: Music, Speech, Other? Music: Classical, Country, Disco, Hip Hop, Jazz, Rock. Speech: Sports Announcer, Female, Male. Classical: Orchestra, String Quartet, Choir, Piano, ?
  • Slide 4
  • Spoken Document Retrieval
  • Slide 5
  • Slide 6
  • Speech Recognition Knowledge Sources. Acoustic Modeling: describes the sounds that make up speech. Lexicon: describes which sequences of speech sounds make up valid words. Language Model: describes the likelihood of various sequences of words being spoken.
  • Slide 7
  • Speech Recognition in Brief. Speech -> Signal Processing -> Phonetic Probability Estimator (Acoustic Model) -> Decoder (Language Model) -> Words. Supporting knowledge sources: Pronunciation Lexicon, Grammar.
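The way these knowledge sources fit together is usually written as a noisy-channel decoding rule. The slides do not spell it out, so the formula below is the standard textbook formulation rather than a transcription; the symbols $A$ (acoustic signal) and $W$ (candidate word sequence) are notation assumed for this sketch.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid A)
        \;=\; \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        \;=\; \arg\max_{W} P(A \mid W)\, P(W)
```

Here $P(A \mid W)$ is supplied by the acoustic model together with the pronunciation lexicon, $P(W)$ by the language model, and $P(A)$ can be dropped because it does not depend on $W$.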
  • Slide 8
  • Hints For Better Recognition: topical information; news of the day; image information; ? Goal: improve the estimation p(word|acoustic_signal). Main idea: p(word|acoustic_signal) -> p(word|acoustic_signal, X). What could be X?
  • Slide 9
  • Hints For Better Recognition: topical information; news of the day; image information; lip reading; Video Optical Character Recognition (VOCR). Goal: improve the estimation p(word|acoustic_signal). Main idea: p(word|acoustic_signal) -> p(word|acoustic_signal, X). What could be X?
  • Slide 10
  • Speech Recognition Accuracy: Word Error Rate
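Word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that definition using word-level edit distance; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: two substitutions and one insertion against five reference words.
reference = "find pictures of harry hertz"
hypothesis = "find picture of hairy hertz today"
print(word_error_rate(reference, hypothesis))
```

Because insertions are counted, the rate can exceed 100% when the hypothesis is much longer than the reference.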
  • Slide 11
  • Information Retrieval Precision vs. Speech Accuracy. [Figure: Relative Precision (% of Text IR) plotted against Word Error Rate.] Retrieval degrades only slightly when the word error rate is below 30%. Source: Hauptmann, A., Wactlar, H., Indexing and Search of Multimodal Information, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-97), Munich, Germany, April 1997.
  • Slide 12
  • Spoken Document Retrieval. Segmentation issue: continuous speech data without story boundaries. Typical segmentation approaches: overlapping windows (30 sec for each segment); automatic detection of speaker changes.
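The overlapping-window approach can be sketched in a few lines. The 30-second window length comes from the slide; the 15-second step (a 50% overlap) and the function name are assumptions made for this example.

```python
from typing import List, Tuple


def overlapping_windows(total_sec: float, window_sec: float = 30.0,
                        step_sec: float = 15.0) -> List[Tuple[float, float]]:
    """Return (start, end) times of fixed-length windows covering [0, total_sec].

    Consecutive windows overlap by window_sec - step_sec seconds; the last
    window is shortened so that it ends at total_sec.
    """
    if window_sec <= 0 or step_sec <= 0:
        raise ValueError("window_sec and step_sec must be positive")
    segments = []
    start = 0.0
    while start < total_sec:
        end = min(start + window_sec, total_sec)
        segments.append((start, end))
        if end >= total_sec:
            break
        start += step_sec
    return segments


# A 70-second recording split into 30-second pseudo-documents.
for start, end in overlapping_windows(70):
    print(f"{start:5.1f} s to {end:5.1f} s")
```

Each window is then indexed as its own document, so a story that straddles one window boundary still falls mostly inside a neighbouring window.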
  • Slide 13
  • Spoken Document Retrieval: Document Expansion. Motivation: documents are erroneous. Goal: apply expansion techniques to reduce the impact of recognition errors in spoken documents. Similar to query expansion.
  • Slide 14
  • Spoken Document Retrieval: Document Expansion. Motivation: documents are erroneous. Goal: apply expansion techniques to reduce the impact of recognition errors in spoken documents. Similar to query expansion. [Diagram: Speech Recognized Transcript -> Clean Doc Collection (web docs) -> top ranked doc1, doc2, doc3, doc4 -> find common words in top ranked docs.]
  • Slide 15
  • Spoken Document Retrieval: Document Expansion. Motivation: documents are erroneous. Goal: apply expansion techniques to reduce the impact of recognition errors in spoken documents. Similar to query expansion. Steps: treat each speech document as a query; find clean documents that are relevant to the speech document; expand each speech document with the common words in the top ranked clean documents.
  • Slide 16
  • Document Expansion (Singhal & Pereira, 1999)
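The three expansion steps can be illustrated with a toy example. This is only a sketch: the term-overlap ranking, the `top_k` and `min_docs` parameters, and the sample texts are assumptions for illustration, not the weighting scheme used by Singhal & Pereira.

```python
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def overlap_score(query_terms: Counter, doc_terms: Counter) -> int:
    """Count query term occurrences matched in the document (stand-in for a real ranker)."""
    return sum(min(count, doc_terms[term]) for term, count in query_terms.items())


def expand_document(transcript: str, clean_docs: Dict[str, str],
                    top_k: int = 3, min_docs: int = 2) -> List[str]:
    """Append words shared by at least min_docs of the top_k clean documents."""
    # Step 1: treat the recognized transcript as a query.
    query = Counter(tokenize(transcript))
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in clean_docs.items()}

    # Step 2: rank the clean documents against it and keep the top ones.
    ranked = sorted(doc_terms, key=lambda d: overlap_score(query, doc_terms[d]), reverse=True)
    top = ranked[:top_k]

    # Step 3: add words that are common to the top documents but absent from the transcript.
    doc_freq = Counter(term for doc_id in top for term in doc_terms[doc_id])
    new_terms = sorted(t for t, df in doc_freq.items() if df >= min_docs and t not in query)
    return tokenize(transcript) + new_terms


# Hypothetical data: the recognizer produced "why" where "white" was spoken.
transcript = "the president visited the why house today"
clean_docs = {
    "doc1": "president visited white house",
    "doc2": "white house press briefing president",
    "doc3": "football scores today",
    "doc4": "stock market report",
}
print(expand_document(transcript, clean_docs, top_k=2, min_docs=2))
```

In this toy case the expansion restores the misrecognized word, so a query containing "white" can match the spoken document again.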
  • Slide 17
  • A Taxonomy of Audio. Sound: Music, Speech, Other? Music: Classical, Country, Disco, Hip Hop, Jazz, Rock. Speech: Sports Announcer, Female, Male. Classical: Orchestra, String Quartet, Choir, Piano, ?
  • Slide 18
  • Music Information Retrieval
  • Slide 19
  • Music Retrieval. A textual retrieval approach: using metadata (titles, artists, genres). Content-based music retrieval: query by audio; query by score document/segment.
  • Slide 20
  • Content-based Music Retrieval. On-line processing: Microphone signal input -> Sampling (11 kHz) -> Center Clipping -> Short-term Autocorrelation -> Note Segmentation -> Mid-level Representation -> Similarity Comparison -> Query results (ranked song list). Off-line processing: Songs Database -> MIDI message extraction -> Mid-level Representation. Example: the MIDI representation 67 64 65 62 60 becomes the interval sequence -3 1 -3 -2.
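Two stages of this pipeline are easy to sketch: estimating the pitch of one audio frame by center clipping plus short-term autocorrelation, and turning a MIDI note sequence into pitch intervals. The code below is an illustration under assumptions, not the system on the slide: the 11025 Hz sample rate, the 30% clipping ratio, the 80 to 1000 Hz search range and all function names are chosen for the example.

```python
import math
from typing import List, Sequence


def center_clip(frame: Sequence[float], ratio: float = 0.3) -> List[float]:
    """Zero out low-amplitude samples and shift the remaining ones toward zero."""
    threshold = ratio * max(abs(x) for x in frame)
    clipped = []
    for x in frame:
        if x > threshold:
            clipped.append(x - threshold)
        elif x < -threshold:
            clipped.append(x + threshold)
        else:
            clipped.append(0.0)
    return clipped


def estimate_midi_note(frame: Sequence[float], sample_rate: int = 11025,
                       fmin: float = 80.0, fmax: float = 1000.0) -> int:
    """Pick the autocorrelation peak in the allowed lag range and map it to a MIDI note."""
    clipped = center_clip(frame)
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(clipped[n] * clipped[n + lag] for n in range(len(clipped) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    frequency = sample_rate / best_lag
    # MIDI note 69 is A4 = 440 Hz; each semitone is a factor of 2 ** (1 / 12).
    return round(69 + 12 * math.log2(frequency / 440.0))


def pitch_intervals(midi_notes: Sequence[int]) -> List[int]:
    """Differences between consecutive MIDI notes, in semitones."""
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]


# A synthetic 220 Hz tone (A3) stands in for a hummed note.
tone = [math.sin(2 * math.pi * 220 * n / 11025) for n in range(512)]
print(estimate_midi_note(tone))
print(pitch_intervals([67, 64, 65, 62, 60]))
```

Intervals rather than absolute notes are a natural mid-level representation because they do not change when the same melody is sung in a different key.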
  • Slide 21
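  • Content-based Music Retrieval. Interval sequences of two songs: 1 1 2 0 2 0 1 2 0 and -3 1 1 2. N-gram representation (3-gram = feature, with its count in the first song and in the second song): 1 1 2 = C1 (1, 1); 1 2 0 = C2 (2, 0); 2 0 2 = C3 (1, 0); 0 2 0 = C4 (1, 0); -3 1 1 = C5 (0, 1). This gives a vector representation for each music document, which turns music retrieval into a typical information retrieval problem.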
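The n-gram vectors can be built and compared with a few lines of code. In this sketch the two interval sequences are taken to be the ones consistent with the counts in the slide's table, and cosine similarity is an assumed choice of comparison measure; the slide only says that the result is a standard information retrieval problem.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(intervals: List[int], n: int = 3) -> Counter:
    """Count every run of n consecutive intervals."""
    return Counter(tuple(intervals[i:i + n]) for i in range(len(intervals) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first_song = ngram_counts([1, 1, 2, 0, 2, 0, 1, 2, 0])
second_song = ngram_counts([-3, 1, 1, 2])

print(first_song[(1, 2, 0)], second_song[(1, 2, 0)])  # counts for feature C2
print(cosine_similarity(first_song, second_song))
```

A hummed query is processed the same way, and the songs are ranked by their similarity to the query vector.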
  • Slide 22
  • Document Image Analysis and Retrieval
  • Slide 23
  • Document Image Analysis. Recognize text (OCR): convert page images to Unicode; machine-printed, handwritten. Analyze page layout geometry: a 2-D problem (unlike speech, text); good language-free algorithms. Capture logical structure: output marked-up text (XML, etc.); exploit non-textual clues.
  • Slide 24
  • Video/Image OCR Block Diagram: Video or Image -> Text Area Detection -> Text Area Preprocessing -> Commercial OCR -> UTF-8 Text
  • Slide 25
  • Text Detection
  • Slide 26
  • Video OCR. Low resolution (as low as 10 pixels of height per character), limited by the NTSC (352x248)/PAL/SECAM TV standards. Complex background. Character hue and brightness similar to the background.
  • Slide 27
  • VOCR Preprocessing Problems
  • Slide 28
  • Video Frames (1/2 s intervals) -> Filtered Frames -> AND-ed Frames
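The idea behind AND-ing frames is that caption text stays in place across consecutive frames while the background changes, so only the text survives a pixel-wise AND. The sketch below shows this on tiny hand-made frames; the slide does not say which filter is applied, so a plain brightness threshold is used as a stand-in, and the frame data is invented.

```python
from typing import List

Frame = List[List[int]]


def binarize(frame: Frame, threshold: int = 128) -> Frame:
    """Stand-in filter: mark pixels at least as bright as the threshold with 1."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in frame]


def and_frames(masks: List[Frame]) -> Frame:
    """Pixel-wise AND of binary masks that all have the same size."""
    if not masks:
        raise ValueError("need at least one mask")
    result = [row[:] for row in masks[0]]
    for mask in masks[1:]:
        result = [[a & b for a, b in zip(result_row, mask_row)]
                  for result_row, mask_row in zip(result, mask)]
    return result


# Three hypothetical grayscale frames: the two left pixels of the top row are
# static bright "text"; the other bright pixels are background that moves.
frames = [
    [[255, 255, 0, 200], [0, 0, 0, 0]],
    [[255, 255, 200, 0], [0, 0, 0, 0]],
    [[255, 255, 0, 0], [0, 200, 0, 0]],
]
print(and_frames([binarize(f) for f in frames]))
```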
  • Slide 29
  • Slide 30
  • OCR Document Retrieval. Task: find OCR-recognized documents relevant to an information need. Challenge: the documents are erroneous, so retrieval needs to cope with word errors.
  • Slide 31
  • OCR Document Retrieval. Correction-based approaches: find potential word errors and replace each with the most likely correct one. Partial matching approaches: word -> a set of n-grams; word matches -> n-gram matches.
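Partial matching can be illustrated by comparing the character n-gram sets of two words. This sketch assumes trigrams, `#` padding at the word boundaries and the Dice coefficient as the match score; none of these choices is stated on the slide.

```python
from typing import Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Set of character n-grams of the word, padded with '#' at both ends."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# "directcr" is a hypothetical OCR misreading of "director".
print(ngram_similarity("director", "directcr"))
print(ngram_similarity("director", "program"))
```

A query word then still matches a misrecognized word in the document as long as most of their n-grams agree, without any explicit correction step.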
  • Slide 32
  • Video Retrieval
  • Slide 33
  • Video Retrieval - Application of Diverse Technologies. Speech understanding for automatically derived transcripts. Image understanding for video paragraphing; face, text and other object recognition. Natural language for query expansion, topic detection and content summarization. Human computer interaction for video display, navigation and reuse. Integration overcomes the limitations of each.
  • Slide 34
  • Introduction to TREC Video Retrieval Track. NIST TREC Video Track web site (http://www-nlpir.nist.gov/projects/trecvid/). Video Retrieval Track started in 2001. Investigation of content-based retrieval from digital video. Focus on the shot as the unit of information retrieval rather than the scene or story/segment/clip.
  • Slide 35
  • The TRECVID Collections. 2001: 11 hours, 74 queries, 8000 shots. 2002: 40 hours, 25 queries, 14000 shots. Video from the Internet Archive between the 50s and 70s; advertising, educational, industrial and amateur films; common shot boundaries. 2003: 56 hours, 25 queries, 32000 shots; 1998 Broadcast News (CNN, ABC, C-SPAN); + common speech recognition; + common annotations. 2004: 61 hours, 24 queries, 33000 shots; more 1998 Broadcast News.
  • Slide 36
  • Sample Query and Target. Query: Find pictures of Harry Hertz, Director of the National Quality Program, NIST. Speech: We're looking for people that have a broad range of expertise that have business knowledge that have knowledge on quality management on quality improvement and in particular. OCR: H,arry Hertz a Director aro 7 wa-,i,,ty Program,Harry Hertz a Director
  • Slide 37
  • System Architecture (TREC Video Track 2001). Combine video, audio and text retrieval scores. Query -> Retrieval Agents (Text, Image, Audio) -> Text Score, Image Score, Audio Score -> Final Score.
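One simple way to merge the agents' scores is a weighted sum per shot. The slide does not say how the 2001 system computed its final score, so the linear combination, the weights and the shot identifiers below are all assumptions for illustration.

```python
from typing import Dict, List


def combine_scores(agent_scores: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum of per-agent scores for each shot; a missing score counts as 0."""
    final: Dict[str, float] = {}
    for agent, scores in agent_scores.items():
        weight = weights.get(agent, 0.0)
        for shot_id, score in scores.items():
            final[shot_id] = final.get(shot_id, 0.0) + weight * score
    return final


def rank_shots(final: Dict[str, float]) -> List[str]:
    """Shot identifiers ordered from highest to lowest final score."""
    return sorted(final, key=lambda shot_id: final[shot_id], reverse=True)


# Hypothetical scores from the three retrieval agents, each already scaled to [0, 1].
agent_scores = {
    "text": {"shot1": 0.2, "shot2": 0.9},
    "image": {"shot1": 0.8, "shot3": 0.5},
    "audio": {"shot2": 0.1},
}
weights = {"text": 0.4, "image": 0.4, "audio": 0.2}

print(rank_shots(combine_scores(agent_scores, weights)))
```

A weighted sum only makes sense if the agents' scores are on comparable scales, so in practice each agent's output would need to be normalized first.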
  • Slide 38
  • Results for TREC01 (ARR, Recall). ASR Transcripts: 1.84%, 13.2%. VOCR: 5.93%, 7.52%. Image Retrieval: 14.99%, 24.45%. Combine: 18.9%, 28.25%.