Audio Retrieval
LBSC 708A
Session 11, November 20, 2001
Philip Resnik
Agenda
• Questions
• Group thinking session
• Speech retrieval
• Music retrieval
Shoah Foundation Collection
• 52,000 interviews
– 116,000 hours (13 years)
– 32 languages
• Full description cataloging
– 14,000-term thesaurus
– 4,000 interviews for $8 million
Audio Retrieval
• We have already discussed three approaches
– Controlled vocabulary indexing
– Ranked retrieval based on associated captions
– Social filtering based on other users’ ratings
• Today’s focus is on content-based retrieval
– The analogue of content-based text retrieval
Audio Retrieval
• Retrospective retrieval applications
– Search music and nonprint media collections
– Electronic finding aids for sound archives
– Index audio files on the web
• Information filtering applications
– Alerting service for a news bureau
– Answering machine detection for telemarketing
– Autotuner for a car radio
The Size of the Problem
• 30,000 hours in the Maryland Libraries
– Unique collections with limited physical access
• 116,000 hours in the Shoah collection
• Millions of hours of streaming audio each year
– Becoming available worldwide on the web
• Broadcast news (audio/video)
– e.g., television archives
HotBot Audio Search Results
Audio Genres
• Speech-centered
– Radio programs
– Telephone conversations
– Recorded meetings
• Music-centered
– Instrumental, vocal
• Other sources
– Alarms, instrumentation, surveillance, …
Detectable Speech Features
• Content
– Phonemes, one-best word recognition, n-best
• Identity
– Speaker identification, speaker segmentation
• Language
– Language, dialect, accent
• Other measurable parameters
– Time, duration, channel, environment
How Speech Recognition Works
• Three stages
– What sounds were made? Convert the waveform into subword units (phonemes)
– How could the sounds be grouped into words? Identify the most probable word segmentation points
– Which of the possible words were spoken? Based on the likelihood of possible multiword sequences
• All three stages are learned from training data
– Trained as hidden Markov models, using iterative hill climbing (see the sketch below)
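In practice, the "most probable" choices at each stage are found by dynamic programming over a hidden Markov model. Below is a minimal Viterbi decoding sketch; the two competing words and all probabilities are toy values invented for illustration, so it shows the general technique rather than any system discussed here.

```python
# Minimal Viterbi decoding sketch: find the most probable hidden
# (word) state sequence for a sequence of observed phonemes.
# All states and probabilities below are toy values for illustration.

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][r] * trans_p[r][s] * emit_p[s][observations[t]], r)
                for r in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Trace the most probable path back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["w_ice", "w_eyes"]                 # hypothetical competing words
start_p = {"w_ice": 0.5, "w_eyes": 0.5}
trans_p = {"w_ice": {"w_ice": 0.7, "w_eyes": 0.3},
           "w_eyes": {"w_ice": 0.3, "w_eyes": 0.7}}
emit_p = {"w_ice": {"ay": 0.5, "s": 0.4, "z": 0.1},
          "w_eyes": {"ay": 0.5, "s": 0.1, "z": 0.4}}

print(viterbi(["ay", "s"], states, start_p, trans_p, emit_p))
# ['w_ice', 'w_ice'] -- the final "s" sound favors "ice" over "eyes"
```

Real recognizers run the same dynamic program over thousands of context-dependent phone states rather than two toy words.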
Using Speech Recognition
[Pipeline diagram] Phone Detection → Word Construction → Word Selection
• Phone detection yields phone n-grams and a phone lattice
• Word construction applies a transcription dictionary to the phone lattice, yielding words and a word lattice
• Word selection applies a language model to the word lattice, yielding the one-best transcript
ETHZ Broadcast News Retrieval
• Segment broadcasts into 20-second chunks
• Index phoneme n-grams
– Overlapping one-best phoneme sequences
– Trained using native German speakers
• Form phoneme trigrams from typed queries
– Rule-based system for “open” vocabulary
• Vector space trigram matching
– Identify ranked segments by time
Phoneme Trigrams
• “Manage” -> m ae n ih jh
– Dictionaries provide accurate transcriptions, but valid only for a single accent and dialect
– Rule-based transcription handles unknown words
• Index every overlapping 3-phoneme sequence (see the sketch below)
– m ae n
– ae n ih
– n ih jh
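A hedged sketch of how trigram indexing and vector space trigram matching might fit together; the transcription of “manage” is from the slide, but the sample segment, query, and scoring code are illustrative, not the ETHZ implementation.

```python
from collections import Counter
from math import sqrt

def phoneme_trigrams(phones):
    """Every overlapping 3-phoneme sequence in a phoneme string."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def cosine(a, b):
    """Vector space similarity between two trigram count vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = (sqrt(sum(v * v for v in a.values())) *
           sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# One indexed segment and one query, both as phoneme strings
segment = Counter(phoneme_trigrams("m ae n ih jh m ah n t".split()))
query = Counter(phoneme_trigrams("m ae n ih jh".split()))

print(sorted(query))
# [('ae','n','ih'), ('m','ae','n'), ('n','ih','jh')] -- as on the slide
print(round(cosine(query, segment), 3))   # ~0.655
```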
Cambridge Video Mail Retrieval
• Added personal audio (and video) to email
– But subject lines still typed on a keyboard
• Indexed most probable phoneme sequences
Cambridge Video Mail Retrieval
• Translate queries to phonemes with a dictionary (see the sketch below)
– Skip stopwords and words with fewer than 3 phonemes
• Find non-overlapping matches in the lattice
– Queries take about 4 seconds per hour of material
• Vector space exact word match
– No morphological variations checked
– Normalize using the most probable phoneme sequence
• Select from a ranked list of subject lines
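A minimal sketch of the query-side translation just described, assuming an invented stopword list and pronunciation dictionary; the real system then matched the resulting phoneme sequences against the stored lattices.

```python
# Translate a typed query into phoneme sequences via a pronunciation
# dictionary, skipping stopwords and very short words.
# The stopword list and dictionary entries are illustrative only.

STOPWORDS = {"the", "a", "an", "of", "in"}
DICTIONARY = {
    "manage": ["m", "ae", "n", "ih", "jh"],
    "budget": ["b", "ah", "jh", "ih", "t"],
}

def query_to_phonemes(query, min_phones=3):
    sequences = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue                      # stopwords carry little content
        phones = DICTIONARY.get(word)
        if phones and len(phones) >= min_phones:
            sequences.append(phones)      # very short words match too easily
    return sequences

print(query_to_phonemes("manage the budget"))
# [['m', 'ae', 'n', 'ih', 'jh'], ['b', 'ah', 'jh', 'ih', 't']]
```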
Contrast of Approaches
• Rule-based transcription
– Potentially errorful
– Broad coverage, handles unknown words
• Dictionary-based transcription
– Accurate
– Good for smaller settings
• Both are susceptible to the problem of variability
BBN Radio News Retrieval
Comparison with Text Retrieval
• Detection is harder
– Speech recognition errors
• Selection is harder
– Date and time are not very informative
• Examination is harder
– A linear medium is hard to browse
– Arbitrary segments produce unnatural breaks
Speaker Identification
• Gender
– Classify speakers as male or female
• Identity
– Detect speech samples from the same speaker
– To assign a name, a known training sample is needed
• Speaker segmentation (see the sketch below)
– Identify speaker changes
– Count the number of speakers
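The slides do not specify an algorithm; one common family of approaches flags a speaker change wherever adjacent windows of acoustic feature vectors look statistically different. The sketch below uses synthetic features and a crude pooled-variance distance in place of the Bayesian Information Criterion often used for this.

```python
import numpy as np

def change_points(features, win=50, threshold=2.0):
    """Flag frames where the means of adjacent windows of feature
    vectors differ sharply, in pooled-standard-deviation units."""
    points = []
    for t in range(win, len(features) - win):
        left, right = features[t - win:t], features[t:t + win]
        pooled = np.sqrt((left.var(axis=0) + right.var(axis=0)) / 2) + 1e-9
        dist = np.abs(left.mean(axis=0) - right.mean(axis=0)) / pooled
        if dist.mean() > threshold:
            points.append(t)
    return points

# Synthetic demo: two "speakers" with different feature statistics
rng = np.random.default_rng(0)
speech = np.vstack([rng.normal(0, 1, (200, 12)),   # speaker A frames
                    rng.normal(3, 1, (200, 12))])  # speaker B frames
print(change_points(speech)[:3])   # change flagged near frame 200
```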
A Richer View of Speech
• Speaker identification
– Known-speaker and “more like this” searches
– Gender detection for search and browsing
• Topic segmentation via vocabulary shift (see the sketch below)
– More natural breakpoints for browsing
• Speaker segmentation
– Visualize turn-taking behavior for browsing
– Classify turn-taking patterns for searching
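As a hedged illustration of topic segmentation via vocabulary shift (in the spirit of TextTiling, not necessarily the method behind this slide), the sketch below proposes a boundary wherever adjacent blocks of transcript share little vocabulary; the block size and threshold are arbitrary choices for the example.

```python
from collections import Counter
from math import sqrt

def vocabulary_shift_boundaries(words, block=20, threshold=0.2):
    """Propose topic boundaries where adjacent blocks of words have
    low cosine similarity between their term-count vectors."""
    boundaries = []
    for i in range(block, len(words) - block, block):
        left = Counter(words[i - block:i])
        right = Counter(words[i:i + block])
        num = sum(left[w] * right[w] for w in left)
        den = (sqrt(sum(c * c for c in left.values())) *
               sqrt(sum(c * c for c in right.values())))
        if num / den < threshold:
            boundaries.append(i)
    return boundaries

transcript = ("budget tax spending deficit " * 10 +
              "weather rain storm forecast " * 10).split()
print(vocabulary_shift_boundaries(transcript))   # [40]: the topic shift
```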
Other Possibly Useful Features
• Channel characteristics
– Cell phone, landline, studio mike, ...
• Accent
– Another way of grouping speakers
• Prosody
– Detecting emphasis could help search or browsing
• Non-speech audio
– Background sounds, audio cues
Competing Demands on the Interface
• A query must result in a manageable set
– But users prefer simple query interfaces
• The selection interface must show several segments
– Representations must be compact, but informative
• Rapid examination should be possible
– But complete access to the recordings is desirable
Iterative Prototyping Strategy
• Select a user group and a collection
• Observe information-seeking behaviors
– To identify effective search strategies
• Refine the interface
– To support effective search strategies
• Integrate needed speech technologies
• Evaluate the improvements with user studies
– And observe changes to effective search strategies
The VoiceGraph Project
• Exploring rich queries
– Content-based, speaker-based, structure-based
• Multiple cues in the selection interface
– Turn-taking, gender, query terms
• Flexible examination
– Text transcript, audio skims
Depicting Turn Taking Behavior
• Time is depicted from left to right
• Speakers separated vertically within a depiction
• Depictions stacked vertically in rank order
• Actual recordings are more complex
Bootstrapping the Prototype
• Select a user population and a collection
– Journalists and historians
– Broadcast news from the 1960s and 1970s
• Mock up an interface
– Pilot study to see if we’re on the right track
• Integrate “back end” speech processing
– Recognition, identification, segmentation, ...
• Observe information-seeking behaviors
New Zealand Melody Index
• Index musical tunes as contour patterns
– Rising, descending, and repeated pitch
– Note duration as a measure of rhythm
• Users sing queries using words or la, da, …
– Pitch tracking accommodates off-key queries
• Rank order using approximate string matching
– Insert, delete, substitute, consolidate, fragment
• Display title, sheet music, and audio
Contour Matching Example
• “Three Blind Mice” is indexed as:
– *DDUDDUDRDUDRD
• * represents the first note
• D represents a descending pitch (U is ascending)
• R represents a repetition (detectable split, same pitch)
• My singing produces:
– *DDUDDUDRRUDRR
• Approximate string matching finds 2 substitutions (see the sketch below)
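A minimal sketch of the approximate match using plain edit distance; the real system also scored consolidation and fragmentation of notes, which this simplification omits.

```python
def edit_distance(a, b):
    """Dynamic-programming edit distance over contour strings
    (insert, delete, substitute; consolidate/fragment omitted)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitute
    return d[-1][-1]

indexed = "*DDUDDUDRDUDRD"   # "Three Blind Mice" contour from the slide
sung    = "*DDUDDUDRRUDRR"   # the sung query from the slide
print(edit_distance(indexed, sung))   # 2, matching the slide
```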
Muscle Fish Audio Retrieval
• Compute 4 acoustic features for each time slice (see the sketch below)
– Pitch, amplitude, brightness, bandwidth
• Segment at major discontinuities
– Find the average, variance, and smoothness of each segment
• Store pointers to segments in 13 sorted lists
– 4 features with 3 parameters each, plus duration
– Use a commercial database for proximity matching
• Rank order using statistical classification
• Display file name and audio
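A hedged sketch of the per-slice feature computation, reading “brightness” as the spectral centroid and “bandwidth” as the spread around it; these are common interpretations of the terms rather than Muscle Fish’s exact definitions, and pitch estimation is omitted for brevity.

```python
import numpy as np

def slice_features(samples, rate):
    """Amplitude, brightness, and bandwidth for one time slice."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), 1.0 / rate)
    amplitude = np.sqrt(np.mean(samples ** 2))                # RMS level
    brightness = np.sum(freqs * spectrum) / np.sum(spectrum)  # spectral centroid
    bandwidth = np.sqrt(np.sum((freqs - brightness) ** 2 * spectrum)
                        / np.sum(spectrum))                   # spread around it
    return amplitude, brightness, bandwidth

rate = 8000
t = np.arange(rate // 10) / rate                  # one 100 ms slice
tone = 0.5 * np.sin(2 * np.pi * 440 * t)          # synthetic 440 Hz tone
print([round(x, 1) for x in slice_features(tone, rate)])
# [0.4, 440.0, 0.0] -- a pure tone: centroid at 440 Hz, near-zero spread
```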
Summary
• Limited audio indexing is practical now
– Audio feature matching, answering machine detection
• Present interfaces focus on a single technology
– Speech recognition, audio feature matching
– Matching technology is outpacing interface design
Speech-Based Retrieval Systems
LBSC 708R
October 1, 2001
Douglas W. Oard
College of Library and Information Services
University of Maryland
The Size of the Problem
• 30,000 hours in the Maryland Libraries
– Unique collections with limited physical access
• Over 100,000 hours in the National Archives
– With new material arriving at an increasing rate
• Millions of hours broadcast each year
– Over 2,500 radio stations are now webcasting!
Outline
• Retrieval strategies
• Some examples
• Comparing speech and text retrieval
• Speech-based retrieval interface design
Global Internet Audio
[Chart] Over 2,500 Internet-accessible radio and television stations: 1,062 webcasting in English and 1,438 in other languages (source: www.real.com, March 2001)
Speech Retrieval Approaches
• Controlled vocabulary indexing
• Ranked retrieval based on associated text
• Automatic feature-based indexing
• Social filtering based on other users’ ratings
Supporting the Search Process
[Diagram] The search process: Source Selection → Query Formulation → Query → Search (IR System) → Ranked List → Selection → Examination → Document → Delivery, with a query reformulation and relevance feedback loop back to query formulation and a source reselection loop back to source selection (other diagram labels: Nominate, Choose, Predict)
Some Examples (screenshots)
• HotBot Audio Search Results
• ETH Zurich Radio News Retrieval
• BBN Radio News Retrieval
• AT&T Radio News Retrieval
• MIT “Speech Skimmer”
• Cambridge Video Mail Retrieval
• CMU Television News Retrieval
Comparison with Text Retrieval
• Detection and ranking are harder
– Because of speech recognition errors
• Selection is harder
– Useful titles are sometimes hard to obtain
– Date and time alone may not be informative
• Examination is harder
– Browsing is harder in strictly linear media
A Richer View of Speech
• Speaker identification
– Known speakers
– Gender labeling
– “More like this” searches
• Topic segmentation
– Find natural breakpoints for browsing
• Speaker segmentation
– Extract turn-taking behavior
Visualizing Turn-Taking
Other Available Features
• Channel characteristics
– Cell phone, landline, studio mike, ...
• Cultural factors
– Language, accent, speaking rate
• Prosody
– Emphasis detection
• Non-speech audio
– Background sounds, audio cues
Pilot Study
• Student focus groups
– 15 from Journalism, 3 from Library Science
• Preliminary drawing exercise
• Static screen shots and mock-ups
• Focused discussion
• User satisfaction questionnaire
• Structured interviews with domain experts
– Journalism and Library Science faculty
Pilot Study Results
• Graphical speech representations appear viable
– Expected to be useful for high-level browsing when coupled with text transcripts and audio replay
– Some training will be needed
• Suggested improvements
– Adjust result set spacing to facilitate rapid selection
– Identify categories (monologue, conversation, …), potentially useful for search or browsing
For More Information
• Speech-based information retrieval
– http://www.clis.umd.edu/dlrg/speech/
• The VoiceGraph project
– http://www.clis.umd.edu/dlrg/voicegraph/