Audio Retrieval
LBSC 708A
Session 11, November 20, 2001
Philip Resnik
Agenda
• Questions
• Group thinking session
• Speech retrieval
• Music retrieval
Shoah Foundation Collection
• 52,000 interviews
– 116,000 hours (13 years)
– 32 languages
• Full description cataloging
– 14,000-term thesaurus
– 4,000 interviews for $8 million
Audio Retrieval
• We have already discussed three approaches
– Controlled vocabulary indexing
– Ranked retrieval based on associated captions
– Social filtering based on other users’ ratings
• Today’s focus is on content-based retrieval
– The analogue of content-based text retrieval
Audio Retrieval
• Retrospective retrieval applications
– Search music and nonprint media collections
– Electronic finding aids for sound archives
– Index audio files on the web
• Information filtering applications
– Alerting service for a news bureau
– Answering machine detection for telemarketing
– Autotuner for a car radio
The Size of the Problem
• 30,000 hours in the Maryland Libraries
– Unique collections with limited physical access
• 116,000 hours in the Shoah collection
• Millions of hours of streaming audio each year
– Becoming available worldwide on the web
• Broadcast news (audio/video)
– e.g., television archives
HotBot Audio Search Results
Audio Genres
• Speech-centered
– Radio programs
– Telephone conversations
– Recorded meetings
• Music-centered
– Instrumental, vocal
• Other sources
– Alarms, instrumentation, surveillance, …
Detectable Speech Features
• Content
– Phonemes, one-best word recognition, n-best
• Identity
– Speaker identification, speaker segmentation
• Language
– Language, dialect, accent
• Other measurable parameters
– Time, duration, channel, environment
How Speech Recognition Works
• Three stages
– What sounds were made? Convert the waveform into subword units (phonemes)
– How could the sounds be grouped into words? Identify the most probable word segmentation points
– Which of the possible words were spoken? Based on the likelihood of possible multiword sequences
• All three stages are learned from training data
– Trained as hidden Markov models, using iterative hill climbing (see the sketch below)
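In practice, the "most probable" choices at each stage are found by dynamic programming over a hidden Markov model. Below is a minimal Viterbi decoding sketch; the two competing words and all probabilities are toy values invented for illustration, so it shows the general technique rather than any system discussed here.

```python
# Minimal Viterbi decoding sketch: find the most probable hidden
# (word) state sequence for a sequence of observed phonemes.
# All states and probabilities below are toy values for illustration.

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][r] * trans_p[r][s] * emit_p[s][observations[t]], r)
                for r in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Trace the most probable path back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["w_ice", "w_eyes"]                 # hypothetical competing words
start_p = {"w_ice": 0.5, "w_eyes": 0.5}
trans_p = {"w_ice": {"w_ice": 0.7, "w_eyes": 0.3},
           "w_eyes": {"w_ice": 0.3, "w_eyes": 0.7}}
emit_p = {"w_ice": {"ay": 0.5, "s": 0.4, "z": 0.1},
          "w_eyes": {"ay": 0.5, "s": 0.1, "z": 0.4}}

print(viterbi(["ay", "s"], states, start_p, trans_p, emit_p))
# ['w_ice', 'w_ice'] -- the final "s" sound favors "ice" over "eyes"
```

Real recognizers run the same dynamic program over thousands of context-dependent phone states rather than two toy words.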
Using Speech Recognition
[Pipeline diagram] Phone Detection → Word Construction → Word Selection
• Phone detection yields phone n-grams and a phone lattice
• Word construction applies a transcription dictionary to the phone lattice, yielding words and a word lattice
• Word selection applies a language model to the word lattice, yielding the one-best transcript
ETHZ Broadcast News Retrieval
• Segment broadcasts into 20-second chunks
• Index phoneme n-grams
– Overlapping one-best phoneme sequences
– Trained using native German speakers
• Form phoneme trigrams from typed queries
– Rule-based system for “open” vocabulary
• Vector space trigram matching
– Identify ranked segments by time
Phoneme Trigrams
• “Manage” -> m ae n ih jh
– Dictionaries provide accurate transcriptions, but valid only for a single accent and dialect
– Rule-based transcription handles unknown words
• Index every overlapping 3-phoneme sequence (see the sketch below)
– m ae n
– ae n ih
– n ih jh
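A hedged sketch of how trigram indexing and vector space trigram matching might fit together; the transcription of “manage” is from the slide, but the sample segment, query, and scoring code are illustrative, not the ETHZ implementation.

```python
from collections import Counter
from math import sqrt

def phoneme_trigrams(phones):
    """Every overlapping 3-phoneme sequence in a phoneme string."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def cosine(a, b):
    """Vector space similarity between two trigram count vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = (sqrt(sum(v * v for v in a.values())) *
           sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# One indexed segment and one query, both as phoneme strings
segment = Counter(phoneme_trigrams("m ae n ih jh m ah n t".split()))
query = Counter(phoneme_trigrams("m ae n ih jh".split()))

print(sorted(query))
# [('ae','n','ih'), ('m','ae','n'), ('n','ih','jh')] -- as on the slide
print(round(cosine(query, segment), 3))   # ~0.655
```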
Cambridge Video Mail Retrieval
• Added personal audio (and video) to email
– But subject lines still typed on a keyboard
• Indexed most probable phoneme sequences
Cambridge Video Mail Retrieval
• Translate queries to phonemes with a dictionary (see the sketch below)
– Skip stopwords and words with fewer than 3 phonemes
• Find non-overlapping matches in the lattice
– Queries take about 4 seconds per hour of material
• Vector space exact word match
– No morphological variations checked
– Normalize using the most probable phoneme sequence
• Select from a ranked list of subject lines
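A minimal sketch of the query-side translation just described, assuming an invented stopword list and pronunciation dictionary; the real system then matched the resulting phoneme sequences against the stored lattices.

```python
# Translate a typed query into phoneme sequences via a pronunciation
# dictionary, skipping stopwords and very short words.
# The stopword list and dictionary entries are illustrative only.

STOPWORDS = {"the", "a", "an", "of", "in"}
DICTIONARY = {
    "manage": ["m", "ae", "n", "ih", "jh"],
    "budget": ["b", "ah", "jh", "ih", "t"],
}

def query_to_phonemes(query, min_phones=3):
    sequences = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue                      # stopwords carry little content
        phones = DICTIONARY.get(word)
        if phones and len(phones) >= min_phones:
            sequences.append(phones)      # very short words match too easily
    return sequences

print(query_to_phonemes("manage the budget"))
# [['m', 'ae', 'n', 'ih', 'jh'], ['b', 'ah', 'jh', 'ih', 't']]
```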
Contrast of Approaches
• Rule-based transcription
– Potentially errorful
– Broad coverage, handles unknown words
• Dictionary-based transcription
– Accurate
– Good for smaller settings
• Both are susceptible to the problem of variability
BBN Radio News Retrieval
Comparison with Text Retrieval
• Detection is harder
– Speech recognition errors
• Selection is harder
– Date and time are not very informative
• Examination is harder
– A linear medium is hard to browse
– Arbitrary segments produce unnatural breaks
Speaker Identification
• Gender
– Classify speakers as male or female
• Identity
– Detect speech samples from the same speaker
– To assign a name, a known training sample is needed
• Speaker segmentation (see the sketch below)
– Identify speaker changes
– Count the number of speakers
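The slides do not specify an algorithm; one common family of approaches flags a speaker change wherever adjacent windows of acoustic feature vectors look statistically different. The sketch below uses synthetic features and a crude pooled-variance distance in place of the Bayesian Information Criterion often used for this.

```python
import numpy as np

def change_points(features, win=50, threshold=2.0):
    """Flag frames where the means of adjacent windows of feature
    vectors differ sharply, in pooled-standard-deviation units."""
    points = []
    for t in range(win, len(features) - win):
        left, right = features[t - win:t], features[t:t + win]
        pooled = np.sqrt((left.var(axis=0) + right.var(axis=0)) / 2) + 1e-9
        dist = np.abs(left.mean(axis=0) - right.mean(axis=0)) / pooled
        if dist.mean() > threshold:
            points.append(t)
    return points

# Synthetic demo: two "speakers" with different feature statistics
rng = np.random.default_rng(0)
speech = np.vstack([rng.normal(0, 1, (200, 12)),   # speaker A frames
                    rng.normal(3, 1, (200, 12))])  # speaker B frames
print(change_points(speech)[:3])   # change flagged near frame 200
```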
A Richer View of Speech
• Speaker identification
– Known-speaker and “more like this” searches
– Gender detection for search and browsing
• Topic segmentation via vocabulary shift (see the sketch below)
– More natural breakpoints for browsing
• Speaker segmentation
– Visualize turn-taking behavior for browsing
– Classify turn-taking patterns for searching
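As a hedged illustration of topic segmentation via vocabulary shift (in the spirit of TextTiling, not necessarily the method behind this slide), the sketch below proposes a boundary wherever adjacent blocks of transcript share little vocabulary; the block size and threshold are arbitrary choices for the example.

```python
from collections import Counter
from math import sqrt

def vocabulary_shift_boundaries(words, block=20, threshold=0.2):
    """Propose topic boundaries where adjacent blocks of words have
    low cosine similarity between their term-count vectors."""
    boundaries = []
    for i in range(block, len(words) - block, block):
        left = Counter(words[i - block:i])
        right = Counter(words[i:i + block])
        num = sum(left[w] * right[w] for w in left)
        den = (sqrt(sum(c * c for c in left.values())) *
               sqrt(sum(c * c for c in right.values())))
        if num / den < threshold:
            boundaries.append(i)
    return boundaries

transcript = ("budget tax spending deficit " * 10 +
              "weather rain storm forecast " * 10).split()
print(vocabulary_shift_boundaries(transcript))   # [40]: the topic shift
```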
Other Possibly Useful Features
• Channel characteristics
– Cell phone, landline, studio mike, ...
• Accent
– Another way of grouping speakers
• Prosody
– Detecting emphasis could help search or browsing
• Non-speech audio
– Background sounds, audio cues
Competing Demands on the Interface
• A query must result in a manageable set
– But users prefer simple query interfaces
• The selection interface must show several segments
– Representations must be compact, but informative
• Rapid examination should be possible
– But complete access to the recordings is desirable
Iterative Prototyping Strategy
• Select a user group and a collection
• Observe information-seeking behaviors
– To identify effective search strategies
• Refine the interface
– To support effective search strategies
• Integrate needed speech technologies
• Evaluate the improvements with user studies
– And observe changes to effective search strategies
The VoiceGraph Project
• Exploring rich queries
– Content-based, speaker-based, structure-based
• Multiple cues in the selection interface
– Turn-taking, gender, query terms
• Flexible examination
– Text transcript, audio skims
Depicting Turn Taking Behavior
• Time is depicted from left to right
• Speakers separated vertically within a depiction
• Depictions stacked vertically in rank order
• Actual recordings are more complex
Bootstrapping the Prototype
• Select a user population and a collection
– Journalists and historians
– Broadcast news from the 1960s and 1970s
• Mock up an interface
– Pilot study to see if we’re on the right track
• Integrate “back end” speech processing
– Recognition, identification, segmentation, ...
• Observe information-seeking behaviors
New Zealand Melody Index
• Index musical tunes as contour patterns
– Rising, descending, and repeated pitch
– Note duration as a measure of rhythm
• Users sing queries using words or la, da, …
– Pitch tracking accommodates off-key queries
• Rank order using approximate string matching
– Insert, delete, substitute, consolidate, fragment
• Display title, sheet music, and audio
Contour Matching Example
• “Three Blind Mice” is indexed as:
– *DDUDDUDRDUDRD
• * represents the first note
• D represents a descending pitch (U is ascending)
• R represents a repetition (detectable split, same pitch)
• My singing produces:
– *DDUDDUDRRUDRR
• Approximate string matching finds 2 substitutions (see the sketch below)
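A minimal sketch of the approximate match using plain edit distance; the real system also scored consolidation and fragmentation of notes, which this simplification omits.

```python
def edit_distance(a, b):
    """Dynamic-programming edit distance over contour strings
    (insert, delete, substitute; consolidate/fragment omitted)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitute
    return d[-1][-1]

indexed = "*DDUDDUDRDUDRD"   # "Three Blind Mice" contour from the slide
sung    = "*DDUDDUDRRUDRR"   # the sung query from the slide
print(edit_distance(indexed, sung))   # 2, matching the slide
```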
Muscle Fish Audio Retrieval
• Compute 4 acoustic features for each time slice (see the sketch below)
– Pitch, amplitude, brightness, bandwidth
• Segment at major discontinuities
– Find the average, variance, and smoothness of each segment
• Store pointers to segments in 13 sorted lists
– 4 features with 3 parameters each, plus duration
– Use a commercial database for proximity matching
• Rank order using statistical classification
• Display file name and audio
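A hedged sketch of the per-slice feature computation, reading “brightness” as the spectral centroid and “bandwidth” as the spread around it; these are common interpretations of the terms rather than Muscle Fish’s exact definitions, and pitch estimation is omitted for brevity.

```python
import numpy as np

def slice_features(samples, rate):
    """Amplitude, brightness, and bandwidth for one time slice."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), 1.0 / rate)
    amplitude = np.sqrt(np.mean(samples ** 2))                # RMS level
    brightness = np.sum(freqs * spectrum) / np.sum(spectrum)  # spectral centroid
    bandwidth = np.sqrt(np.sum((freqs - brightness) ** 2 * spectrum)
                        / np.sum(spectrum))                   # spread around it
    return amplitude, brightness, bandwidth

rate = 8000
t = np.arange(rate // 10) / rate                  # one 100 ms slice
tone = 0.5 * np.sin(2 * np.pi * 440 * t)          # synthetic 440 Hz tone
print([round(x, 1) for x in slice_features(tone, rate)])
# [0.4, 440.0, 0.0] -- a pure tone: centroid at 440 Hz, near-zero spread
```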
Summary
• Limited audio indexing is practical now
– Audio feature matching, answering machine detection
• Present interfaces focus on a single technology
– Speech recognition, audio feature matching
– Matching technology is outpacing interface design
Speech-Based Retrieval Systems
LBSC 708R
October 1, 2001
Douglas W. Oard
College of Library and Information Services
University of Maryland
The Size of the Problem
• 30,000 hours in the Maryland Libraries
– Unique collections with limited physical access
• Over 100,000 hours in the National Archives
– With new material arriving at an increasing rate
• Millions of hours broadcast each year
– Over 2,500 radio stations are now webcasting!
Outline
• Retrieval strategies
• Some examples
• Comparing speech and text retrieval
• Speech-based retrieval interface design
Global Internet Audio
[Chart] Over 2,500 Internet-accessible radio and television stations: 1,062 webcasting in English and 1,438 in other languages (source: www.real.com, March 2001)
Speech Retrieval Approaches
• Controlled vocabulary indexing
• Ranked retrieval based on associated text
• Automatic feature-based indexing
• Social filtering based on other users’ ratings
Supporting the Search Process
[Diagram] The search process: Source Selection → Query Formulation → Query → Search (IR System) → Ranked List → Selection → Examination → Document → Delivery, with a query reformulation and relevance feedback loop back to query formulation and a source reselection loop back to source selection (other diagram labels: Nominate, Choose, Predict)
Some Examples (screenshots)
• HotBot Audio Search Results
• ETH Zurich Radio News Retrieval
• BBN Radio News Retrieval
• AT&T Radio News Retrieval
• MIT “Speech Skimmer”
• Cambridge Video Mail Retrieval
• CMU Television News Retrieval
Comparison with Text Retrieval
• Detection and ranking are harder
– Because of speech recognition errors
• Selection is harder
– Useful titles are sometimes hard to obtain
– Date and time alone may not be informative
• Examination is harder
– Browsing is harder in strictly linear media
A Richer View of Speech
• Speaker identification
– Known speakers
– Gender labeling
– “More like this” searches
• Topic segmentation
– Find natural breakpoints for browsing
• Speaker segmentation
– Extract turn-taking behavior
Visualizing Turn-Taking
Other Available Features
• Channel characteristics
– Cell phone, landline, studio mike, ...
• Cultural factors
– Language, accent, speaking rate
• Prosody
– Emphasis detection
• Non-speech audio
– Background sounds, audio cues
Pilot Study
• Student focus groups
– 15 from Journalism, 3 from Library Science
• Preliminary drawing exercise
• Static screen shots and mock-ups
• Focused discussion
• User satisfaction questionnaire
• Structured interviews with domain experts
– Journalism and Library Science faculty
Pilot Study Results
• Graphical speech representations appear viable
– Expected to be useful for high-level browsing when coupled with text transcripts and audio replay
– Some training will be needed
• Suggested improvements
– Adjust result set spacing to facilitate rapid selection
– Identify categories (monologue, conversation, …), potentially useful for search or browsing
For More Information
• Speech-based information retrieval
– http://www.clis.umd.edu/dlrg/speech/
• The VoiceGraph project
– http://www.clis.umd.edu/dlrg/voicegraph/