
Extracting Noise-Robust Features from Audio Data

Chris Burges, John Platt,

Erin Renshaw, Soumya Jana*

Microsoft Research

*U. Illinois, Urbana/Champaign

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

Audio Feature Extraction Goals

• Map high dim space of audio samples to low dim feature space

• Must be robust to likely input distortions

• Must be informative (different audio segments map to distant points)

• Must be efficient (use small fraction of resources on typical PC)

Audio Fingerprinting as a Test Bed

• Train: Given an audio clip, extract one or more feature vectors

• Test: Compare feature vectors, computed from an audio stream, with database of stored vectors, to identify the clip

• Clip is not changed, unlike watermarking

What is audio fingerprinting?

Why audio fingerprinting?

• Enable user to identify streaming audio (e.g. radio) in real time

• Track music play for royalty assignment

• Identify audio segments in e.g. television or radio: “Did my commercial air?”

• Search unlabeled data, e.g. on web: “Did someone violate my copyright?”

Goals, cont.

• Feature extraction: current audio features are often hand-crafted. Can we learn the features we need?

• Database must be able to handle 1000s of requests, and contain millions of clips

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

Feature Extraction

Downsample to 11.025 kHz mono
→ Collate samples
→ Compute windowed FFT coefficients, take magnitude
→ De-equalize; perceptual thresholding
→ Compute robust projections
→ Output 64 floats every half-frame
(repeat for each half-frame step)
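A minimal sketch of this loop in Python/NumPy, using a plain windowed FFT where the talk elsewhere uses the MCLT; the frame length and the proj matrix (standing in for the learned DDA projection) are assumptions:

import numpy as np

def extract_features(audio, proj):
    """audio: mono signal already downsampled to 11.025 kHz.
    proj: hypothetical (64, 2048) projection learned by DDA."""
    frame = 4096                    # ~372 ms at 11.025 kHz
    step = frame // 2               # ~186 ms half-frame step
    window = np.hanning(frame)
    feats = []
    for start in range(0, len(audio) - frame + 1, step):
        spec = np.abs(np.fft.rfft(audio[start:start + frame] * window))[:2048]
        # (de-equalization and perceptual thresholding would go here)
        feats.append(proj @ spec)   # 64 floats every half-frame
    return np.array(feats)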

De-Equalization

Goal: remove slow variation in frequency space

MCLT → magnitude r (phase set aside)
→ DCT(log r)
→ Apply highpass mask (1.0 pass, 0.0 stop)
→ exp(inverse DCT)
→ Normalize power

De-Equalization

[Figure: log spectrum before and after de-equalization]

De-equalize by flattening the log spectrum. But does this distort? We can reconstruct the signal by adding the phases back in:
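A minimal sketch of the de-equalization step, assuming the highpass mask simply zeroes the lowest-order DCT terms (the cutoff n_low is an assumed value):

import numpy as np
from scipy.fft import dct, idct

def deequalize(mag, n_low=6):
    """Flatten the slow spectral envelope of a magnitude spectrum."""
    logmag = np.log(mag + 1e-10)            # log r (guard against log 0)
    c = dct(logmag, norm='ortho')           # DCT(log r)
    c[:n_low] = 0.0                         # highpass mask: kill slow envelope
    flat = np.exp(idct(c, norm='ortho'))    # exp(inverse DCT)
    return flat / np.linalg.norm(flat)      # normalize power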

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

Oriented PCA

• Want a projection such that the variance of the projected signal is maximized and the MSE of the projected noise is simultaneously minimized. [Diamantaras & Kung, ’96]

• Maximize the generalized Rayleigh quotient

  R(q) = (q^T C1 q) / (q^T C2 q)

where C1 is the signal covariance and C2 the noise correlation.
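In code, the OPCA directions are the top generalized eigenvectors of (C1, C2); a sketch using SciPy's generalized symmetric eigensolver (the regularizer reg is an assumption for numerical stability):

import numpy as np
from scipy.linalg import eigh

def opca(signal, noise, k=64, reg=1e-6):
    """signal: (N, D) feature vectors; noise: (M, D) distorted-minus-clean
    difference vectors. Returns the (k, D) projection whose rows maximize
    the generalized Rayleigh quotient q^T C1 q / q^T C2 q."""
    C1 = np.cov(signal, rowvar=False)            # signal covariance
    C2 = noise.T @ noise / len(noise)            # noise correlation
    C2 += reg * np.eye(C2.shape[0])              # assumed regularizer
    vals, vecs = eigh(C1, C2)                    # solves C1 q = lambda C2 q
    return vecs[:, ::-1][:, :k].T                # top-k generalized eigenvectors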

Properties of Oriented PCA

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

Distortion Discriminant Analysis

DDA is a convolutional neural network whose weights are trained using OPCA:

[Architecture: layer 1 projects each 2048-dim preprocessed frame (372 ms, step every 186 ms) to 64 outputs via OPCA; layer 2 collates 32 consecutive 64-dim outputs (6.1 s of audio, step every 186 ms) into a 2048-dim vector and projects it to a 64-dim fingerprint via a second OPCA.]
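A sketch of how the two trained layers would be applied, with shapes taken from the diagram (32 × 64 = 2048 inputs to the second layer):

import numpy as np

def dda_fingerprint(frames, W1, W2):
    """frames: (T, 2048) preprocessed spectral frames, one per 186 ms step.
    W1, W2: (64, 2048) OPCA projections for the two layers."""
    h = frames @ W1.T                          # layer 1: 64 floats per frame
    fps = []
    for t in range(len(h) - 31):               # 32 frames span ~6.1 s of audio
        fps.append(W2 @ h[t:t + 32].ravel())   # 32 * 64 = 2048 -> 64 dims
    return np.array(fps)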

The Distortions

• Original
• 3:1 compression above 30 dB
• Light Grunge
• Distort Bass
• “Mackie Mid boost” filter
• Old Radio
• Phone
• RealAudio© Compander

DDA Training Data

• 50 audio segments of 20 s each, concatenated (16.7 min)
• Compute distortions
• Solve generalized eigenvector problem (after preprocessing)
• Subtract mean projections of training data; normalize noise to unit variance
• Repeat (for the second layer)
• Use 10 further 20 s segments as a validation set to compute the scale factor for the final projections
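A hedged sketch of one training pass under these steps, assuming frame-aligned feature matrices for the clean and distorted audio and the opca solver sketched earlier:

import numpy as np

def train_dda_layer(clean, distorted_list, opca, k=64):
    """clean: (N, D) features of the concatenated training audio;
    distorted_list: matching (N, D) features of each distorted version;
    opca: the solver from the earlier sketch."""
    noise = np.vstack([d - clean for d in distorted_list])
    W = opca(clean - clean.mean(axis=0), noise, k)   # subtract mean of train data
    sigma = (noise @ W.T).std(axis=0) + 1e-12        # per-output noise std
    return W / sigma[:, None]                        # norm noise to unit variance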

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

OPCA vs. Bark Band Averaging

Compare with a ‘standard’ approach:

• 23 ms windows, same preprocessing
• Use 10 Bark bands from 510 Hz to 2.7 kHz
• Average over bands to get 10-dimensional features
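A sketch of this baseline; the band edges below are the standard critical-band (Bark) edges between 510 Hz and 2.7 kHz, which give exactly 10 bands:

import numpy as np

# Critical-band (Bark) edges from 510 Hz to 2.7 kHz: 10 bands.
BARK_EDGES = [510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700]

def bark_features(mag, sample_rate=11025):
    """mag: magnitude spectrum of one frame. Average over each Bark band
    to get a 10-dimensional feature vector."""
    freqs = np.linspace(0, sample_rate / 2, len(mag))
    feats = [mag[(freqs >= lo) & (freqs < hi)].mean()
             for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:])]
    return np.array(feats)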

OPCA vs. Bark Averaging

[Figure: noise-to-signal variance ratio vs. projection index, OPCA vs. Bark, for the Light Grunge, Distort Bass, Phone, and Old Radio distortions]

OPCA vs. Bark Averaging, cont.

[Figure: noise-to-signal variance ratio vs. projection index, OPCA vs. Bark, for the 3:1 compression above 30 dB, Mackie Mid, and RealAudio Compander distortions]

Summed Distortions

[Figure: variance ratios summed over all distortions vs. projection ID, OPCA vs. Bark]

DDA Eigenspectrum

[Figure: generalized eigenvalues (signal-to-noise ratios) vs. eigenvalue index, for the 1st and 2nd DDA layers]

DDA for Alignment Robustness

Train the second layer with an alignment distortion:

[Figure: minimum target/target and minimum target/non-target squared distances vs. alignment shift (ms), with and without shift training]
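One plausible way to realize this, sketched below: treat windows of first-layer outputs shifted by a few 186 ms steps as distorted versions of the aligned window when building the second layer's noise set (the maximum shift is an assumed value):

import numpy as np

def alignment_noise(h, max_shift=4):
    """h: (T, 64) first-layer outputs. Returns difference vectors between
    windows shifted by 1..max_shift frame steps and the aligned window."""
    n = len(h) - 32 - max_shift
    base = np.array([h[t:t + 32].ravel() for t in range(n)])
    noise = [np.array([h[t + s:t + s + 32].ravel() for t in range(n)]) - base
             for s in range(1, max_shift + 1)]
    return np.vstack(noise)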

Test Results – Large Test Set

• 500 songs, 36 hours of music
• One stored fingerprint computed for every song, about 10 s in, at a random start point
• Test alignment noise and false positive/negative rates, with and without distortions
• Define the distance between two clips as the minimum distance between the fingerprint of the first and all traces of the second
• Approx. 3.5 × 10^8 chances for a false positive
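The distance definition above, as a short sketch:

import numpy as np

def clip_distance(stored_fp, traces):
    """stored_fp: (64,) fingerprint of the first clip; traces: (T, 64)
    fingerprints computed all along the second clip. The clip distance
    is the minimum squared distance over all traces."""
    diff = traces - stored_fp
    return np.min(np.einsum('ij,ij->i', diff, diff))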

Test Results: Positives

[Figure: normalized squared distance for fingerprints sorted by distance, with 372 ms and 186 ms steps]

Test Results: Negatives

Smallest ‘negative’ score: 0.14, largest ‘positive’ score: 0.026

[Figure: normalized squared distance for the smallest 100 of 249,500 target/non-target values]

Fit with parametric log-logistic model

P(x) = 1 / (1 + exp(-(a log x + b)))

so the logit y = log(P / (1 - P)) = a log x + b is linear in log x.
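Since the logit is linear in log x, the fit reduces to ordinary linear regression; a sketch (the parameter names a, b are generic):

import numpy as np

def fit_log_logistic(x, P):
    """x: distances; P: empirical probabilities in (0, 1). Fits
    P(x) = 1 / (1 + exp(-(a*log x + b))) by regressing the logit
    y = log(P / (1 - P)) linearly against log x."""
    y = np.log(P / (1 - P))
    a, b = np.polyfit(np.log(x), y, 1)
    return a, b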

[Figure: logit y vs. log(x), data and linear fit]

Gives 2.3 false positives per song for 10^6 fingerprints.

Tests for Other Distortions

• Take first 10 test clips

• Distort each in 7 ways

• Further distort with 2% pitch-preserving time compression (distortion not in training set)

Test Results on Distorted Data

[Figures: normalized squared distance for (a) the smallest 100 of 39,920 target/non-target values and (b) all target/target values, sorted, with the time-compressed distortions shown unsorted]

Computational Cost

On a 750 MHz PIII, computing features on a stream and searching a database of 10,000 audio clips by simple sequential scan takes 12 MB of memory and 10% of the CPU.
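For scale, the scan itself is a few lines of NumPy; a sketch (the threshold is an assumed acceptance cutoff; 10,000 fingerprints of 64 floats is only about 2.5 MB, so the 12 MB figure must cover more than the fingerprints alone):

import numpy as np

def identify(stream_fp, db_fps, threshold):
    """stream_fp: (64,) fingerprint from the live stream; db_fps: (N, 64)
    stored fingerprints (N ~ 10,000 on this slide)."""
    diff = db_fps - stream_fp
    dist = np.einsum('ij,ij->i', diff, diff)   # squared distance to each entry
    best = int(np.argmin(dist))
    return best if dist[best] < threshold else None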

Conclusions

• DDA features work well for audio fingerprinting. How about other audio tasks?

• A layered approach is necessary to keep DDA computation feasible.

• This is a convolutional neural net, with weights chosen by OPCA and linear transfer functions.

• Demo

http://research.microsoft.com/~cburges/tech_reports/dda.pdf

Demo (1 min. 14 sec)

• Repeated musical phrase

• Voice

• Clean clip C

• C + equalization

• C + equalization + reverb

• C + equalization + reverb + repeat echo