Statistical automatic identification of microchiroptera from echolocation calls
Lessons learned from human automatic speech recognition
Mark D. Skowronski and John G. Harris
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida
Gainesville, FL, USA
November 19, 2004
Overview
• Motivations for bat acoustic research
• Review bat call classification methods
• Contrast with 1970s human ASR
• Experiments
• Conclusions
Bat research motivations
• Bats are among:
  – the most diverse,
  – the most endangered,
  – and the least studied mammals.
• Close relationship with insects
  – agricultural impact
  – disease vectors
• Acoustic research is non-invasive and targets a significant behavioral domain (echolocation)
• Simplified biological acoustic communication system (compared to human speech)
Echolocation calls
• Features (holistic)
  – Frequency extrema
  – Duration
  – Shape
  – # harmonics
  – Call interval

[Figure: Mexican free-tailed calls, concatenated]
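The holistic features listed above can be estimated directly from a framed spectral analysis of a call. A minimal NumPy sketch (illustrative only — the original work used Matlab, and all function names, frame sizes, and thresholds here are assumptions), extracting frequency extrema and duration from a per-frame FFT peak track:

```python
import numpy as np

def call_features(x, fs, frame_len=128, hop=64, thresh_db=-30.0):
    """Estimate holistic call features: frequency extrema and duration.

    Frames the signal, keeps frames within thresh_db of the loudest
    frame, and tracks the FFT peak frequency across active frames.
    """
    n_frames = (len(x) - frame_len) // hop + 1
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    peak_db = 20 * np.log10(spec.max(axis=1) + 1e-12)
    active = peak_db > peak_db.max() + thresh_db       # frames containing the call
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    track = freqs[spec.argmax(axis=1)][active]         # peak-frequency contour
    duration = active.sum() * hop / fs
    return track.min(), track.max(), duration

# Synthetic downward FM sweep, 60 kHz -> 25 kHz over 5 ms at 200 kS/s
fs = 200_000
t = np.arange(0, 0.005, 1.0 / fs)
f0, f1 = 60_000.0, 25_000.0
x = np.sin(2 * np.pi * (f0 * t + (f1 - f0) / (2 * t[-1]) * t ** 2))
fmin, fmax, dur = call_features(x, fs)
```

Shape, harmonic count, and call interval would need additional logic; the sketch shows only the frequency-extrema and duration portion of the feature set.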
Current classification methods
• Expert spectrogram readers
  – Manual or automatic feature extraction
  – Comparison with exemplar spectrograms
• Automatic classification
  – Decision trees
  – Discriminant function analysis
Parallels the knowledge-based approach to human ASR from the 1970s (acoustic phonetics, expert systems, cognitive approach).
Acoustic phonetics
• Bottom-up paradigm
  – Frames, boundaries, groups, phonemes, words
• Manual or automatic feature extraction
  – Determined by experts to be important for speech
• Classification
  – Decision tree, discriminant functions, neural network, Gaussian mixture model, Viterbi path

Example phoneme sequence (ARPAbet): DH AH F UH T B AO L G EY EM IH Z OW V ER ("the football game is over")
Acoustic phonetics limitations
• Variability of conversational speech
  – Complex rules, difficult to implement
• Feature estimates are brittle
  – Variable noise robustness
• Hard decisions: errors accumulate

Human ASR shifted to the information-theoretic (machine learning) paradigm, which better accounts for the variability of speech and noise.
Information theoretic ASR
• Data-driven models from computer science
  – Non-parametric: dynamic time warping (DTW)
  – Parametric: hidden Markov model (HMM)
• Frame-based
  – Expert information in feature extraction
  – Models account for feature and temporal variability
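DTW, the non-parametric option above, aligns two variable-length feature sequences by a minimum-cost warping path. A minimal NumPy sketch (illustrative, not the authors' implementation):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a, b: arrays of shape (Ta, d) and (Tb, d) of per-frame features.
    Returns the cumulative Euclidean cost of the best warping path.
    """
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible predecessors
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Ta, Tb]

# Toy usage: a time-stretched copy warps onto the original at zero cost
a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.repeat(a, 2, axis=0)  # every frame doubled in duration
d_stretch = dtw_distance(a, b)
```

Classification then labels an unknown call with the class of its nearest template under this distance, which is why DTW needs no training beyond storing templates.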
Data collection
• UF Bat House, home to 60,000 bats
  – Mexican free-tailed bat (vast majority)
  – Evening bat
  – Southeastern myotis
• Continuous recording
  – 90 minutes around sunset
  – ~20,000 calls
• Equipment:
  – B&K microphone (4939), 100 kHz bandwidth
  – B&K preamp (2670)
  – Custom amplifier/anti-aliasing filter
  – NI 6036E 200 kS/s A/D card
  – Laptop running Matlab
Experiment design
• Hand labels
  – 436 calls (2% of the data)
  – Four classes, a priori proportions: 34, 40, 20, 6%
  – All experiments on hand-labeled data only
  – No hand-labeled calls excluded from experiments

[Figure: the four call classes, labeled 1–4]
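Leave-one-out evaluation, reported in the results below, trains on all labeled calls but one and tests on the held-out call, repeating for every call. A sketch in NumPy (the nearest-centroid classifier here is a hypothetical stand-in, not one of the classifiers used in the study):

```python
import numpy as np

def loo_accuracy(X, y, classify):
    """Leave-one-out accuracy: hold out each call in turn, train the
    classifier on the rest, and test on the held-out call."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        hits += classify(X[mask], y[mask], X[i]) == y[i]
    return hits / len(X)

def nearest_centroid(X_train, y_train, x):
    """Hypothetical stand-in classifier: label of the closest class centroid."""
    labels = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in labels])
    return labels[np.argmin(np.linalg.norm(centroids - x, axis=1))]

# Toy usage on a separable two-class set
X = np.array([[0.0], [0.1], [5.0], [5.1]])
y = np.array([0, 0, 1, 1])
acc = loo_accuracy(X, y, nearest_centroid)
```

With only 436 labeled calls, leave-one-out uses the data efficiently at the cost of one training run per call.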
Experiments
• Baseline
  – Features
    • Zero crossing
    • MUSIC super-resolution frequency estimator
  – Classifier
    • Discriminant function analysis, quadratic boundaries
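The zero-crossing feature can be illustrated with a short sketch (NumPy; illustrative only — the MUSIC estimator would replace this function in the higher-accuracy baseline):

```python
import numpy as np

def zero_crossing_freq(x, fs):
    """Estimate dominant frequency from the zero-crossing rate.

    A (locally) sinusoidal signal crosses zero twice per cycle,
    so f is approximately crossings / (2 * duration).
    """
    signs = np.signbit(x)
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return crossings * fs / (2.0 * (len(x) - 1))

# Toy usage: 40 kHz tone sampled at 200 kS/s for 5 ms
fs = 200_000
t = np.arange(0, 0.005, 1.0 / fs)
f_est = zero_crossing_freq(np.sin(2 * np.pi * 40_000 * t), fs)
```

Zero crossings are cheap but noise-sensitive; MUSIC trades speed for robustness, consistent with the conclusions later in the deck.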
• DTW and HMM
  – Features
    • Frequency (MUSIC), log energy, first derivatives (HMM only)
  – HMM
    • 5 states/model
    • 4 Gaussian mixtures/state
    • Diagonal covariances
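HMM classification scores each call under one trained model per class and picks the most likely. A minimal forward-algorithm sketch for a diagonal-covariance Gaussian HMM (NumPy; a single Gaussian per state for brevity, whereas the study used 4-component mixtures, and all names here are illustrative):

```python
import numpy as np

def logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis) if axis is not None else float(s)

def hmm_loglik(obs, log_pi, log_A, means, variances):
    """Forward-algorithm log-likelihood of obs under a Gaussian HMM.

    obs: (T, d) feature frames; log_pi: (S,) initial log-probs;
    log_A: (S, S) transition log-probs; means, variances: (S, d)
    diagonal-covariance Gaussian parameters per state.
    """
    def log_emit(x):
        # log N(x; mean_s, diag(var_s)) for every state s at once
        return -0.5 * np.sum(np.log(2 * np.pi * variances)
                             + (x - means) ** 2 / variances, axis=1)

    alpha = log_pi + log_emit(obs[0])
    for x in obs[1:]:
        # alpha_j(t) = emit_j(t) + logsum_i [alpha_i(t-1) + log_A[i, j]]
        alpha = log_emit(x) + logsumexp(alpha[:, None] + log_A, axis=0)
    return logsumexp(alpha)

# Sanity check: a one-state HMM reduces to a sum of Gaussian log-densities
obs = np.zeros((3, 1))
ll = hmm_loglik(obs, log_pi=np.array([0.0]), log_A=np.array([[0.0]]),
                means=np.zeros((1, 1)), variances=np.ones((1, 1)))
```

A call is then labeled by the argmax of `hmm_loglik` over the per-species models; training the parameters (Baum–Welch) is the slow step noted in the conclusions.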
Results
• Baseline, zero crossing
  – Leave-one-out: 72.5% correct
  – Repeated trials: 72.5 ± 4% (mean ± std)
• Baseline, MUSIC
  – Leave-one-out: 79.1%
  – Repeated trials: 77.5 ± 4%
• DTW, MUSIC
  – Leave-one-out: 74.5%
  – Repeated trials: 74.1 ± 4%
• HMM, MUSIC
  – Test on train: 85.3%
Confusion matrices
(rows: true class, columns: predicted class; last column: per-class accuracy)

Baseline, zero crossing
        1    2    3    4
  1   107   38    1    2   72.3%
  2    21  134   16    4   76.6%
  3     2   29   57    0   64.8%
  4     4    3    0   18   72.0%
  Overall: 72.5%

Baseline, MUSIC
        1    2    3    4
  1   110   36    1    1   74.3%
  2    12  149   12    2   85.1%
  3     4   18   66    0   75.0%
  4     3    2    0   20   80.0%
  Overall: 79.1%

DTW, MUSIC
        1    2    3    4
  1   115   29    0    4   77.7%
  2    32  131   11    1   74.9%
  3     5   20   63    0   71.6%
  4     5    4    0   16   64.0%
  Overall: 74.5%

HMM, MUSIC
        1    2    3    4
  1   118   25    0    5   79.7%
  2    10  154    5    6   88.0%
  3     1   12   75    0   85.2%
  4     0    0    0   25  100.0%
  Overall: 85.3%
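The per-class and overall accuracies shown with each matrix follow directly from the counts; for example, for the HMM/MUSIC matrix (NumPy):

```python
import numpy as np

# HMM/MUSIC confusion matrix from the slide
# (rows: true class, columns: predicted class)
conf = np.array([
    [118,  25,  0,  5],
    [ 10, 154,  5,  6],
    [  1,  12, 75,  0],
    [  0,   0,  0, 25],
])

per_class = np.diag(conf) / conf.sum(axis=1)  # right-hand column of the table
overall = np.trace(conf) / conf.sum()         # 372 of the 436 calls correct
```

The row sums (148, 175, 88, 25) also recover the a priori class proportions of 34, 40, 20, and 6% given in the experiment design.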
Conclusions
• Human ASR algorithms are applicable to bat echolocation calls
• Experiments
  – Weakness: accuracy of the class labels
  – HMM most accurate, but undertrained
  – MUSIC frequency estimate robust but slow
• Machine learning
  – DTW: fast training, slow classification
  – HMM: slow training, fast classification
Further information
• http://www.cnel.ufl.edu/~markskow
• [email protected]
• DTW reference:
  – L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.
• HMM reference:
  – L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Readings in Speech Recognition, A. Waibel and K.-F. Lee, Eds., pp. 267–296, Kaufmann, San Mateo, CA, 1990.