A Variational Parametric Model for Audio Synthesis
Krishna Subramani, supervised by Prof. Preeti Rao
Electrical Engineering, Indian Institute of Technology Bombay, India
Audio Synthesis

- What comes to your mind when you hear 'Audio Synthesis'?

  Figure: One of the early Moog modular synthesizers.

- More generally, it involves specifying controlling parameters to a synthesizer to obtain an audio output.
- What are the main parameters which govern the audio generation (at a high level)?
  1. Timbre
  2. Pitch
  3. Loudness
- Early analog synthesizers used voltage-controlled oscillators, filters, and amplifiers to generate the waveform, and 'envelope generators' to shape it.
- Data-driven statistical modeling + computing power ⇒ Deep Learning for audio synthesis.
Generative Models for Audio Synthesis

- Rely on the ability of algorithms to extract musically relevant information from vast amounts of data.
- Autoregressive modeling, Generative Adversarial Networks, and Variational Autoencoders are some of the proposed generative modeling methods.
- Most methods in the literature try to model audio signals directly in the time or frequency domain.
Our Nearest Neighbours

- [Sarroff and Casey 2014] were the first to use autoencoders to perform frame-wise reconstruction of short-time magnitude spectra.
  ✓ Learn perceptually relevant lower-dimensional representations ('latent space') of audio to help in synthesis.
  ✗ 'Graininess' in the reconstructed sound.
- [Roche et al 2018] extended this analysis to try out different autoencoder architectures.
  ✓ Helped by the release of NSynth [Engel et al 2017].
  ✓ Usability of the latent space in audio interpolation.
- [Esling et al 2018] regularized the VAE latent space in order to effect control over the perceptual timbre of synthesized instruments.
  ✓ Enabled them to 'generate' and 'interpolate' audio in this space.
Our Nearest Neighbours

- A major issue with the previous methods was frame-wise analysis-synthesis based reconstruction, which fails to model temporal evolution.
- [Engel et al 2017], inspired by WaveNet's [Oord et al 2016] autoregressive modeling capabilities for speech, extended it to musical instrument synthesis.
  ✓ Generation of realistic and creative sounds.
  ✗ Complex network that was difficult to train and sample from.
- [Wyse 2018] also modelled the audio autoregressively, albeit by conditioning the waveform samples on additional parameters like pitch, velocity (loudness), and instrument class.
  ✓ Achieved better control over generation by conditioning.
  ✗ Unable to generalize to untrained pitches.
Why Parametric?

- Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesis of a given instrument's sound with flexible control over the pitch and loudness dynamics.
- Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) being kept constant [Roebel and Rodet 2005].
- A powerful parametric representation, over raw waveform or spectrogram, has the potential to achieve high quality with less training data.
  1. [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way.
Dataset

- Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments.
- We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71).
- Why we chose the violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch.

Figure: Instances per note in the overall dataset (26-34 instances per MIDI note across 60-71).

- The average note duration for the chosen octave is about 4.5 s per note.
Dataset

- We split the data into train (80%) and test (20%) instances across MIDI note labels.
- Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances.
Non-Parametric Reconstruction

- The frame-wise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches.
- We consider two cases: training including and excluding MIDI 63, along with its 3 neighbouring pitches on either side.

[Diagram: sustain segment → FFT → Encoder → latent Z → Decoder (AE)]

- Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984], as sketched below.
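To make the inversion step concrete, here is a minimal Python sketch using librosa (the file name and STFT parameters are illustrative assumptions, not the exact settings used in this work):

```python
import numpy as np
import librosa

# Load a note recording (hypothetical file) and compute magnitude spectra.
y, fs = librosa.load('violin_midi63.wav', sr=None)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# ... each column (frame) of S would be passed through the trained AE ...
S_hat = S  # placeholder for the frame-wise reconstructed spectrogram

# Invert the (phaseless) magnitude spectrogram back to a waveform.
y_hat = librosa.griffinlim(S_hat, n_iter=32, hop_length=256, n_fft=1024)
```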
Non-Parametric Reconstruction

Figure: Input MIDI 63 [audio example 1]
Figure: Including MIDI 63 [audio example 2]
Figure: Excluding MIDI 63 [audio example 3]
Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual).

[Diagram: sustain segment → FFT → HpR → f0 + harmonic magnitudes]

- Output of the HpR block ⇒ log (dB) magnitudes + harmonics; a rough sketch of this step follows.
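As a rough illustration of this step, the sketch below simply samples the magnitude spectrum at multiples of f0; the full HpR model of [Serra et al 1997] performs proper sinusoidal peak detection and residual subtraction, and all parameter values here are assumptions:

```python
import numpy as np

def harmonic_log_magnitudes(mag_frame, f0, fs=48000, n_fft=1024, n_harm=40):
    """Sample one frame's magnitude spectrum at harmonic multiples of f0
    and return their log (dB) magnitudes."""
    freqs = np.arange(1, n_harm + 1) * f0
    freqs = freqs[freqs < fs / 2]                    # keep harmonics below Nyquist
    bins = np.round(freqs * n_fft / fs).astype(int)  # nearest FFT bin per harmonic
    return 20 * np.log10(mag_frame[bins] + 1e-12)    # dB magnitudes
```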
Parametric Model

2. log (dB) magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979].
- TAE ⇒ iterative cepstral liftering:

      A_0(k) = log|X(k)|,  V_0 = -∞
      while (A_i - A_0 < Δ):
          A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
          V_i = FFT(lifter(A_i))

[Diagram: f0 + magnitudes → TAE → CCs + f0]
Parametric Model

- No open-source implementation was available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]; a minimal sketch follows below.
- The figure shows a TAE snapshot from [Caetano and Rodet 2012], similar to the results we get [audio examples 4, 5].
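A minimal numpy sketch of the iteration above, under our assumptions (the stopping threshold, iteration cap, and symmetric-lifter layout are choices for illustration, not necessarily those of [Roebel and Rodet 2005]):

```python
import numpy as np

def true_amplitude_envelope(A0, k_cc, delta=2.0, max_iter=200):
    """A0: log-magnitude spectrum (dB) on the full, symmetric FFT grid.
    k_cc: lifter cutoff (number of cepstral coefficients kept).
    Returns the smooth spectral envelope V."""
    N = len(A0)
    A = A0.copy()
    V = np.full(N, -np.inf)
    for _ in range(max_iter):
        A = np.maximum(A, V)            # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        c = np.fft.ifft(A).real         # cepstrum of the updated curve
        c[k_cc + 1:N - k_cc] = 0.0      # lifter: zero the high quefrencies (symmetric)
        V = np.fft.fft(c).real          # V_i = FFT(lifter(A_i))
        if np.max(A0 - V) < delta:      # stop once the envelope covers the spectrum
            break
    return V
```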
Parametric Model

[Figure: TAE spectral envelopes, magnitude (dB) vs. frequency (kHz), for MIDI 60, 65, and 71]

- The spectral envelope shape varies across pitch:
  1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012].
  2. Variation due to the TAE algorithm.
- TAE → a smooth function to estimate harmonic amplitudes.
Parametric Model

- The number of CCs (cepstral coefficients, K_cc) depends on the sampling rate (F_s) and pitch (f_0):

      K_cc ≤ F_s / (2 f_0)

- For our network, we choose the maximum K_cc (at the lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions (see the sketch below).
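A small sketch of this bookkeeping (F_s = 48 kHz is an assumption, consistent with K_cc = 91 at MIDI 60):

```python
import numpy as np

FS = 48000  # assumed sampling rate

def midi_to_hz(midi):
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)

def cc_order(f0, fs=FS):
    return int(fs / (2.0 * f0))          # K_cc <= F_s / (2 f_0)

K_MAX = cc_order(midi_to_hz(60))         # lowest pitch in the octave -> 91

def pad_ccs(cc):
    """Zero-pad a (shorter) higher-pitch CC vector to the fixed K_MAX length."""
    cc = np.asarray(cc, dtype=float)
    return np.pad(cc, (0, K_MAX - len(cc)))
```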
Generative Models

- Autoencoders [Hinton and Salakhutdinov 2006]: minimize the MSE (mean squared error) between the input and the network's reconstruction.
  • Simple to train, and perform good reconstruction (since they minimize MSE).
  • Not truly a generative model, as you cannot generate new data.
- Variational Autoencoders [Kingma and Welling 2013]: inspired by variational inference, they enforce a prior on the latent space.
  • Truly generative, as we can 'generate' new data by sampling from the prior.
- Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015]: same principle as a VAE, but they learn the distribution conditioned on an additional conditioning variable.
Generative Models

Why VAE over AE?
- A continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why CVAE over VAE?
- Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control.
Network Architecture

[Diagram: CCs + f0 → Encoder → latent Z → Decoder (CVAE)]

- The network input is CCs → the MSE represents a perceptually relevant distance, in terms of the squared error between the input and reconstructed log-magnitude spectral envelopes.
- We train the network on frames from the train instances.
- For evaluation, the MSE values reported ahead are the average reconstruction error across all test-instance frames.
Network Architecture

- Main hyperparameters:
  1. β controls the relative weighting between reconstruction and prior enforcement:

         L ∝ MSE + β · KLD

[Figure: MSE vs. latent space dimensionality for the CVAE, with β = 0.01, 0.1, 1]

- There is a tradeoff between the two terms; we choose β = 0.1. A sketch of this objective follows.
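In PyTorch terms (an assumed framework choice), the weighted objective could be sketched as:

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    """L ∝ MSE + β·KLD: reconstruction error plus weighted prior enforcement."""
    mse = F.mse_loss(x_hat, x)
    # KL divergence between N(mu, sigma^2) and the standard normal prior
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld
```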
Network Architecture

- Main hyperparameters:
  2. Dimensionality of the latent space governs the network's reconstruction ability.

[Figure: MSE vs. latent space dimensionality, CVAE (β = 0.1) vs. AE]

- Steep fall initially, flatter later; we choose dimensionality = 32.
Network Architecture

- Network size: [91, 91, 32, 91, 91].
  - 91 is the dimension of the input CCs; 32 is the latent space dimensionality.
- Linear fully connected layers + Leaky ReLU activations.
- ADAM [Kingma and Ba 2014] with initial learning rate 10^-3.
- Trained for 2000 epochs with batch size 512.
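A minimal PyTorch sketch consistent with these choices (layer widths [91, 91, 32, 91, 91]; the scalar pitch encoding and other details are assumptions):

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim=91, z_dim=32, c_dim=1, h_dim=91):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.LeakyReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + c_dim, h_dim), nn.LeakyReLU(),
            nn.Linear(h_dim, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, lr = 10^-3
```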
Experiments

- Two kinds of experiments to demonstrate the network's capabilities:
  1. Reconstruction: omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch.
  2. Generation: how well the model 'synthesizes' note instances with new, unseen pitches.
Reconstruction

- Two training contexts:

  1. T (✗) is the target; ✓ marks training instances:

     MIDI   T-3  T-2  T-1  T  T+1  T+2  T+3
     Kept?  ✓    ✓    ✓    ✗  ✓    ✓    ✓

  2. Octave endpoints:

     MIDI   60  61  62  63  64  65
     Kept?  ✓   ✗   ✗   ✗   ✗   ✗
     MIDI   66  67  68  69  70  71
     Kept?  ✗   ✗   ✗   ✗   ✗   ✓

- In each case, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances.
Reconstruction

[Figure: MSE vs. pitch (MIDI 60-72) for AE and CVAE]

- Both the AE and the CVAE reasonably reconstruct the target pitch [audio examples 6, 7, 8].
- The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above) [audio examples 9, 10, 11].
- Conditioning helps to capture the pitch dependency of the spectral envelope more accurately.
Reconstruction

- To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below.

[Diagram: 'Conditional' AE (cAE): f0 appended to the input CCs; the output reconstructs CCs' and f0']

- [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model.
- The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0.
Reconstruction

[Figure: MSE vs. pitch for AE, CVAE, and cAE]

- We train the cAE on the octave endpoints and evaluate performance as shown above.
- Appending f0 does not improve performance; it is in fact worse than the AE.
Generation

- We are interested in 'synthesizing' audio.
- We see how well the network can generate an instance of a desired pitch (which the network has not been trained on).

[Figure: sampling from the network - latent Z + f0 → Decoder]

- We follow the procedure in [Blaauw and Bonada 2016]: a random walk to sample points coherently from the latent space, as sketched below.
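Re-using the CVAE decoder sketched earlier, the random walk could look like this (the step size and scalar MIDI conditioning are assumptions):

```python
import torch

num_frames = 200                            # desired note length in frames
c = torch.tensor([[65.0]])                  # target pitch (MIDI 65), assumed encoding
z = torch.randn(1, 32)                      # start from the standard normal prior

envelope_frames = []
with torch.no_grad():
    for _ in range(num_frames):
        z = z + 0.05 * torch.randn_like(z)  # small steps -> smooth latent trajectory
        envelope_frames.append(model.dec(torch.cat([z, c], dim=-1)))
```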
Generation

- We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
- On listening, the generated audio still lacks the soft, noisy sound of the violin bowing [audio examples 12, 13, 14].
- For a more realistic synthesis, we generate a violin note with vibrato [audio example 15].
Putting it all together

- We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones.
- Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
- Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously.
Putting it all together

- Our model is still far from perfect, and we have to improve it further:
  1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural.
  2. We have not taken dynamics into account; a more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
  3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.
- Our contributions:
  1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.
References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description

1. Input MIDI 63 to the spectral model
2. Spectral model reconstruction (trained on MIDI 63)
3. Spectral model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the parametric model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
Figure One of the early Moog Modular Synthesizers
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
I More generally it involves us specifying controlling parametersto a synthesizer to obtain an audio output
I What are the main parameters which govern the audiogeneration(at a high level)
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
1 Timbre
2 Pitch
3 Loudness
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it
I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it
I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis
336
Generative Models for Audio Synthesis
I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data
I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods
I Most methods in literature try to model audio signals directlyin the time or frequency domain
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]
X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
Why Parametric?

▶ Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesizing a given instrument's sound with flexible control over the pitch and loudness dynamics.
▶ Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet, 2005], as sketched below.
▶ A powerful parametric representation, over raw waveform or spectrogram, has the potential to achieve high quality with less training data.
1. [Blaauw and Bonada, 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way.
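To illustrate the source-filter idea, here is a minimal sketch (not the authors' code; the function name and the FFT-grid interpolation are our assumptions) of keeping the spectral envelope fixed while re-sampling it at the harmonics of a new f0:

```python
# Sketch: source-filter pitch shifting with a fixed envelope.
# envelope_db is a smooth log envelope sampled on an FFT grid.
import numpy as np

def harmonic_amplitudes(envelope_db, freqs_hz, f0, fs):
    # Harmonics of the new f0 up to the Nyquist frequency
    harmonics = np.arange(1, int(fs / (2 * f0)) + 1) * f0
    # Read the (unchanged) envelope at the new harmonic positions
    amps_db = np.interp(harmonics, freqs_hz, envelope_db)
    return amps_db, harmonics
```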
Dataset

▶ Good-sounds dataset [Romani Picas et al., 2015], consisting of individual note and scale recordings for 12 different instruments.
▶ We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71).

Why we chose the Violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch.

[Bar chart: instances per MIDI note (60-71); counts range from 26 to 34 per note. x-axis: Instances, y-axis: MIDI]
Figure: Instances per note in the overall dataset

▶ The average note duration for the chosen octave is about 4.5 s per note.
▶ We split the data into train (80%) and test (20%) instances across MIDI note labels.
▶ Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances.
Non-Parametric Reconstruction

▶ The frame-wise magnitude spectra reconstruction procedure in [Roche et al., 2018] cannot generalize to untrained pitches.
▶ We consider two cases: train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side.

[Block diagram: Sustain → FFT → Encoder → Z → Decoder (AE)]

▶ We repeat the above across all input spectral frames, and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim, 1984], as sketched below.
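A minimal sketch of this resynthesis path, assuming librosa's Griffin-Lim implementation; the FFT and hop sizes here are assumptions, not the values used in the experiments:

```python
# Invert a reconstructed magnitude spectrogram back to audio
# via Griffin-Lim phase estimation.
import librosa

def resynthesize(mag_frames, n_fft=1024, hop=256):
    # mag_frames: (n_fft // 2 + 1, n_frames) magnitude spectrogram,
    # assembled from the autoencoder's frame-wise reconstructions
    return librosa.griffinlim(mag_frames, n_iter=60,
                              hop_length=hop, win_length=n_fft)
```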
Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1)
Figure: Reconstruction including MIDI 63 in training (audio example 2)
Figure: Reconstruction excluding MIDI 63 from training (audio example 3)
Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al., 1997] (currently we neglect the residual).

[Block diagram: Sustain → FFT → HpR → f0, harmonic magnitudes]

▶ Output of the HpR block ⇒ log-dB magnitudes + harmonics
Parametric Model

2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet, 2005; Imai, 1979]

▶ TAE ⇒ iterative cepstral liftering (a Python sketch follows below):

$A_0(k) = \log(|X(k)|), \quad V_0 = -\infty$
while $(A_i - A_0 < \Delta)$:
    $A_i(k) = \max(A_{i-1}(k), V_{i-1}(k))$
    $V_i = \mathrm{FFT}(\mathrm{lifter}(A_i))$

[Block diagram: f0, harmonic magnitudes → TAE → CCs, f0]
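A minimal sketch of the TAE iteration above, assuming numpy; the stopping threshold, iteration cap and argument names are our assumptions, not the authors' settings:

```python
import numpy as np

def true_amplitude_envelope(log_mag, f0, fs, n_fft=1024,
                            delta=0.1, max_iter=200):
    """TAE via iterative cepstral liftering [Imai 1979; Roebel and
    Rodet 2005]. log_mag: one frame's log-magnitude spectrum on the
    rfft grid (length n_fft // 2 + 1)."""
    kcc = int(fs / (2 * f0))      # cepstral order bound: Kcc <= Fs / (2 f0)
    A0 = np.asarray(log_mag, dtype=float)
    A, V = A0.copy(), np.full_like(A0, -np.inf)
    for _ in range(max_iter):
        A = np.maximum(A, V)      # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        ceps = np.fft.irfft(A, n=n_fft)
        ceps[kcc + 1 : n_fft - kcc] = 0.0    # lifter: keep first Kcc coefficients
        V = np.fft.rfft(ceps, n=n_fft).real  # V_i = FFT(lifter(A_i))
        if np.max(A - V) < delta:  # stop once the envelope hugs the spectrum
            break
    return V, ceps[: kcc + 1]     # smooth envelope and its CCs
```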
Parametric Model

▶ No open-source implementation was available for the TAE, so we implemented it ourselves, following the procedure highlighted in [Roebel and Rodet, 2005; Caetano and Rodet, 2012].
▶ The figure shows a TAE snapshot from [Caetano and Rodet, 2012], similar to the results we get (audio examples 4, 5).
Parametric Model

[Plot: TAE spectral envelopes for MIDI 60, 65 and 71; x-axis: Frequency (kHz), 0-5; y-axis: Magnitude (dB), -80 to -20]

▶ The spectral envelope shape varies across pitch:
1. Dependence of the envelope on pitch [Slawson, 1981; Caetano and Rodet, 2012]
2. Variation due to the TAE algorithm itself
▶ TAE → a smooth function to estimate harmonic amplitudes
Parametric Model

▶ The number of CCs (cepstral coefficients, $K_{cc}$) depends on the sampling rate ($F_s$) and the pitch ($f_0$):

$K_{cc} \le \dfrac{F_s}{2 f_0}$

▶ For our network, we choose the maximum $K_{cc}$ (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions.
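For instance, assuming $F_s = 48$ kHz (the sampling rate is our assumption) and $f_0 \approx 262$ Hz (MIDI 60, the lowest pitch in our octave), $K_{cc} \le 48000 / (2 \times 262) \approx 91$, consistent with the stated maximum of 91 coefficients.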
Generative Models

▶ Autoencoders [Hinton and Salakhutdinov, 2006]: minimize the MSE (mean squared error) between the input and the network's reconstruction.
• Simple to train, and perform good reconstruction (since they directly minimize the MSE)
• Not truly generative models, as you cannot generate new data
▶ Variational Autoencoders [Kingma and Welling, 2013]: inspired by variational inference, they enforce a prior on the latent space.
• Truly generative, as we can 'generate' new data by sampling from the prior
▶ Conditional Variational Autoencoders [Doersch, 2016; Sohn et al., 2015]: same principle as a VAE, but they learn the distribution conditioned on an additional conditioning variable.
Generative Models

Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control.
Network Architecture

[Block diagram: CCs, f0 → Encoder → Z → Decoder (CVAE)]

▶ The network input is the CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes.
▶ We train the network on frames from the train instances.
▶ The MSE values reported ahead are the average reconstruction error across all test-instance frames.
Network Architecture

▶ Main hyperparameters:
1. β: controls the relative weighting between reconstruction and prior enforcement,

$\mathcal{L} \propto \text{MSE} + \beta \cdot \text{KLD}$

[Plot: MSE (scale $10^{-5}$) vs latent space dimensionality (0-70) for β = 0.01, 0.1, 1]
Figure: CVAE, varying β

▶ There is a tradeoff between both terms; we choose β = 0.1.
Network Architecture

▶ Main hyperparameters:
2. Dimensionality of the latent space: governs the network's reconstruction ability.

[Plot: MSE (scale $10^{-5}$) vs latent space dimensionality (0-70) for the CVAE and the AE]
Figure: CVAE (β = 0.1) vs AE

▶ The error falls steeply initially and flattens later; we choose dimensionality = 32.
Network Architecture

▶ Network size: [91, 91, 32, 91, 91]
▶ 91 is the dimension of the input CCs; 32 is the latent space dimensionality
▶ Linear fully connected layers + Leaky ReLU activations
▶ ADAM [Kingma and Ba, 2014] with initial learning rate $10^{-3}$
▶ Training for 2000 epochs with batch size 512

A minimal sketch of such a network follows below.
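A hedged PyTorch sketch of a CVAE with these layer sizes and the loss above; the exact conditioning format for f0 (here a scalar appended to the input) is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, x_dim=91, z_dim=32, c_dim=1):
        super().__init__()
        # Encoder: [x + condition] -> 91 -> (mu, logvar) of size 32
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, 91), nn.LeakyReLU())
        self.mu = nn.Linear(91, z_dim)       # posterior mean
        self.logvar = nn.Linear(91, z_dim)   # posterior log-variance
        # Decoder: [z + condition] -> 91 -> 91 CCs
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, 91), nn.LeakyReLU(),
                                 nn.Linear(91, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

def loss_fn(x_hat, x, mu, logvar, beta=0.1):
    # L ∝ MSE + β·KLD, with β = 0.1 as chosen above
    mse = F.mse_loss(x_hat, x)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld
```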
Experiments

▶ Two kinds of experiments to demonstrate the network's capabilities:
1. Reconstruction: omit pitch instances during training, and see how well the model reconstructs notes of the omitted target pitch.
2. Generation: how well the model 'synthesizes' note instances with new, unseen pitches.
Reconstruction

▶ Two training contexts:

1. T (✗) is the target pitch, ✓ marks training instances:
MIDI:  T-3  T-2  T-1  T   T+1  T+2  T+3
Kept:  ✓    ✓    ✓    ✗   ✓    ✓    ✓

2. Octave endpoints:
MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
Kept:  ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

▶ In each case, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances, as sketched below.
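A small sketch of this evaluation metric, assuming the hypothetical CVAE class above: average frame-wise squared error between input and reconstructed CC vectors over the held-out target frames.

```python
import torch

def avg_frame_mse(model, frames, f0s):
    # frames: (n_frames, 91) test CCs; f0s: (n_frames, 1) conditioning
    with torch.no_grad():
        x_hat, _, _ = model(frames, f0s)
        return torch.mean((x_hat - frames) ** 2).item()
```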
Reconstruction

[Plot: MSE (scale $10^{-2}$) vs target pitch (MIDI 60-72) for the AE and the CVAE]

▶ Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8).
▶ The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11).
▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately.
Reconstruction

▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below:

[Block diagram: (CCs, f0) → cAE → (CCs′, f0′)]
Figure: 'Conditional' AE (cAE)

▶ [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model.
▶ The cAE is comparable to our proposed CVAE in that the network might potentially learn something from f0.
Reconstruction

[Plot: MSE (scale $10^{-2}$) vs target pitch (MIDI 60-72) for the AE, CVAE and cAE]

▶ We train the cAE on the octave endpoints and evaluate performance as above.
▶ Appending f0 does not improve performance; it is in fact worse than the plain AE.
Generation

▶ We are interested in 'synthesizing' audio.
▶ We see how well the network can generate an instance of a desired pitch (one the network has not been trained on).

[Block diagram: Z, f0 → Decoder]
Figure: Sampling from the network

▶ We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space, as sketched below.
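A minimal sketch of coherent latent sampling via a random walk, in the spirit of [Blaauw and Bonada, 2016]; the step size and the decoder interface (from the hypothetical CVAE sketch above) are assumptions:

```python
import torch

def random_walk_latents(n_frames, z_dim=32, step=0.05):
    z = torch.zeros(n_frames, z_dim)
    z[0] = torch.randn(z_dim)            # start from the standard-normal prior
    for t in range(1, n_frames):
        # Small Gaussian steps keep consecutive frames coherent
        z[t] = z[t - 1] + step * torch.randn(z_dim)
    return z

# Usage sketch: decode each latent with the desired pitch condition, e.g.
# envelopes = model.dec(torch.cat([z, f0.expand(n_frames, 1)], dim=-1))
```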
Generation

▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
▶ On listening, the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14).
▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15).
Putting it all together

▶ We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones.
▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
▶ Pitch conditioning allows us to generate the learnt spectral envelope for any given pitch, enabling us to vary the pitch contour continuously.
Putting it all together

▶ Our model is still far from perfect, however, and we have to improve it further:
1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural.
2. We have not taken dynamics into account; a more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.

▶ Our contributions:
1. To the best of our knowledge, we have not come across any prior work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
References

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.
[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.
[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.
[Esling et al., 2018] Esling, P., Bitton, A. et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.
[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S. and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.
[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K. and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.
[Sohn et al., 2015] Sohn, K., Lee, H. and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.
[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description

1. Input MIDI 63 to the spectral model
2. Spectral model reconstruction (trained on MIDI 63)
3. Spectral model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the parametric model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
Figure One of the early Moog Modular Synthesizers
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
I More generally it involves us specifying controlling parametersto a synthesizer to obtain an audio output
I What are the main parameters which govern the audiogeneration(at a high level)
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
1 Timbre
2 Pitch
3 Loudness
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it
I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it
I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis
336
Generative Models for Audio Synthesis
I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data
I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods
I Most methods in literature try to model audio signals directlyin the time or frequency domain
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]
X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
Network Architecture

Figure: CVAE block diagram (CCs, f0 → Encoder → Z → Decoder → reconstructed CCs)

I Network input is CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes (see the note below)
I We train the network on frames from the train instances
I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames
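One way to see this (a sketch, assuming the cepstrum is an orthogonal transform - a DFT/DCT up to scaling - of the log-magnitude envelope): by Parseval's theorem,

‖CCs − CCs′‖² ∝ ‖log|V(ω)| − log|V′(ω)|‖²

so minimizing the MSE between cepstral coefficient vectors directly minimizes the squared error between the corresponding log-magnitude spectral envelopes.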
Network Architecture
I Main hyperparameters:
1 β - controls the relative weighting between reconstruction and prior enforcement:

L ∝ MSE + β · KLD

Figure: MSE vs latent space dimensionality for the CVAE, varying β (β = 0.01, 0.1, 1)

I Tradeoff between both terms; we choose β = 0.1
Network Architecture
I Main hyperparameters:
2 Dimensionality of latent space - determines the network's reconstruction ability

Figure: MSE vs latent space dimensionality, CVAE (β = 0.1) vs AE

I Steep fall initially, flatter later; we choose dimensionality = 32
Network Architecture
I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³
I Training for 2000 epochs with batch size 512 (a code sketch follows)
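A minimal sketch of this architecture and loss, assuming PyTorch, a one-hot pitch encoding over the 12 MIDI notes, and the usual Gaussian reparameterization (none of these implementation details are fixed by the slides; all names are ours):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, in_dim=91, hid_dim=91, z_dim=32, n_pitches=12):
        super().__init__()
        # Encoder: (CCs, pitch) -> parameters of the latent posterior
        self.enc = nn.Linear(in_dim + n_pitches, hid_dim)
        self.mu = nn.Linear(hid_dim, z_dim)
        self.logvar = nn.Linear(hid_dim, z_dim)
        # Decoder: (z, pitch) -> reconstructed CCs
        self.dec1 = nn.Linear(z_dim + n_pitches, hid_dim)
        self.dec2 = nn.Linear(hid_dim, in_dim)

    def forward(self, cc, pitch):
        h = F.leaky_relu(self.enc(torch.cat([cc, pitch], dim=1)))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        h = F.leaky_relu(self.dec1(torch.cat([z, pitch], dim=1)))
        return self.dec2(h), mu, logvar

def cvae_loss(cc_hat, cc, mu, logvar, beta=0.1):
    mse = F.mse_loss(cc_hat, cc)                                    # reconstruction
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # prior enforcement
    return mse + beta * kld                                         # L ∝ MSE + β·KLD

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # 2000 epochs, batch size 512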
Experiments
I Two kinds of experiments to demonstrate the network's capabilities:
1 Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch
2 Generation - how well the model 'synthesizes' note instances with new, unseen pitches
Reconstruction
I Two training contexts (sketched in code after this list):
1 Neighbouring pitches - T is the target (✗ = omitted from training, ✓ = kept):

MIDI: T−3  T−2  T−1  T  T+1  T+2  T+3
Kept:  ✓    ✓    ✓   ✗   ✓    ✓    ✓

2 Octave endpoints:

MIDI: 60  61  62  63  64  65  66  67  68  69  70  71
Kept:  ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
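As an illustration, the kept pitches in each context can be written as (a hypothetical helper; the function name and encoding are ours):

def training_midis(context, target=63):
    octave = range(60, 72)                      # MIDI 60-71
    if context == "neighbours":                 # context 1: keep T±1, T±2, T±3; omit T
        return [m for m in octave if 0 < abs(m - target) <= 3]
    if context == "endpoints":                  # context 2: keep only the octave endpoints
        return [60, 71]
    raise ValueError(f"unknown context: {context}")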
Reconstruction

Figure: MSE vs target pitch (MIDI 60-72) for the AE and the CVAE

I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6-8)
I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11)
I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Figure: 'Conditional' AE (cAE) - (CCs, f0) → cAE → (CCs′, f0′)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a code sketch of this baseline follows)
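A corresponding sketch of this baseline, under the same assumed PyTorch setup as the CVAE sketch above (reading f0 as a single normalized scalar appended to the 91 CCs):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalAE(nn.Module):
    def __init__(self, in_dim=91, hid_dim=91, z_dim=32):
        super().__init__()
        self.enc1 = nn.Linear(in_dim + 1, hid_dim)  # input: CCs with f0 appended
        self.enc2 = nn.Linear(hid_dim, z_dim)
        self.dec1 = nn.Linear(z_dim, hid_dim)
        self.dec2 = nn.Linear(hid_dim, in_dim + 1)  # reconstructs (CCs', f0')

    def forward(self, cc, f0):                      # f0: shape (batch, 1)
        x = torch.cat([cc, f0], dim=1)              # append pitch to the input
        z = self.enc2(F.leaky_relu(self.enc1(x)))
        return self.dec2(F.leaky_relu(self.dec1(z)))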
Reconstruction

Figure: MSE vs target pitch for the AE, CVAE and cAE

I We train the cAE on the octave endpoints and evaluate performance as shown above
I Appending f0 does not improve performance; it is in fact worse than the plain AE
Generation
I We are interested in 'synthesizing' audio
I We see how well the network can generate an instance of a desired pitch (one the network has not been trained on)

Figure: Sampling from the network (Z, f0 → Decoder)

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space, as sketched below
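A sketch of this sampling procedure (the step size and the N(0, I) starting point are assumptions; names are ours):

import torch

def latent_random_walk(n_frames, z_dim=32, step=0.05):
    zs = [torch.randn(z_dim)]                        # start from the N(0, I) prior
    for _ in range(n_frames - 1):
        zs.append(zs[-1] + step * torch.randn(z_dim))  # small steps keep frames coherent
    return torch.stack(zs)                           # one latent point per synthesis frame

Each sampled point, together with the desired f0, is passed through the decoder to produce a frame-wise spectral envelope, which is then rendered with the harmonic model.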
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12-14)
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
Putting it all together
I We explored autoencoder frameworks as generative models for the audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously
Putting it all together
I Our model is still far from perfect, and we have to improve it further:
1 An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural
2 We have not taken dynamics into account; a more complete system will capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3 We plan more formal listening tests, involving synthesis of larger pitch movements and melodic elements from Indian Classical music
I Our contributions:
1 To the best of our knowledge, no prior work uses a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research
References

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description
1 Input MIDI 63 to Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE generated MIDI 65 violin note
13 Similar MIDI 65 violin note from the dataset
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE generated MIDI 65 violin note with vibrato
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
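For contribution 2, a minimal NumPy sketch of the TAE iteration (iterative cepstral liftering, after [Roebel and Rodet, 2005] and [Imai, 1979]): starting from the log-magnitude spectrum, spectral valleys are repeatedly filled with the current cepstrally smoothed envelope until the envelope passes within Δ dB of every peak. Function and parameter names here are illustrative, not the released API.

import numpy as np

def true_amplitude_envelope(spectrum, f0, fs, delta_db=2.0, max_iter=100):
    # Illustrative sketch. spectrum: one full-length complex FFT frame;
    # f0 and fs set the lifter order Kcc <= fs / (2 * f0), the maximum
    # usable number of cepstral coefficients for that pitch.
    A0 = 20 * np.log10(np.abs(spectrum) + 1e-12)   # log-magnitude in dB
    N = len(A0)
    Kcc = int(fs / (2 * f0))
    A = A0.copy()
    V = np.full(N, -np.inf)
    for _ in range(max_iter):
        A = np.maximum(A, V)            # fill valleys with current envelope
        c = np.fft.ifft(A).real         # real cepstrum of the updated curve
        c[Kcc + 1 : N - Kcc] = 0.0      # low-quefrency lifter: keep Kcc coeffs
        V = np.fft.fft(c).real          # smoothed envelope estimate
        if np.max(A0 - V) < delta_db:   # envelope within delta of all peaks
            break
    return V, c[:Kcc + 1]               # dB envelope + lifted cepstral coeffs

The retained cepstral coefficients, zero-padded to 91 dimensions, form the network input; sampling the envelope V at the harmonic frequencies recovers harmonic amplitudes for synthesis.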
References I
[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset
Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it
I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it
I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis
336
Generative Models for Audio Synthesis
I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data
I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods
I Most methods in literature try to model audio signals directlyin the time or frequency domain
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
Our Nearest Neighbours
▶ A major issue with the previous methods was their frame-wise analysis-synthesis based reconstruction, which fails to model temporal evolution
▶ [Engel et al 2017], inspired by WaveNet's [Oord et al 2016] autoregressive modeling capabilities for speech, extended it to musical instrument synthesis
  ✓ Generation of realistic and creative sounds
  ✗ Complex network that was difficult to train and sample from
▶ [Wyse 2018] also modelled the audio autoregressively, albeit by conditioning the waveform samples on additional parameters like pitch, velocity (loudness) and instrument class
  ✓ Achieved better control over generation through conditioning
  ✗ Unable to generalize to untrained pitches
Why Parametric?
▶ Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesizing a given instrument sound with flexible control over the pitch and loudness dynamics
▶ Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet 2005]
▶ A powerful parametric representation, over raw waveform or spectrogram, has the potential to achieve high quality with less training data
  1. [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results
Dataset
▶ Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments
▶ We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)
  Why we chose the Violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch
Figure: Instances per note in the overall dataset (roughly 26-34 instances per MIDI note)
▶ Average note duration for the chosen octave is about 4.5 s per note
Dataset
▶ We split the data into train (80%) and test (20%) instances across MIDI note labels
▶ Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances
Non-Parametric Reconstruction
▶ The framewise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches
▶ We consider two cases - training including and excluding MIDI 63, along with its 3 neighbouring pitches on either side
Figure: Block diagram - sustain segment → FFT → Encoder → Z → Decoder (AE)
▶ Repeat the above across all input spectral frames, and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
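As a rough sketch of this analysis-inversion loop (not the authors' code; the STFT parameters and file name below are assumptions), librosa's Griffin-Lim implementation can stand in for the inversion step:

```python
import numpy as np
import librosa

N_FFT, HOP = 1024, 256  # assumed STFT parameters

y, sr = librosa.load("violin_note.wav", sr=None)          # hypothetical input file
S = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))  # magnitude spectrogram

# Frame-wise encode/decode with the autoencoder would go here;
# S_hat stands in for the network's reconstructed magnitude frames.
S_hat = S

# Griffin-Lim iteratively estimates a phase consistent with S_hat
y_hat = librosa.griffinlim(S_hat, n_iter=60, hop_length=HOP, win_length=N_FFT)
```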
Non-Parametric Reconstruction
Figure: Input MIDI 63 (audio example 1)
Figure: Including MIDI 63 (audio example 2)    Figure: Excluding MIDI 63 (audio example 3)
Parametric Model
1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)
Figure: Block diagram - sustain segment → FFT → HpR → (f0, harmonic magnitudes)
▶ Output of the HpR block ⇒ log-dB magnitudes + harmonics
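The HpR analysis itself follows [Serra et al 1997]; below is a rough NumPy sketch of the harmonic part only (nearest-bin peak picking at integer multiples of a known f0 — the window, FFT size and sampling rate are assumptions, not the authors' implementation):

```python
import numpy as np
from scipy.signal import get_window

FS, N_FFT = 48_000, 1024  # assumed analysis parameters

def harmonic_magnitudes(frame, f0, n_harm=40):
    """Return log-dB magnitudes at the harmonics of f0 (sketch)."""
    w = get_window("blackmanharris", len(frame))
    spec = np.abs(np.fft.rfft(frame * w, n=N_FFT))
    freqs = np.fft.rfftfreq(N_FFT, 1 / FS)
    mags = []
    for h in range(1, n_harm + 1):
        target = h * f0
        if target > FS / 2:
            break
        idx = np.argmin(np.abs(freqs - target))       # nearest FFT bin to h*f0
        mags.append(20 * np.log10(spec[idx] + 1e-12))  # log-dB magnitude
    return np.array(mags)  # input to the TAE step that follows
```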
Parametric Model
2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]
▶ TAE ⇒ iterative cepstral liftering:

    A_0(k) = log|X(k)|,  V_0 = −∞
    while (A_i − A_0 < ∆):
        A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
        V_i = FFT(lifter(A_i))

Figure: Block diagram - (f0, magnitudes) → TAE → (CCs, f0)
Parametric Model
▶ No open-source implementation was available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]
▶ The figure shows a TAE snapshot from [Caetano and Rodet 2012]
▶ Similar to the results we get (audio examples 4, 5)
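Since the slides only summarize the iteration, here is a minimal NumPy sketch of the cepstral liftering loop as we understand it; the default threshold, iteration cap, and the exact form of the stopping test are assumptions:

```python
import numpy as np

def true_amplitude_envelope(log_mag, k_cc, delta=2.0, max_iter=100):
    """Sketch of TAE iterative cepstral liftering.

    log_mag : log-magnitude half spectrum on a uniform frequency grid, shape (N,)
    k_cc    : number of cepstral coefficients kept by the lifter
    delta   : stopping threshold (assumed value, in the units of log_mag)
    """
    A0 = np.asarray(log_mag, dtype=float)
    A = A0.copy()
    V = np.full_like(A, -np.inf)                 # V_0 = -inf
    for _ in range(max_iter):
        A = np.maximum(A, V)                     # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        full = np.concatenate([A, A[-2:0:-1]])   # even symmetry -> real cepstrum
        ceps = np.fft.ifft(full).real
        ceps[k_cc:-k_cc] = 0.0                   # lifter: keep the low-quefrency part
        V = np.fft.fft(ceps).real[: len(A)]      # V_i = FFT(lifter(A_i))
        if np.max(A0 - V) < delta:               # envelope covers the peaks (assumed test)
            break
    return V
```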
Parametric Model
Figure: TAE spectral envelopes for MIDI 60, 65 and 71 (magnitude in dB vs frequency in kHz)
▶ Spectral envelope shape varies across pitch:
  1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
  2. Variation due to the TAE algorithm
▶ TAE → a smooth function to estimate harmonic amplitudes
Parametric Model
▶ The number of CCs (cepstral coefficients, Kcc) depends on the sampling rate (Fs) and the pitch (f0):

    Kcc ≤ Fs / (2 f0)

▶ For our network, we choose the maximum Kcc (lowest pitch) = 91 and zero-pad the higher-pitch CCs to 91-dimensional vectors
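As a worked check of this bound (the 48 kHz sampling rate is an assumption, but it is the value consistent with a maximum Kcc of 91 at MIDI 60):

```python
import numpy as np

FS = 48_000  # assumed; gives K_cc = 91 at MIDI 60, matching the slide

def midi_to_hz(m):
    return 440.0 * 2 ** ((m - 69) / 12)

def max_ccs(f0, fs=FS):
    return int(np.floor(fs / (2 * f0)))   # K_cc <= Fs / (2 f0)

for midi in (60, 65, 71):
    f0 = midi_to_hz(midi)
    cc = np.zeros(91)                     # zero-padded 91-dim CC vector
    # cc[:max_ccs(f0)] would hold the actual coefficients for this pitch
    print(midi, round(f0, 2), max_ccs(f0))  # 60 -> 91, 65 -> 68, 71 -> 48
```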
Generative Models
▶ Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (mean squared error) between the input and the network's reconstruction
  • Simple to train, and perform good reconstruction (since they minimize the MSE)
  • Not truly a generative model, as you cannot generate new data
▶ Variational Autoencoders [Kingma and Welling 2013] - inspired by variational inference; enforce a prior on the latent space
  • Truly generative, as we can 'generate' new data by sampling from the prior
▶ Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, but learn the distribution conditioned on an additional conditioning variable
Generative Models
Why VAE over AE?
  A continuous latent space from which we can sample points (and synthesize the corresponding audio)
Why CVAE over VAE?
  Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control
Network Architecture
Figure: Block diagram - (CCs, f0) → Encoder → Z → Decoder (CVAE)
▶ The network input is the CCs → the MSE represents a perceptually relevant distance, in terms of the squared error between the input and reconstructed log-magnitude spectral envelopes
▶ We train the network on frames from the train instances
▶ For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames
Network Architecture
▶ Main hyperparameters -
  1. β - controls the relative weighting between reconstruction and prior enforcement:

     L ∝ MSE + β · KLD

Figure: CVAE, varying β (MSE vs latent space dimensionality, for β = 0.01, 0.1, 1)
▶ Tradeoff between both terms; we choose β = 0.1
Network Architecture
▶ Main hyperparameters -
  2. Dimensionality of the latent space - governs the network's reconstruction ability
Figure: CVAE (β = 0.1) vs AE (MSE vs latent space dimensionality)
▶ Steep fall initially, flatter later; we choose dimensionality = 32
Network Architecture
▶ Network size: [91, 91, 32, 91, 91]
  ▶ 91 is the dimension of the input CCs; 32 is the latent space dimensionality
▶ Linear fully connected layers + Leaky ReLU activations
▶ ADAM [Kingma and Ba 2014] with initial lr = 10^−3
▶ Training for 2000 epochs with batch size 512
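A minimal PyTorch sketch consistent with these hyperparameters; how f0 is injected (here, concatenated to both encoder and decoder inputs) is an assumption, while the loss follows the β-weighted form above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """[91, 91, 32, 91, 91] CVAE sketch; f0 conditioning by concatenation is assumed."""
    def __init__(self, in_dim=91, hid=91, z_dim=32, cond_dim=1):
        super().__init__()
        self.enc = nn.Linear(in_dim + cond_dim, hid)
        self.mu = nn.Linear(hid, z_dim)
        self.logvar = nn.Linear(hid, z_dim)
        self.dec1 = nn.Linear(z_dim + cond_dim, hid)
        self.dec2 = nn.Linear(hid, in_dim)

    def decode(self, z, f0):
        h = F.leaky_relu(self.dec1(torch.cat([z, f0], dim=-1)))
        return self.dec2(h)

    def forward(self, x, f0):
        h = F.leaky_relu(self.enc(torch.cat([x, f0], dim=-1)))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decode(z, f0), mu, logvar

def loss_fn(x_hat, x, mu, logvar, beta=0.1):
    mse = F.mse_loss(x_hat, x)                                       # reconstruction
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # prior (KLD)
    return mse + beta * kld                                          # L ∝ MSE + β·KLD

model = CVAE()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr = 10^-3
```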
Experiments
▶ Two kinds of experiments to demonstrate the network's capabilities:
  1. Reconstruction - omit pitch instances during training, and see how well the model reconstructs notes of the omitted target pitch
  2. Generation - how well the model 'synthesizes' note instances of new, unseen pitches
Reconstruction
▶ Two training contexts -
  1. T (✗) is the target; ✓ marks training instances:

     MIDI   T−3  T−2  T−1   T   T+1  T+2  T+3
     Kept    ✓    ✓    ✓    ✗    ✓    ✓    ✓

  2. Octave endpoints:

     MIDI   60  61  62  63  64  65  66  67  68  69  70  71
     Kept    ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

▶ In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
Reconstruction
Figure: MSE vs target pitch, for the AE and the CVAE
▶ Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
▶ The CVAE produces a better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction
▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
Figure: 'Conditional' AE (cAE) - (CCs, f0) → cAE → (CCs′, f0′)
▶ [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
▶ The cAE is comparable to our proposed CVAE, in that the network might potentially learn something from f0
Reconstruction
Figure: MSE vs target pitch, for the AE, CVAE and cAE
▶ We train the cAE on the octave endpoints and evaluate its performance as shown above
▶ Appending f0 does not improve performance; it is in fact worse than the AE
Generation
▶ We are interested in 'synthesizing' audio
▶ We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)
Figure: Sampling from the network - Z → Decoder, conditioned on f0
▶ We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
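A minimal sketch of this sampling step, reusing the decode() method of the CVAE sketch above; the step size of the walk is an assumption:

```python
import torch

@torch.no_grad()
def random_walk_generate(model, f0_frames, z_dim=32, step=0.05):
    """Decode a latent random walk into frame-wise CCs (step size assumed)."""
    z = torch.randn(1, z_dim)                  # start from the prior N(0, I)
    frames = []
    for f0 in f0_frames:                       # f0_frames: (T, 1) per-frame pitch
        z = z + step * torch.randn_like(z)     # small steps -> coherent trajectory
        frames.append(model.decode(z, f0.view(1, -1)))
    return torch.cat(frames)                   # (T, 91) cepstral coefficient frames
```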
Generation
▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
▶ On listening, we find the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)
▶ For a more realistic synthesis, we generate a violin note with vibrato, as sketched below (audio example 15)
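Generating vibrato amounts to decoding along a modulated pitch contour; a small sketch (the frame rate, vibrato rate and depth below are all assumed values, not from the slides):

```python
import numpy as np

def vibrato_f0(f0_hz, dur_s, frame_rate=187.5, rate_hz=6.0, depth_cents=30.0):
    """Per-frame f0 contour with sinusoidal vibrato (all defaults assumed)."""
    t = np.arange(int(dur_s * frame_rate)) / frame_rate
    cents = depth_cents * np.sin(2 * np.pi * rate_hz * t)
    return f0_hz * 2.0 ** (cents / 1200.0)   # cents -> frequency ratio

# e.g. a 2 s MIDI 65 note; this contour would condition the decoder frame by frame
contour = vibrato_f0(349.23, 2.0)
```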
Putting it all together
▶ We explored autoencoder frameworks in generative models for the audio synthesis of instrumental tones
▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
▶ Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously
Putting it all together
▶ Our model is still far from perfect, however, and we have to improve it further:
  1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural
  2. We have not taken dynamics into account; a more complete system will capture the spectral envelope's (timbral) dependence on both pitch and loudness dynamics
  3. Conducting more formal listening tests, involving the synthesis of larger pitch movements or melodic elements from Indian Classical music
▶ Our contributions:
  1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
  2. The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR research community
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

References III
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1. Input MIDI 63 to the spectral model
2. Spectral model reconstruction (trained on MIDI 63)
3. Spectral model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the parametric model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset

Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
236
Audio Synthesis
I What comes to your mind when you hear lsquoAudio Synthesisrsquo
I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it
I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis
336
Generative Models for Audio Synthesis
I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data
I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods
I Most methods in literature try to model audio signals directlyin the time or frequency domain
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]
X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
17/36
Generative Models

Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control.
18/36
Network Architecture

[Block diagram of the CVAE: (CCs, f0) → Encoder → Z → Decoder (conditioned on f0) → CCs]

▶ Network input is CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes
▶ We train the network on frames from the train instances
▶ For evaluation, the MSE values reported ahead are the average reconstruction error across all test-instance frames
19/36
Network Architecture

▶ Main hyperparameters:
1. β - controls the relative weighting between reconstruction and prior enforcement (written out in code below):

    L ∝ MSE + β · KLD

[Figure: CVAE, varying β: MSE (×10⁻⁵) vs latent space dimensionality for β = 0.01, 0.1, 1]

▶ Tradeoff between both terms; we choose β = 0.1
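As a sketch, the weighted objective above can be written out for a Gaussian encoder posterior q(z|x) = N(mu, sigma²) against a standard-normal prior; this is the usual β-weighted VAE form, shown in PyTorch with names chosen for illustration rather than taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    """L ∝ MSE + β·KLD for a Gaussian posterior and a N(0, I) prior."""
    mse = F.mse_loss(x_hat, x, reduction="mean")
    # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld
```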
20/36
Network Architecture

▶ Main hyperparameters:
2. Dimensionality of the latent space - governs the network's reconstruction ability

[Figure: CVAE (β = 0.1) vs AE: MSE (×10⁻⁵) vs latent space dimensionality]

▶ Steep fall initially, flatter later; we choose dimensionality = 32
21/36
Network Architecture

▶ Network size: [91, 91, 32, 91, 91] (sketched below)
  • 91 is the dimension of the input CCs; 32 is the latent space dimensionality
▶ Linear fully connected layers + Leaky ReLU activations
▶ ADAM [Kingma and Ba 2014] with initial lr = 10⁻³
▶ Training for 2000 epochs with batch size 512
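A hedged PyTorch sketch consistent with the sizes quoted above (layer widths [91, 91, 32, 91, 91], linear layers with Leaky ReLU, ADAM at lr = 10⁻³): the pitch label is concatenated to both the encoder and decoder inputs. How f0 is encoded (here, a single scalar feature) is our assumption; the slide does not specify it.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim=91, cond_dim=1, hidden=91, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + cond_dim, hidden), nn.LeakyReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + cond_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x, f0):
        h = self.enc(torch.cat([x, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr = 1e-3
```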
22/36
Experiments

▶ Two kinds of experiments to demonstrate the network's capabilities:
1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch
2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches
23/36
Reconstruction

▶ Two training contexts:

1. T (✗) is the target; ✓ marks training instances:

   MIDI | T−3 | T−2 | T−1 |  T  | T+1 | T+2 | T+3
   Kept |  ✓  |  ✓  |  ✓  |  ✗  |  ✓  |  ✓  |  ✓

2. Octave endpoints:

   MIDI | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71
   Kept |  ✓ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✓

▶ In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
24/36
Reconstruction

[Figure: reconstruction MSE (×10⁻²) vs target pitch (MIDI 60-72) for AE and CVAE]

▶ Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
▶ The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
25/36
Reconstruction

▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below (a code sketch follows)

[Block diagram: (f0, CCs) → cAE → (f0′, CCs′)]
Figure: 'Conditional' AE (cAE)

▶ [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
▶ The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
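A sketch of this cAE baseline under the same assumptions as the CVAE sketch earlier (layer widths and the scalar f0 feature are ours): f0 is appended to the CC vector and a plain autoencoder reconstructs the appended 92-dimensional input.

```python
import torch
import torch.nn as nn

class cAE(nn.Module):
    def __init__(self, x_dim=91, hidden=91, z_dim=32):
        super().__init__()
        d = x_dim + 1  # CCs with f0 appended
        self.enc = nn.Sequential(nn.Linear(d, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, d))

    def forward(self, ccs, f0):
        x = torch.cat([ccs, f0], dim=-1)  # [CCs, f0] -> reconstruct [CCs', f0']
        return self.dec(self.enc(x)), x
```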
26/36
Reconstruction

[Figure: reconstruction MSE (×10⁻²) vs pitch for AE, CVAE and cAE]

▶ We train the cAE on the octave endpoints and evaluate performance as shown above
▶ Appending f0 does not improve performance; it is in fact worse than the plain AE
27/36
Generation

▶ We are interested in 'synthesizing' audio
▶ We see how well the network can generate an instance of a desired pitch (one the network has not been trained on)

[Block diagram: Z → Decoder (conditioned on f0) → CCs]
Figure: Sampling from the network

▶ We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (a sketch follows below)
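A sketch of the latent-space random walk as we understand [Blaauw and Bonada 2016]: consecutive latent points differ by small Gaussian steps, so successive decoded envelopes vary smoothly across frames. The step size and the conditioning scaling are assumptions for illustration.

```python
import torch

def random_walk_latents(n_frames, z_dim=32, step=0.1):
    """Sample a smooth trajectory of latent points for frame-wise decoding."""
    z0 = torch.randn(z_dim)                  # start from the standard-normal prior
    steps = step * torch.randn(n_frames, z_dim)
    return z0 + torch.cumsum(steps, dim=0)   # small correlated steps per frame

# Illustrative use with the CVAE sketch above:
# zs = random_walk_latents(200)
# f0 = torch.full((200, 1), 65.0 / 127)      # conditioning label for MIDI 65 (assumed scaling)
# envelopes = model.dec(torch.cat([zs, f0], dim=-1))
```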
28/36
Generation

▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
▶ On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)
▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15; a resynthesis sketch follows below)
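For illustration, a numpy sketch of one way to resynthesize a vibrato note from a decoded envelope: sample the envelope at harmonics of a vibrato-modulated f0 (MIDI 65 ≈ 349 Hz) and sum a bank of sinusoids. The vibrato rate and depth, the envelope's linear-frequency grid, and the 48 kHz rate are all assumptions, not the authors' settings.

```python
import numpy as np

def synthesize(envelope_db, f0=349.23, fs=48000, dur=1.0, vib_rate=5.0, vib_cents=20.0):
    """Additive resynthesis of a vibrato note from a log-dB spectral envelope."""
    t = np.arange(int(dur * fs)) / fs
    f_inst = f0 * 2 ** (vib_cents / 1200 * np.sin(2 * np.pi * vib_rate * t))
    phase = 2 * np.pi * np.cumsum(f_inst) / fs           # instantaneous phase
    n_harm = int((fs / 2) // (f0 * 2 ** (vib_cents / 1200)))  # stay below Nyquist
    grid = np.linspace(0, fs / 2, len(envelope_db))      # assumed envelope frequency grid
    out = np.zeros_like(t)
    for h in range(1, n_harm + 1):
        amp_db = np.interp(h * f0, grid, envelope_db)    # envelope value at harmonic h
        out += 10 ** (amp_db / 20) * np.sin(h * phase)
    return out / np.max(np.abs(out))
```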
29/36
Putting it all together

▶ We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones
▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
▶ Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously
30/36
Putting it all together

▶ Our model is still far from perfect, however, and we have to improve it further:
1. An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural
2. We have not taken dynamics into account; a more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3. Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

▶ Our contributions:
1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research
31/36
References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
32/36
References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
33/36
References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
34/36
References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
35/36
Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset

36/36
Audio examples description II

14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
336
Generative Models for Audio Synthesis
I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data
I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods
I Most methods in literature try to model audio signals directlyin the time or frequency domain
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]
X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural.
2 We have not taken dynamics into account; a more complete system will capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.
I Our contributions:
1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research (a sketch of the core iteration is given below).
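For reference, a minimal NumPy sketch of the TAE iteration described earlier in the talk (iterative cepstral liftering, after [Roebel and Rodet 2005, IMAI 1979]). The rfft conventions, stopping tolerance, and iteration cap are our assumptions here, not necessarily the exact code we will release.

import numpy as np

def true_amplitude_envelope(log_mag, kcc, n_iter=100, tol=0.1):
    # log_mag: log-magnitude spectrum of one frame (rfft bins 0..N/2)
    # kcc: cepstral order, i.e. number of coefficients kept by the lifter
    A = log_mag.copy()
    V = np.full_like(A, -np.inf)              # V_0 = -inf
    for _ in range(n_iter):
        A = np.maximum(A, V)                  # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        c = np.fft.irfft(A)                   # real, even-symmetric cepstrum
        c[kcc + 1 : len(c) - kcc] = 0.0       # keep quefrencies |q| <= kcc
        V = np.fft.rfft(c).real               # V_i = liftered (smoothed) envelope
        if np.max(log_mag - V) < tol:         # stop once V covers every harmonic peak
            break
    return V, c[: kcc + 1]                    # envelope and the CCs fed to the network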
31/36
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
32/36
References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
33/36
References III
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
34/36
References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
35/36
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset
36/36
Audio examples description II
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]
X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours

▶ A major issue with the previous methods was frame-wise analysis-synthesis based reconstruction, which fails to model temporal evolution
▶ [Engel et al 2017], inspired by WaveNet's [Oord et al 2016] autoregressive modeling capabilities for speech, extended it to musical instrument synthesis
  ✓ Generation of realistic and creative sounds
  × Complex network that was difficult to train and sample from
▶ [Wyse 2018] also modelled the audio autoregressively, albeit by conditioning the waveform samples on additional parameters like pitch, velocity (loudness) and instrument class
  ✓ Achieved better control over generation through conditioning
  × Unable to generalize to untrained pitches
Why Parametric?

▶ Rather than generating new timbres (‘interpolating’ across sounds), we consider the problem of synthesizing a given instrument sound with flexible control over the pitch and loudness dynamics
▶ Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet 2005]
▶ A parametric representation, rather than a raw waveform or spectrogram, has the potential to achieve high quality with less training data
  1. [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results
Dataset

▶ Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments
▶ We work with the ‘Violin’, played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)
▶ Why we chose the Violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch

Figure: Instances per note in the overall dataset (between 26 and 34 instances per MIDI note)

▶ The average note duration for the chosen octave is about 4.5 s per note
Dataset

▶ We split the data into train (80%) and test (20%) instances across MIDI note labels
▶ Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances
Non-Parametric Reconstruction

▶ The framewise magnitude-spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches
▶ We consider two cases - train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side

Figure: Block diagram - sustain segment → FFT → Encoder → Z → Decoder (AE)

▶ Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984] (sketched below)
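To make this pipeline concrete, a minimal sketch of frame-wise reconstruction followed by Griffin-Lim inversion, assuming librosa; `ae_frame_fn` is a hypothetical stand-in for the trained autoencoder, and the FFT sizes are illustrative:

    import numpy as np
    import librosa

    def ae_resynthesize(ae_frame_fn, y, n_fft=1024, hop=256):
        """Frame-wise magnitude reconstruction + Griffin-Lim phase estimation."""
        S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))   # (bins, frames)
        S_hat = np.stack([ae_frame_fn(f) for f in S.T], axis=1)    # per-frame AE pass
        # Griffin-Lim iteratively estimates a phase consistent with |S_hat|
        return librosa.griffinlim(S_hat, n_iter=60, hop_length=hop,
                                  win_length=n_fft)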
Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1)
Figure: Trained including MIDI 63 (audio example 2)    Figure: Trained excluding MIDI 63 (audio example 3)
Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)

Figure: Block diagram - sustain segment → FFT → HpR → f0 + magnitudes

▶ Output of the HpR block ⇒ log-dB magnitudes + harmonics (a rough sketch of this step follows)
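For intuition, a rough sketch of the harmonic step: given a frame's magnitude spectrum and its f0, read off per-harmonic log-dB magnitudes at the nearest FFT bins. Real HpR analysis uses peak picking and interpolation [Serra et al 1997]; `fs` and `n_fft` here are illustrative values:

    import numpy as np

    def harmonic_log_mags(mag_spectrum, f0, fs=48000.0, n_fft=1024):
        """mag_spectrum: rfft magnitude frame of length n_fft//2 + 1."""
        bin_hz = fs / n_fft
        n_harm = int((fs / 2) // f0)                 # harmonics below Nyquist
        bins = np.round(np.arange(1, n_harm + 1) * f0 / bin_hz).astype(int)
        return 20.0 * np.log10(mag_spectrum[bins] + 1e-12)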
Parametric Model

2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]
▶ TAE ⇒ iterative cepstral liftering:

    A_0(k) = log(|X(k)|),  V_0 = −∞
    while (A_i − A_0 < ∆):
        A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
        V_i = FFT(lifter(A_i))

Figure: Block diagram - f0 + magnitudes → TAE → CCs + f0
Parametric Model

▶ No open-source implementation was available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]; a sketch is given below
▶ The figure shows a TAE snapshot from [Caetano and Rodet 2012], similar to the results we get (audio examples 4, 5)
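A minimal NumPy/SciPy sketch of the liftering loop above; the DCT-based lifter and the exact stopping rule are illustrative choices here, not necessarily those of our implementation:

    import numpy as np
    from scipy.fft import dct, idct

    def true_amplitude_envelope(log_mag, k_cc, delta=0.1, max_iter=200):
        """Grow a smooth envelope V until it sits within `delta` dB
        above every point of the input log-magnitude spectrum."""
        A0 = np.asarray(log_mag, dtype=float)
        A = A0.copy()
        V = np.full_like(A0, -np.inf)            # V_0 = -infinity
        for _ in range(max_iter):
            A = np.maximum(A, V)                 # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
            c = dct(A, type=2, norm='ortho')     # cepstrum of the log spectrum
            c[k_cc:] = 0.0                       # lifter: keep first k_cc coefficients
            V = idct(c, type=2, norm='ortho')    # smoothed envelope V_i
            if np.max(A0 - V) < delta:           # envelope now covers the harmonics
                break
        return V, c[:k_cc]                       # envelope and its CCs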
Parametric Model

Figure: TAE spectral envelopes, magnitude (dB, −80 to −20) vs frequency (0-5 kHz), for MIDI 60, 65 and 71

▶ The spectral envelope shape varies across pitch:
  1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
  2. Variation due to the TAE algorithm
▶ TAE → a smooth function to estimate harmonic amplitudes
Parametric Model

▶ The number of CCs (cepstral coefficients, Kcc) depends on the sampling rate (Fs) and the pitch (f0):

    Kcc ≤ Fs / (2 f0)

▶ For our network, we choose the maximum Kcc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91-dimensional vectors (see the sketch below)
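A small sketch of this sizing rule; Fs = 48 kHz and the standard MIDI-to-Hz mapping are assumptions made here for illustration (they make the lowest pitch, MIDI 60 ≈ 261.6 Hz, give Kcc = 91):

    import numpy as np

    def midi_to_hz(m):
        return 440.0 * 2.0 ** ((m - 69) / 12.0)

    def n_ccs(f0, fs=48000.0):
        return int(fs / (2.0 * f0))              # Kcc <= Fs / (2 f0)

    K_MAX = n_ccs(midi_to_hz(60))                # 91 for the lowest pitch

    def pad_ccs(ccs, k_max=K_MAX):
        """Zero-pad a shorter (higher-pitch) CC vector to the fixed size."""
        out = np.zeros(k_max)
        out[:len(ccs)] = ccs
        return out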
Generative Models

▶ Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (mean squared error) between the input and the network reconstruction
  • Simple to train, and perform good reconstruction (since they minimize the MSE)
  • Not truly a generative model, as you cannot generate new data
▶ Variational Autoencoders [Kingma and Welling 2013] - inspired by variational inference; enforce a prior on the latent space
  • Truly generative, as we can ‘generate’ new data by sampling from the prior
▶ Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, but learn the conditional distribution over an additional conditioning variable
Generative Models

Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control
Network Architecture

Figure: Block diagram - CCs + f0 → Encoder → Z → Decoder (CVAE)

▶ The network input is CCs → the MSE represents a perceptually relevant distance, in terms of the squared error between the input and reconstructed log-magnitude spectral envelopes
▶ We train the network on frames from the train instances
▶ For evaluation, the MSE values reported ahead are the average reconstruction error across all the test-instance frames
Network Architecture

▶ Main hyperparameters -
  1. β - controls the relative weighting between reconstruction and prior enforcement:

    L ∝ MSE + β · KLD

Figure: MSE (·10^-5) vs latent space dimensionality for the CVAE, varying β ∈ {0.01, 0.1, 1}

▶ There is a tradeoff between the two terms; we choose β = 0.1 (a sketch of this objective follows)
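A minimal PyTorch sketch of this β-weighted objective, assuming the usual diagonal-Gaussian encoder with the closed-form KL term; the function and argument names are illustrative:

    import torch
    import torch.nn.functional as F

    def cvae_loss(x, x_hat, mu, logvar, beta=0.1):
        """L ∝ MSE + beta * KLD."""
        mse = F.mse_loss(x_hat, x, reduction='mean')
        # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
        kld = -0.5 * torch.mean(
            torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
        return mse + beta * kld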
Network Architecture

▶ Main hyperparameters -
  2. Dimensionality of the latent space - governs the network's reconstruction ability

Figure: MSE (·10^-5) vs latent space dimensionality, CVAE (β = 0.1) vs AE

▶ Steep fall initially, flatter later; we choose dimensionality = 32
Network Architecture

▶ Network size: [91, 91, 32, 91, 91]
  ▶ 91 is the dimension of the input CCs; 32 is the latent space dimensionality
▶ Linear fully connected layers + Leaky ReLU activations
▶ ADAM [Kingma and Ba 2014] with initial lr = 10^-3
▶ Training for 2000 epochs with batch size 512 (a sketch of such a network follows)
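A sketch of a CVAE matching the stated sizes and optimizer settings; concatenating the (normalized) f0 to both encoder and decoder inputs is the standard CVAE recipe and an assumption here, as is the single hidden layer per side:

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        def __init__(self, x_dim=91, h_dim=91, z_dim=32, c_dim=1):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim),
                                     nn.LeakyReLU())
            self.mu = nn.Linear(h_dim, z_dim)
            self.logvar = nn.Linear(h_dim, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim),
                                     nn.LeakyReLU(),
                                     nn.Linear(h_dim, x_dim))

        def forward(self, x, f0):
            h = self.enc(torch.cat([x, f0], dim=1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
            return self.dec(torch.cat([z, f0], dim=1)), mu, logvar

    model = CVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)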
Experiments

▶ Two kinds of experiments to demonstrate the network's capabilities:
  1. Reconstruction - omit pitch instances during training, and see how well the model reconstructs notes of the omitted target pitch
  2. Generation - how well the model ‘synthesizes’ note instances with new, unseen pitches
Reconstruction

▶ Two training contexts -
  1. Neighbouring pitches (× is the target T, ✓ are training instances):

     MIDI   T−3  T−2  T−1   T   T+1  T+2  T+3
     Kept    ✓    ✓    ✓    ×    ✓    ✓    ✓

  2. Octave endpoints:

     MIDI    60   61   62   63   64   65
     Kept     ✓    ×    ×    ×    ×    ×

     MIDI    66   67   68   69   70   71
     Kept     ×    ×    ×    ×    ×    ✓

▶ In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (see the sketch below)
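As a concrete reading of this metric, a small sketch; the names are illustrative and frame extraction is assumed to happen elsewhere:

    import numpy as np

    def target_mse(model_fn, target_frames):
        """target_frames: iterable of (cc_vector, f0) pairs drawn from all
        frames of all target-pitch test instances."""
        errs = [np.mean((cc - model_fn(cc, f0)) ** 2) for cc, f0 in target_frames]
        return float(np.mean(errs))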
Reconstruction

Figure: MSE (·10^-2) vs target pitch (MIDI 60-72), AE vs CVAE

▶ Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
▶ The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction

▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Figure: Block diagram - (f0, CCs) → cAE → (f0′, CCs′), the ‘conditional’ AE (cAE)

▶ [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
▶ The cAE is comparable to our proposed CVAE, in that the network might potentially learn something from f0
Reconstruction

Figure: MSE (·10^-2) vs target pitch (MIDI 60-72), AE vs CVAE vs cAE

▶ We train the cAE on the octave endpoints and evaluate performance as shown above
▶ Appending f0 does not improve performance; it is in fact worse than the AE
Generation

▶ We are interested in ‘synthesizing’ audio
▶ We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network - Z → Decoder, conditioned on f0

▶ We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (sketched below)
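A sketch of such coherent sampling, reusing the CVAE sketch above; the step size and the walk length are illustrative assumptions, not values from this work:

    import torch

    @torch.no_grad()
    def generate_frames(model, f0_norm, n_frames=200, z_dim=32, sigma=0.05):
        z = torch.randn(1, z_dim)                    # start from the prior
        cond = torch.tensor([[f0_norm]])             # conditioning pitch
        frames = []
        for _ in range(n_frames):
            z = z + sigma * torch.randn_like(z)      # small correlated steps
            frames.append(model.dec(torch.cat([z, cond], dim=1)))
        return torch.cat(frames)                     # (n_frames, 91) CC frames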
Generation

▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
▶ On listening, the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)
▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
Putting it all together

▶ We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones
▶ Our parametric representation decouples ‘timbre’ and ‘pitch’, thus relying on the network to model the inter-dependencies
▶ Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously
Putting it all together

▶ Our model is still far from perfect, however, and we have to improve it further:
  1. An immediate thing to do will be to model the residual of the input audio; this will help in making the generated audio sound more natural
  2. We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics
  3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian classical music
▶ Our contributions:
  1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE
  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research
References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description

1. Input MIDI 63 to the spectral model
2. Spectral model reconstruction (trained on MIDI 63)
3. Spectral model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the parametric model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]
X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
17/36
Generative Models

Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control.
18/36
Network Architecture

Figure: CVAE block diagram - the CC vector and f0 enter the encoder, which maps them to the latent variable Z; the decoder reconstructs the CCs from Z and f0.

- The network input is CCs → the MSE is a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes.
- We train the network on frames from the train instances.
- For evaluation, the MSE values reported in what follows are the average reconstruction error across all test-instance frames.
19/36
Network Architecture

- Main hyperparameters:
  1. β - controls the relative weighting between reconstruction and prior enforcement:

     L ∝ MSE + β · KLD

Figure: CVAE reconstruction MSE (~10^-5 scale) vs. latent space dimensionality (0-70), for β = 0.01, 0.1 and 1.

- Tradeoff between the two terms; we choose β = 0.1 (a minimal loss sketch follows).
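For concreteness, a minimal PyTorch sketch of this weighted objective; the exact reduction over the batch is our assumption, the slide only specifies L ∝ MSE + β·KLD:

    import torch
    import torch.nn.functional as F

    def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
        # Reconstruction term: MSE between input and reconstructed CCs.
        mse = F.mse_loss(x_hat, x)
        # Prior enforcement: KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
        kld = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
        return mse + beta * kld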
20/36
Network Architecture

- Main hyperparameters:
  2. Dimensionality of the latent space - governs the network's reconstruction ability.

Figure: Reconstruction MSE (~10^-5 scale) vs. latent space dimensionality (0-70), CVAE (β = 0.1) vs. AE.

- Steep fall initially, flatter later; we choose dimensionality = 32.
21/36
Network Architecture

- Network size: [91, 91, 32, 91, 91]
  • 91 is the dimension of the input CCs; 32 is the latent space dimensionality
- Linear fully connected layers + leaky ReLU activations
- ADAM [Kingma and Ba 2014] with initial lr = 10^-3
- Training for 2000 epochs with batch size 512 (a sketch of such a network follows)
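A hedged PyTorch sketch of a network matching these hyperparameters; concatenating f0 to the encoder and decoder inputs is a standard CVAE conditioning choice assumed here, not a detail given on the slide:

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        # FC sizes [91, 91, 32, 91, 91] with leaky-ReLU activations, per the slide.
        def __init__(self, in_dim=91, hid_dim=91, z_dim=32, cond_dim=1):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim + cond_dim, hid_dim), nn.LeakyReLU())
            self.mu = nn.Linear(hid_dim, z_dim)
            self.logvar = nn.Linear(hid_dim, z_dim)
            self.dec = nn.Sequential(
                nn.Linear(z_dim + cond_dim, hid_dim), nn.LeakyReLU(),
                nn.Linear(hid_dim, in_dim),
            )

        def forward(self, x, f0):
            h = self.enc(torch.cat([x, f0], dim=1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
            return self.dec(torch.cat([z, f0], dim=1)), mu, logvar

    model = CVAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # ADAM, initial lr = 1e-3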
22/36
Experiments

- Two kinds of experiments to demonstrate the network's capabilities:
  1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch.
  2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches.
23/36
Reconstruction

- Two training contexts:

  1. T (✗) is the target; ✓ marks training instances:

     MIDI   T-3  T-2  T-1   T   T+1  T+2  T+3
     Kept    ✓    ✓    ✓    ✗    ✓    ✓    ✓

  2. Octave endpoints:

     MIDI   60  61  62  63  64  65  66  67  68  69  70  71
     Kept    ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

- In each case we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (see the sketch below).
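The metric itself is simple; a sketch, assuming reconstructions and targets are stacked arrays of per-frame CC vectors:

    import numpy as np

    def average_mse(recon_frames, target_frames):
        # Frame-wise squared envelope error, averaged over all frames of
        # all target instances; inputs are (n_frames, 91) arrays.
        return float(np.mean((recon_frames - target_frames) ** 2))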
24/36
Reconstruction

Figure: Reconstruction MSE (~10^-2 scale) vs. target pitch (MIDI 60-72) for the AE and the CVAE.

- Both the AE and the CVAE reasonably reconstruct the target pitch [audio examples 6, 7, 8].
- The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above) [audio examples 9, 10, 11].
- Conditioning helps to capture the pitch dependency of the spectral envelope more accurately.
25/36
Reconstruction

- To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below:

Figure: 'Conditional' AE (cAE) - [CCs, f0] → cAE → [CCs', f0'].

- [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model.
- The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch follows).
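A minimal sketch of this cAE input construction; the layer sizes are our assumption, mirroring the CVAE's [91, 91, 32, 91, 91] with the appended pitch:

    import torch
    import torch.nn as nn

    cae = nn.Sequential(
        nn.Linear(92, 91), nn.LeakyReLU(), nn.Linear(91, 32), nn.LeakyReLU(),
        nn.Linear(32, 91), nn.LeakyReLU(), nn.Linear(91, 92),
    )
    ccs, f0 = torch.randn(8, 91), torch.rand(8, 1)   # dummy batch of CC frames + pitches
    x = torch.cat([ccs, f0], dim=1)                  # append f0: (batch, 92)
    loss = nn.functional.mse_loss(cae(x), x)         # reconstruct [CCs, f0] jointly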
26/36
Reconstruction

Figure: Reconstruction MSE (~10^-2 scale) vs. target pitch (MIDI 60-72) for the AE, CVAE and cAE.

- We train the cAE on the octave endpoints and evaluate performance as shown above.
- Appending f0 does not improve performance; it is in fact worse than the plain AE.
27/36
Generation

- We are interested in 'synthesizing' audio.
- We see how well the network can generate an instance of a desired pitch (one the network has not been trained on).

Figure: Sampling from the network - a sampled Z and the desired f0 are fed to the decoder.

- We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (see the sketch below).
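A sketch of such coherent sampling, assuming the decoder takes a concatenated [z, f0] vector as in the earlier architecture sketch; the step size is an illustrative value, not one reported in [Blaauw and Bonada 2016] or on the slides:

    import torch

    def random_walk_decode(decoder, f0, n_frames=200, z_dim=32, step=0.05):
        # Start from the prior and take small Gaussian steps so that
        # consecutive frames' envelopes vary smoothly over time.
        z = torch.randn(z_dim)
        frames = []
        for _ in range(n_frames):
            z = z + step * torch.randn(z_dim)
            zc = torch.cat([z, f0]).unsqueeze(0)   # condition on the desired pitch
            frames.append(decoder(zc))
        return torch.cat(frames, dim=0)            # (n_frames, 91) CC envelopes

    # e.g. random_walk_decode(model.dec, torch.tensor([65.0]))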
28/36
Generation

- We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
- On listening, we can hear that the generated audio still lacks the soft, noisy sound of the violin bowing [audio examples 12, 13, 14].
- For a more realistic synthesis, we generate a violin note with vibrato [audio example 15].
29/36
Putting it all together

- We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones.
- Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
- Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, enabling us to vary the pitch contour continuously.
30/36
Putting it all together

- Our model is still far from perfect, however, and we have to improve it further:
  1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural.
  2. We have not taken dynamics into account. A more complete system will capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
  3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.
- Our contributions:
  1. To the best of our knowledge, no prior work uses a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
31/36
References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36
References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36
References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

34/36
References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
35/36
Audio examples description I

1. Input MIDI 63 to the Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the Parametric Model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset

36/36
Audio examples description II

14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
5/36
Our Nearest Neighbours

- A major issue with the previous methods was their frame-wise analysis-synthesis based reconstruction, which fails to model temporal evolution.
- [Engel et al 2017], inspired by WaveNet's [Oord et al 2016] autoregressive modeling capabilities for speech, extended it to musical instrument synthesis.
  ✓ Generation of realistic and creative sounds
  ✗ A complex network, difficult to train and sample from
- [Wyse 2018] also modelled the audio autoregressively, albeit conditioning the waveform samples on additional parameters like pitch, velocity (loudness) and instrument class.
  ✓ Achieved better control over generation through conditioning
  ✗ Unable to generalize to untrained pitches
6/36
Why Parametric?

- Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesizing a given instrument's sound with flexible control over pitch and loudness dynamics.
- Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet 2005].
- A parametric representation, over a raw waveform or spectrogram, has the potential to achieve high quality with less training data.
  1. [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way.
7/36
Dataset

- Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments.
- We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71).

Why we chose the Violin: popular in Indian music, a human voice-like timbre, and the ability to produce continuous pitch.

Figure: Instances per note in the overall dataset (26-34 instances per MIDI pitch, MIDI 60-71).

- The average note duration for the chosen octave is about 4.5 s per note.
8/36
Dataset

- We split the data into train (80%) and test (20%) instances across MIDI note labels.
- Our model is trained on frames (duration 21.3 ms) from the train instances, and system performance is evaluated on frames from the test instances.
9/36
Non-Parametric Reconstruction

- The frame-wise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches.
- We consider two cases: training including and excluding MIDI 63, along with its 3 neighbouring pitches on either side.

Figure: AE pipeline - sustain portion → FFT → Encoder → Z → Decoder.

- Repeat the above across all input spectral frames, and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984] (see the sketch below).
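The inversion step can be done with, e.g., librosa's Griffin-Lim; a minimal sketch with a stand-in magnitude spectrogram (the FFT and hop sizes here are illustrative assumptions, not the values used in the experiments):

    import numpy as np
    import librosa

    y = librosa.tone(440.0, sr=48000, duration=1.0)             # stand-in source signal
    mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # (reconstructed) magnitudes
    audio = librosa.griffinlim(mag, n_iter=32, hop_length=256)  # phase-less inversion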
10/36
Non-Parametric Reconstruction

Figure: Input MIDI 63 [audio example 1]; reconstruction when MIDI 63 is included in training [audio example 2]; reconstruction when MIDI 63 is excluded [audio example 3].
11/36
Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation, using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual).

Figure: Sustain portion → FFT → HpR → f0 + harmonic magnitudes.

- The output of the HpR block ⇒ log-dB magnitudes + harmonics.
12/36
Parametric Model

2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]
- TAE ⇒ iterative cepstral liftering:

  A_0(k) = log|X(k)|, V_0 = -∞
  while (A_i - A_0 < Δ):
      A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
      V_i = FFT(lifter(A_i))

Figure: f0 + magnitudes → TAE → CCs + f0.
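Since we had to implement the TAE ourselves (next slide), a condensed NumPy sketch of the iteration above; the lifter keeps the first K_cc cepstral coefficients, and the stopping tolerance Δ is an assumed value:

    import numpy as np

    def true_amplitude_envelope(log_mag, k_cc, delta=0.1, max_iter=100):
        # log_mag: full (symmetric) log-magnitude spectrum of one frame.
        n = len(log_mag)
        a0 = log_mag.copy()
        a = a0.copy()
        v = np.full(n, -np.inf)            # V_0 = -infinity
        for _ in range(max_iter):
            a = np.maximum(a, v)           # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
            c = np.fft.ifft(a).real        # real cepstrum of the log spectrum
            c[k_cc + 1:n - k_cc] = 0.0     # lifter: keep the first k_cc coefficients
            v = np.fft.fft(c).real         # V_i = FFT(lifter(A_i))
            if np.max(a0 - v) < delta:     # envelope now covers every harmonic peak
                break
        return v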
13/36
Parametric Model

- No open-source implementation is available for the TAE, so we implemented it following the procedure described in [Roebel and Rodet 2005, Caetano and Rodet 2012].
- The figure shows a TAE snapshot from [Caetano and Rodet 2012], similar to the results we obtain [audio examples 4, 5].
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
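Likewise, since the TAE implementation is what we intend to release, the following is a minimal sketch of the iterative cepstral liftering at its core, following [Imai, 1979] and [Roebel and Rodet, 2005]; the stopping threshold delta and the iteration cap are illustrative assumptions, not tuned values from our experiments.

```python
import numpy as np

def true_amplitude_envelope(log_mag, kcc, delta=0.1, max_iter=200):
    """Sketch of the TAE iteration: log_mag is the log-magnitude spectrum
    of one analysis frame (full N-point FFT, symmetric about N/2); kcc is
    the number of cepstral coefficients kept by the lifter."""
    A = log_mag.copy()                   # A_0(k) = log|X(k)|
    V = np.full_like(A, -np.inf)         # V_0 = -inf
    n = len(A)
    for _ in range(max_iter):
        A = np.maximum(A, V)             # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        c = np.fft.ifft(A).real          # real cepstrum of the current curve
        c[kcc + 1: n - kcc] = 0.0        # lifter: keep only low quefrencies
        V = np.fft.fft(c).real           # V_i: cepstrally smoothed envelope
        if np.max(log_mag - V) < delta:  # stop once the envelope covers the peaks
            break
    return V
```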
References I
[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1. Input MIDI 63 note to the Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the Parametric Model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]
X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
Parametric Model
1 Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)
Figure: sustain segment → FFT → HpR → f0 + harmonic magnitudes
I Output of HpR block ⇒ log (dB) magnitudes + harmonics
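Real HpR analysis detects and tracks sinusoidal peaks; as a simplified, hedged stand-in, one can sample dB magnitudes at the harmonic bins of one frame (all names here are ours):

```python
import numpy as np

def harmonic_magnitudes(mag_spectrum, f0, fs, n_fft):
    """Sample log (dB) magnitudes at integer multiples of f0 (simplified HpR)."""
    n_harm = int((fs / 2) // f0)                         # harmonics below Nyquist
    freqs = np.arange(1, n_harm + 1) * f0                # harmonic frequencies
    bins = np.round(freqs * n_fft / fs).astype(int)      # nearest FFT bins
    mags_db = 20 * np.log10(mag_spectrum[bins] + 1e-12)  # dB magnitudes
    return freqs, mags_db
```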
Parametric Model
2 log (dB) magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]
I TAE ⇒ iterative cepstral liftering:
A_0(k) = log|X(k)|, V_0 = -∞
while (A_i - A_0 < Δ):
    A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
    V_i = FFT(lifter(A_i))
Figure: f0 + harmonic magnitudes → TAE → cepstral coefficients (CCs) + f0
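A minimal numpy sketch of the loop above, assuming the frame's log-magnitude spectrum has already been resampled onto a uniform frequency grid; the convergence test is our interpretation of the stopping criterion, and the names are ours:

```python
import numpy as np

def true_amplitude_envelope(log_mag, k_cc, max_iter=200, delta=0.5):
    """Iterative cepstral liftering (TAE), following the pseudocode above."""
    A = log_mag.copy()                    # A_0(k) = log|X(k)|
    V = np.full_like(A, -np.inf)          # V_0 = -inf
    for _ in range(max_iter):
        A = np.maximum(A, V)              # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        ceps = np.fft.rfft(A)
        ceps[k_cc:] = 0                   # lifter: keep the first k_cc coefficients
        V = np.fft.irfft(ceps, n=len(A))  # V_i = FFT(lifter(A_i)), smooth envelope
        if np.max(A - V) < delta:         # stop once the envelope hugs the peaks
            break
    return V
```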
Parametric Model
I No open-source implementation was available for the TAE, so we implemented it following the procedure described in [Roebel and Rodet 2005, Caetano and Rodet 2012]
I The figure shows a TAE snapshot from [Caetano and Rodet 2012]
I It is similar to the results we get (audio examples 4, 5)
Parametric Model
Figure: TAE spectral envelopes for MIDI 60, 65 and 71 (magnitude in dB vs frequency in kHz)
I Spectral envelope shape varies across pitch:
1 Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
2 Variation due to the TAE algorithm
I TAE → a smooth function to estimate harmonic amplitudes
Parametric Model
I The number of CCs (cepstral coefficients, K_cc) depends on the sampling rate (F_s) and the pitch (f_0):

    K_cc ≤ F_s / (2 f_0)

I For our network we choose the maximum K_cc (lowest pitch) = 91, and zero-pad the higher-pitch CC vectors to 91-dimensional vectors
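A small sketch of this bookkeeping; the F_s = 48 kHz value is our assumption, chosen because it is consistent with K_cc = 91 at the octave's lowest pitch (MIDI 60, f_0 ≈ 261.6 Hz):

```python
import numpy as np

def kcc(fs, f0):
    """K_cc <= F_s / (2 * f_0): the lifter order the pitch can support."""
    return int(fs // (2 * f0))

def pad_ccs(ccs, target_dim=91):
    """Zero-pad higher-pitch CC vectors to the fixed network input size."""
    out = np.zeros(target_dim)
    out[:len(ccs)] = ccs
    return out

print(kcc(48000, 261.63))   # -> 91 for MIDI 60 (assumed Fs of 48 kHz)
```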
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (mean squared error) between the input and the network's reconstruction
• Simple to train, and give good reconstruction (since they minimize MSE)
• Not truly a generative model, as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] - inspired by variational inference; enforce a prior on the latent space
• Truly generative, as we can 'generate' new data by sampling from the prior
I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, but learn the distribution conditioned on an additional conditioning variable
Generative Models
Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio)
Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control
Network Architecture
Figure: CVAE - input CCs (+ f0 conditioning) → Encoder → Z → Decoder (+ f0 conditioning) → reconstructed CCs
I The network input is CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation, the MSE values reported ahead are the average reconstruction error across all test-instance frames
Network Architecture
I Main hyperparameters -
1 β - controls the relative weighting between reconstruction and prior enforcement:

    L ∝ MSE + β · KLD

Figure: reconstruction MSE (scale 10^-5) vs latent space dimensionality, for β = 0.01, 0.1 and 1
I Tradeoff between both terms; we choose β = 0.1 (a sketch of the weighted objective follows)
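A hedged PyTorch sketch of the objective L ∝ MSE + β·KLD (variable names are ours):

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    """Reconstruction MSE plus beta-weighted KL divergence to the N(0, I) prior."""
    mse = F.mse_loss(x_hat, x)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld
```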
Network Architecture
I Main hyperparameters -
2 Dimensionality of the latent space - governs the network's reconstruction ability
Figure: reconstruction MSE (scale 10^-5) vs latent space dimensionality, CVAE (β = 0.1) vs AE
I Steep fall initially, flatter later; we choose dimensionality = 32
Network Architecture
I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10^-3
I Training for 2000 epochs with batch size 512
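Putting the stated sizes together, a minimal PyTorch sketch of such a CVAE; the exact way f0 is injected into the encoder and decoder is our assumption:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """[91, 91, 32, 91, 91] fully connected CVAE, conditioned on pitch."""
    def __init__(self, cc_dim=91, hidden=91, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(cc_dim + 1, hidden), nn.LeakyReLU())
        self.mu, self.logvar = nn.Linear(hidden, latent), nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent + 1, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, cc_dim))

    def forward(self, ccs, f0):                        # f0: (batch, 1), normalized
        h = self.enc(torch.cat([ccs, f0], dim=-1))     # condition encoder on pitch
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        x_hat = self.dec(torch.cat([z, f0], dim=-1))   # condition decoder on pitch
        return x_hat, mu, logvar

model = CVAE()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr = 10^-3
```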
Experiments
I Two kinds of experiments to demonstrate the network's capabilities:
1 Reconstruction - omit a pitch's instances during training, and see how well the model reconstructs notes of the omitted target pitch
2 Generation - how well the model 'synthesizes' note instances with new, unseen pitches
Reconstruction
I Two training contexts (see the sketch after this list) -
1 T (×) is the target, ✓ are training instances:
    MIDI:  T-3  T-2  T-1   T   T+1  T+2  T+3
    Kept:   ✓    ✓    ✓    ×    ✓    ✓    ✓
2 Octave endpoints:
    MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
    Kept:   ✓   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   ✓
I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
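As promised above, a small sketch of the two training contexts as sets of kept MIDI notes (helper names are ours):

```python
def neighbour_context(target):
    """Context 1: keep the 3 neighbours on either side, omit the target itself."""
    return {m for m in range(target - 3, target + 4) if m != target}

def endpoint_context():
    """Context 2: keep only the octave endpoints, MIDI 60 and 71."""
    return {60, 71}

print(neighbour_context(63))   # {60, 61, 62, 64, 65, 66}
```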
Reconstruction
Figure: reconstruction MSE (scale 10^-2) vs target pitch, AE vs CVAE
I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
I Conditioning helps capture the pitch dependency of the spectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as sketched below
Figure: 'Conditional' AE (cAE) - [CCs, f0] → cAE → [CCs', f0']
I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
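A hedged sketch of this cAE baseline; the layer sizes mirror the CVAE above, and the exact arrangement is our assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Plain AE whose input and reconstruction target are the appended vector [CCs, f0].
cae = nn.Sequential(
    nn.Linear(92, 91), nn.LeakyReLU(),
    nn.Linear(91, 32), nn.LeakyReLU(),   # 32-dim bottleneck, as in the CVAE
    nn.Linear(32, 91), nn.LeakyReLU(),
    nn.Linear(91, 92))                   # reconstruct CCs and f0 together

x = torch.cat([torch.randn(512, 91), torch.randn(512, 1)], dim=-1)  # [CCs, f0]
loss = F.mse_loss(cae(x), x)             # MSE over the appended input
```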
Reconstruction
Figure: reconstruction MSE (scale 10^-2) vs pitch - AE, CVAE and cAE
I We train the cAE on the octave endpoints and evaluate performance as above
I Appending f0 does not improve performance; it is in fact worse than the plain AE
Generation
I We are interested in 'synthesizing' audio
I We see how well the network can generate an instance of a desired pitch (one the network has not been trained on)
Figure: Sampling from the network - Z (sampled from the prior) + f0 → Decoder
I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (sketched below)
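A minimal sketch of that sampling scheme; the step size and the frame loop are our assumptions:

```python
import torch

@torch.no_grad()
def random_walk_generate(decoder, f0, n_frames, latent_dim=32, step=0.05):
    """Sample latent points coherently via a random walk, decode frame by frame."""
    z = torch.randn(latent_dim)                 # start at a draw from the prior
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(latent_dim)  # small step -> slowly varying timbre
        frames.append(decoder(torch.cat([z, f0], dim=-1)))
    return torch.stack(frames)                  # frame-wise envelope CCs
```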
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
Putting it all together
I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning lets us generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously
Putting it all together
I Our model is still far from perfect, and we have to improve it further:
1 An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural
2 We have not taken dynamics into account; a more complete system will capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3 Conduct more formal listening tests, involving synthesis of larger pitch movements and melodic elements from Indian Classical music
I Our contributions:
1 To the best of our knowledge, we have not come across any prior work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research
References

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description

1 Input MIDI 63 to the spectral model
2 Spectral model reconstruction (trained on MIDI 63)
3 Spectral model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to the parametric model
5 Parametric reconstruction of the input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of the input
8 Parametric CVAE reconstruction of the input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of the input (endpoint-trained model)
11 Parametric CVAE reconstruction of the input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from the dataset
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
31/36
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
32/36
References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
33/36
References III
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
34/36
References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
35/36
Audio examples description I
1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset
36/36
Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
436
Our Nearest Neighbours
I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra
X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis
times lsquoGraininessrsquo in the reconstructed sound
I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures
X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation
I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments
X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative sounds
times Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
Parametric Model
2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet, 2005; IMAI, 1979].
- TAE ⇒ iterative cepstral liftering:

  A_0(k) = log(|X(k)|), V_0 = −∞
  while (A_i − A_0 < ∆):
      A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
      V_i = FFT(lifter(A_i))

[Diagram: f0, magnitudes → TAE → CCs, f0]
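A minimal NumPy sketch of the iterative cepstral liftering above. The stopping rule (all spectral peaks covered within ∆) and the iteration cap are our reading of [Roebel and Rodet, 2005], not the talk's exact code.

import numpy as np

def true_amplitude_envelope(log_mag, k_cc, delta=0.1, max_iter=200):
    # log_mag: full-length (N) log-magnitude spectrum of one frame,
    # symmetric as obtained from log(abs(fft(frame))). Requires 2*k_cc < N.
    A = np.asarray(log_mag, dtype=float).copy()   # A_0(k) = log|X(k)|
    N = len(A)
    V = np.full(N, -np.inf)                       # V_0 = -infinity
    for _ in range(max_iter):
        A = np.maximum(A, V)        # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        cep = np.fft.ifft(A).real   # real cepstrum of the current spectrum
        cep[k_cc:N - k_cc] = 0.0    # lifter: zero out the high quefrencies
        V = np.fft.fft(cep).real    # V_i = FFT(lifter(A_i))
        if np.max(A - V) < delta:   # envelope covers all peaks within delta
            break
    return V, cep[:k_cc]            # envelope + cepstral coefficients (CCs)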
Parametric Model
- No open-source implementation is available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet, 2005; Caetano and Rodet, 2012].
- The figure shows a TAE snapshot from [Caetano and Rodet, 2012], similar to the results we get (audio examples 4, 5).
Parametric Model

Figure: TAE spectral envelopes, magnitude (dB, −80 to −20) vs. frequency (0-5 kHz), for MIDI 60, 65 and 71.

- Spectral envelope shape varies across pitch:
  1. Dependence of the envelope on pitch [Slawson, 1981; Caetano and Rodet, 2012]
  2. Variation due to the TAE algorithm
- TAE → a smooth function to estimate the harmonic amplitudes.
Parametric Model
- The number of CCs (Cepstral Coefficients, K_cc) depends on the sampling rate (F_s) and the pitch (f_0):

  K_cc ≤ F_s / (2 f_0)

- For our network, we choose the maximum K_cc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions.
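A small sketch of this bound and the zero-padding. F_s = 48 kHz is an assumption, chosen because it is consistent with K_cc = 91 at the lowest pitch (MIDI 60, f_0 ≈ 261.6 Hz).

import numpy as np

def midi_to_hz(m):
    return 440.0 * 2 ** ((m - 69) / 12)

def k_cc(fs, f0):
    return int(fs / (2 * f0))            # K_cc <= Fs / (2 f0)

k_max = k_cc(48000, midi_to_hz(60))      # lowest pitch -> 91 coefficients

def pad_ccs(ccs, k_max=91):
    out = np.zeros(k_max)                # zero-pad higher-pitch CC vectors
    out[:len(ccs)] = ccs
    return out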
Generative Models
- Autoencoders [Hinton and Salakhutdinov, 2006]: minimize the MSE (Mean Squared Error) between the input and the network reconstruction.
  • Simple to train, and perform good reconstruction (since they minimize MSE)
  • Not truly generative, as you cannot generate new data
- Variational Autoencoders [Kingma and Welling, 2013]: inspired by Variational Inference, enforce a prior on the latent space.
  • Truly generative, as we can 'generate' new data by sampling from the prior
- Conditional Variational Autoencoders [Doersch, 2016; Sohn et al., 2015]: same principle as a VAE, but learn the distribution conditioned on an additional conditioning variable.
Generative Models
- Why VAE over AE? A continuous latent space from which we can sample points (and synthesize the corresponding audio).
- Why CVAE over VAE? Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control.
Network Architecture

[Diagram: CCs + f0 → Encoder → Z → Decoder (CVAE)]

- The network input is CCs → the MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log-magnitude spectral envelopes.
- We train the network on frames from the train instances.
- For evaluation, the MSE values reported ahead are the average reconstruction error across all test-instance frames.
Network Architecture
- Main hyperparameters:
  1. β: controls the relative weighting between reconstruction and prior enforcement,

     L ∝ MSE + β · KLD

Figure: CVAE, varying β — MSE (·10⁻⁵) vs. latent space dimensionality (0-70), for β = 0.01, 0.1, 1.

- Tradeoff between both terms; we choose β = 0.1.
Network Architecture
- Main hyperparameters:
  2. Dimensionality of the latent space: governs the network's reconstruction ability.

Figure: CVAE (β = 0.1) vs. AE — MSE (·10⁻⁵) vs. latent space dimensionality (0-70).

- Steep fall initially, flatter later; we choose dimensionality = 32.
Network Architecture
- Network size: [91, 91, 32, 91, 91]; 91 is the dimension of the input CCs, 32 is the latent space dimensionality.
- Linear fully connected layers + Leaky ReLU activations.
- ADAM [Kingma and Ba, 2014] with initial learning rate 10⁻³.
- Training for 2000 epochs with batch size 512.
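A compact PyTorch sketch consistent with the sizes and loss above ([91, 91, 32, 91, 91], Leaky ReLU, ADAM at 10⁻³, L ∝ MSE + β·KLD with β = 0.1). The concatenation-style pitch conditioning and the scalar f0 input are standard-CVAE assumptions, not confirmed details of the talk's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, x_dim=91, cond_dim=1, hidden=91, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + cond_dim, hidden), nn.LeakyReLU())
        self.mu, self.logvar = nn.Linear(hidden, z_dim), nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x, f0):
        h = self.enc(torch.cat([x, f0], dim=-1))         # condition the encoder
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

def cvae_loss(x, x_hat, mu, logvar, beta=0.1):
    mse = F.mse_loss(x_hat, x)                           # reconstruction term
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld                              # L = MSE + beta * KLD

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)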
Experiments
- Two kinds of experiments to demonstrate the network's capabilities:
  1. Reconstruction: omit instances of a pitch during training and see how well the model reconstructs notes of the omitted target pitch.
  2. Generation: how well the model 'synthesizes' note instances with new, unseen pitches.
Reconstruction
- Two training contexts:
  1. T (×) is the target; ✓ marks kept training instances:

     MIDI: T−3  T−2  T−1   T   T+1  T+2  T+3
     Kept:  ✓    ✓    ✓    ×    ✓    ✓    ✓

  2. Octave endpoints:

     MIDI: 60  61  62  63  64  65  66  67  68  69  70  71
     Kept:  ✓   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   ✓

- In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances.
Reconstruction

Figure: MSE (·10⁻²) vs. target pitch (MIDI 60-72) for AE and CVAE.

- Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8).
- The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11).
- Conditioning helps to capture the pitch dependency of the spectral envelope more accurately.
Reconstruction
- To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as sketched below.

[Diagram: (f0, CCs) → cAE → (f0′, CCs′). Figure: 'Conditional' AE (cAE)]

- [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model.
- The cAE is comparable to our proposed CVAE in that the network might potentially learn something from f0.
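A hypothetical sketch of this cAE baseline, assuming plain concatenation and the same layer sizes as the CVAE; the dummy batch is only for illustration.

import torch
import torch.nn as nn

# 'Conditional' AE: f0 is appended to the 91 CCs and the whole 92-dim vector
# is reconstructed; nothing forces the bottleneck to exploit the pitch.
cae = nn.Sequential(
    nn.Linear(92, 91), nn.LeakyReLU(),
    nn.Linear(91, 32), nn.LeakyReLU(),   # latent bottleneck
    nn.Linear(32, 91), nn.LeakyReLU(),
    nn.Linear(91, 92))

ccs, f0 = torch.randn(8, 91), torch.rand(8, 1)  # dummy batch: CCs + normalized f0
x = torch.cat([ccs, f0], dim=-1)
x_hat = cae(x)                                  # reconstructs [CCs', f0']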
Reconstruction

Figure: MSE (·10⁻²) vs. target pitch (MIDI 60-72) for AE, CVAE and cAE.

- We train the cAE on the octave endpoints and evaluate performance as shown above.
- Appending f0 does not improve performance; it is in fact worse than the plain AE.
Generation
- We are interested in 'synthesizing' audio.
- We see how well the network can generate an instance of a desired pitch (one the network has not been trained on).

[Diagram: Z → Decoder, conditioned on f0. Figure: Sampling from the network]

- We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space.
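A sketch of such coherent sampling, reusing the CVAE decoder sketched earlier; the step size and the assumed-normalized pitch input are illustrative choices rather than the talk's exact settings.

import torch

@torch.no_grad()
def sample_note(model, f0_value, n_frames=300, step=0.05, z_dim=32):
    z = torch.randn(z_dim)                     # start from the prior
    f0 = torch.tensor([f0_value])              # assumed-normalized pitch input
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(z_dim)      # small Gaussian random-walk steps
        frames.append(model.dec(torch.cat([z, f0])))
    return torch.stack(frames)                 # frame-wise CCs for synthesis

Small steps keep consecutive latent points close, so the decoded envelopes evolve smoothly instead of jumping frame to frame.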
Generation
- We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
- On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14).
- For a more realistic synthesis, we generate a violin note with vibrato (audio example 15), using a frame-wise pitch contour as sketched below.
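One way to realize the vibrato is a sinusoidally modulated f0 contour, one value per analysis frame; the vibrato rate, depth and frame rate below are illustrative assumptions.

import numpy as np

frame_rate = 48000 / 256                       # frames per second (hop assumed)
t = np.arange(300) / frame_rate
# MIDI 65 (~349.23 Hz) with a ~6 Hz, 1% (~17 cent) vibrato:
f0_hz = 349.23 * (1 + 0.01 * np.sin(2 * np.pi * 6.0 * t))

Each f0_hz value conditions one decoded envelope frame, and the harmonics are then placed at multiples of that frame's f0.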
Putting it all together
- We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones.
- Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
- Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously.
Putting it all together
- Our model is still far from perfect, however, and we have to improve it further:
  1. An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural.
  2. We have not taken dynamics into account. A more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
  3. Conducting more formal listening tests, involving the synthesis of larger pitch movements or melodic elements from Indian Classical music.
- Our contributions:
  1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I

1. Input MIDI 63 to the spectral model
2. Spectral model reconstruction (trained on MIDI 63)
3. Spectral model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the parametric model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
Audio examples description II

14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
Our Nearest Neighbours

▶ A major issue with the previous methods was their frame-wise analysis–synthesis based reconstruction, which fails to model the temporal evolution of the sound.
▶ [Engel et al., 2017], inspired by WaveNet's [Oord et al., 2016] autoregressive modeling capabilities for speech, extended it to musical instrument synthesis.
  ✓ Generation of realistic and creative sounds
  ✗ Complex network that is difficult to train and sample from
▶ [Wyse, 2018] also modelled the audio autoregressively, albeit conditioning the waveform samples on additional parameters like pitch, velocity (loudness) and instrument class.
  ✓ Achieved better control over generation through conditioning
  ✗ Unable to generalize to untrained pitches
Why Parametric?

▶ Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesizing a given instrument's sound with flexible control over the pitch and loudness dynamics.
▶ Pitch shifting without timbre modification uses a source–filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet, 2005].
▶ A parametric representation, over a raw waveform or spectrogram, has the potential to achieve high quality with less training data.
  1. [Blaauw and Bonada, 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results.
Dataset

▶ Good-sounds dataset [Romani Picas et al., 2015], consisting of individual note and scale recordings for 12 different instruments.
▶ We work with the 'Violin', played at mezzo-forte loudness, and choose the 4th octave (MIDI 60–71).

Why we chose the Violin: it is popular in Indian music, has a human voice-like timbre, and can produce continuous pitch.

Figure: Instances per note in the overall dataset (26–34 instances per MIDI note across 60–71).

▶ The average note duration for the chosen octave is about 4.5 s per note.
Dataset

▶ We split the data into train (80%) and test (20%) instances across MIDI note labels; a sketch of such a split follows below.
▶ Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances.
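A minimal sketch of the 80/20 instance split, assuming a list of (path, MIDI label) pairs built by scanning the dataset; the paths and the use of scikit-learn are illustrative assumptions, not details from the slides.

```python
from sklearn.model_selection import train_test_split

# `instances` is assumed to be a list of (path, midi_label) pairs built by
# scanning the dataset; the entries here are hypothetical placeholders.
instances = [("violin_60_01.wav", 60), ("violin_61_01.wav", 61)] * 10

train_files, test_files = train_test_split(
    instances,
    test_size=0.2,                        # 20% of instances held out
    stratify=[m for _, m in instances],   # keep the per-MIDI-label balance
    random_state=0,
)
```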
Non-Parametric Reconstruction

▶ The frame-wise magnitude spectra reconstruction procedure in [Roche et al., 2018] cannot generalize to untrained pitches.
▶ We consider two cases: training including and excluding MIDI 63, along with its 3 neighbouring pitches on either side.

Sustain → FFT → Encoder → Z → Decoder (AE)

▶ Repeat the above across all input spectral frames, and invert the obtained spectrogram using Griffin–Lim [Griffin and Lim, 1984]; a sketch of this inversion step follows below.
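A rough sketch of the analysis–synthesis loop just described, assuming librosa is available; the file name, STFT parameters, and the `autoencoder` wrapper are illustrative placeholders, not details given in the slides.

```python
import numpy as np
import librosa

y, sr = librosa.load("violin_note.wav", sr=None)          # hypothetical file
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # magnitude spectra

# Reconstruct each spectral frame with the (hypothetical) trained autoencoder:
S_hat = np.stack([autoencoder.reconstruct(frame) for frame in S.T], axis=1)

# Invert the phaseless magnitude spectrogram with Griffin-Lim:
y_hat = librosa.griffinlim(S_hat, n_iter=32, hop_length=256, n_fft=1024)
```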
Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1). Figure: Reconstruction, trained including MIDI 63 (audio example 2). Figure: Reconstruction, trained excluding MIDI 63 (audio example 3).
Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al., 1997] (currently we neglect the residual).

Sustain → FFT → HpR → f0, Magnitudes

▶ Output of the HpR block ⇒ log-dB magnitudes + harmonics (see the sketch below)
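A minimal sketch of reading off the harmonic representation, assuming the frame's f0 is already known (e.g. from a pitch tracker). This simply samples the magnitude spectrum at harmonic bins and is a simplification of the full HpR analysis of [Serra et al., 1997]; all names are illustrative.

```python
import numpy as np

def harmonic_magnitudes(mag_frame, f0, sr, n_fft, n_harmonics):
    """Return log-dB magnitudes at integer multiples of f0 (one frame)."""
    freqs = f0 * np.arange(1, n_harmonics + 1)
    freqs = freqs[freqs < sr / 2]                  # keep harmonics below Nyquist
    bins = np.round(freqs * n_fft / sr).astype(int)
    return 20 * np.log10(mag_frame[bins] + 1e-12)  # log-dB amplitudes
```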
Parametric Model

2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet, 2005, Imai, 1979]
▶ TAE ⇒ iterative cepstral liftering:

A₀(k) = log(|X(k)|), V₀ = −∞
while (Aᵢ − A₀ < ∆):
    Aᵢ(k) = max(Aᵢ₋₁(k), Vᵢ₋₁(k))
    Vᵢ = FFT(lifter(Aᵢ))

f0, Magnitudes → TAE → CCs, f0
Parametric Model

▶ No open-source implementation is available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet, 2005, Caetano and Rodet, 2012]; a sketch follows below.
▶ The figure shows a TAE snapshot from [Caetano and Rodet, 2012], similar to the results we get (audio examples 4, 5).
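A minimal sketch of the True Amplitude Envelope iteration shown above, operating on one frame's log-magnitude spectrum sampled on a uniform grid; the stopping threshold and iteration cap are illustrative values, not taken from the slides.

```python
import numpy as np

def true_amplitude_envelope(A0, kcc, max_iter=100, delta=2.0):
    """Iterative cepstral liftering: A0 is the log-magnitude spectrum of one
    frame, kcc the cepstral order (lifter cutoff)."""
    A = A0.copy()
    V = np.full_like(A0, -np.inf)
    for _ in range(max_iter):
        A = np.maximum(A, V)             # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        c = np.fft.rfft(A)               # quefrency coefficients of A_i
        c[kcc + 1:] = 0.0                # lifter: keep the first kcc coefficients
        V = np.fft.irfft(c, n=len(A))    # smooth envelope estimate V_i
        if np.max(A0 - V) < delta:       # stop once V lies close above A0
            break
    return V
```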
Parametric Model

Figure: TAE spectral envelopes (magnitude in dB vs frequency in kHz) for MIDI 60, 65 and 71.

▶ The spectral envelope shape varies across pitch:
  1. Dependence of the envelope on pitch [Slawson, 1981, Caetano and Rodet, 2012]
  2. Variation due to the TAE algorithm
▶ TAE → a smooth function to estimate the harmonic amplitudes
Parametric Model

▶ The number of CCs (Cepstral Coefficients, Kcc) depends on the sampling rate (Fs) and the pitch (f0):

Kcc ≤ Fs / (2·f0)

▶ For our network, we choose the maximum Kcc (lowest pitch) = 91, and zero-pad the higher-pitch CC vectors to 91 dimensions, as sketched below.
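A small sketch of this dimensionality handling, assuming a 48 kHz sampling rate (an assumption; the slides do not state Fs) and the 4th octave; the padding length 91 is taken from the slides.

```python
import numpy as np

FS = 48000     # assumed sampling rate
K_MAX = 91     # Kcc for the lowest pitch considered (per the slides)

def midi_to_hz(m):
    return 440.0 * 2 ** ((m - 69) / 12)

def pad_ccs(ccs):
    """Zero-pad a CC vector of length Kcc <= 91 to a fixed 91-dim input."""
    return np.pad(ccs, (0, K_MAX - len(ccs)))

k_cc = int(FS / (2 * midi_to_hz(60)))  # Kcc <= Fs/(2*f0); gives 91 for MIDI 60
```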
Generative Models

▶ Autoencoders [Hinton and Salakhutdinov, 2006]: minimize the MSE (Mean Squared Error) between the input and the network reconstruction.
  • Simple to train, and perform good reconstruction (since they minimize the MSE)
  • Not truly generative models, as you cannot generate new data
▶ Variational Autoencoders [Kingma and Welling, 2013]: inspired by Variational Inference, they enforce a prior on the latent space.
  • Truly generative, as we can 'generate' new data by sampling from the prior
▶ Conditional Variational Autoencoders [Doersch, 2016, Sohn et al., 2015]: same principle as a VAE, but learn the distribution conditioned on an additional conditioning variable.
Generative Models

Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control.
Network Architecture

CCs, f0 → Encoder → Z → Decoder (CVAE)

▶ The network input is CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes.
▶ We train the network on frames from the train instances.
▶ For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames.
Network Architecture

▶ Main hyperparameters:
  1. β: controls the relative weighting between reconstruction and prior enforcement,

L ∝ MSE + β·KLD

Figure: CVAE MSE (·10⁻⁵) vs latent space dimensionality, varying β ∈ {0.01, 0.1, 1}.

▶ There is a tradeoff between both terms; we choose β = 0.1. (A sketch of this loss follows below.)
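A minimal sketch of the β-weighted loss above, written in PyTorch (an assumption; the slides do not name the framework), with a standard-normal prior on the latent space.

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    """L ∝ MSE + beta * KLD, with a N(0, I) prior on the latent space."""
    mse = F.mse_loss(x_hat, x, reduction="mean")
    # KL divergence between N(mu, sigma^2) and N(0, I):
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld
```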
Network Architecture

▶ Main hyperparameters:
  2. Dimensionality of the latent space: governs the network's reconstruction ability.

Figure: MSE (·10⁻⁵) vs latent space dimensionality, CVAE (β = 0.1) vs AE.

▶ Steep fall initially, flatter later; we choose dimensionality = 32.
Network Architecture

▶ Network size: [91, 91, 32, 91, 91]
  ▶ 91 is the dimension of the input CCs; 32 is the latent space dimensionality.
▶ Linear fully connected layers + Leaky ReLU activations
▶ ADAM [Kingma and Ba, 2014] with initial lr = 10⁻³
▶ Training for 2000 epochs with batch size 512 (see the sketch below)
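A minimal sketch of such a CVAE ([91, 91, 32, 91, 91], linear layers + Leaky ReLU, pitch conditioning), again in PyTorch (an assumption). The one-hot pitch encoding of width 12 is illustrative; the slides do not specify how f0 is encoded.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim=91, z_dim=32, c_dim=12):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, 91), nn.LeakyReLU())
        self.mu, self.logvar = nn.Linear(91, z_dim), nn.Linear(91, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, 91), nn.LeakyReLU(),
                                 nn.Linear(91, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr = 1e-3
```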
Experiments

▶ Two kinds of experiments demonstrate the network's capabilities:
  1. Reconstruction: omit pitch instances during training, and see how well the model reconstructs notes of the omitted target pitch.
  2. Generation: how well the model 'synthesizes' note instances with new, unseen pitches.
Reconstruction

▶ Two training contexts:
  1. T (✗) is the target; ✓ marks training instances:

     MIDI: T−3 T−2 T−1  T  T+1 T+2 T+3
     Kept:  ✓   ✓   ✓   ✗   ✓   ✓   ✓

  2. Octave endpoints:

     MIDI: 60 61 62 63 64 65 66 67 68 69 70 71
     Kept:  ✓  ✗  ✗  ✗  ✗  ✗  ✗  ✗  ✗  ✗  ✗  ✓

▶ In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances.
Reconstruction

Figure: MSE (·10⁻²) vs target pitch (MIDI 60–72), AE vs CVAE.

▶ Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8).
▶ The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11).
▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately.
Reconstruction

▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input:

f0, CCs → cAE → f0′, CCs′
Figure: 'Conditional' AE (cAE)

▶ [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model.
▶ The cAE is comparable to our proposed CVAE, in that the network might potentially learn something from f0.
Reconstruction

Figure: MSE (·10⁻²) vs pitch (MIDI 60–72), AE vs CVAE vs cAE.

▶ We train the cAE on the octave endpoints and evaluate performance as shown above.
▶ Appending f0 does not improve performance; it is in fact worse than the plain AE.
Generation

▶ We are interested in 'synthesizing' audio.
▶ We see how well the network can generate an instance of a desired pitch (one the network has not been trained on).

Z, f0 → Decoder
Figure: Sampling from the network

▶ We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space, as sketched below.
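A minimal sketch of this coherent latent sampling, reusing the hypothetical CVAE sketch above; the step size, walk length, and conditioning vector `c` (a one-hot pitch encoding) are illustrative assumptions.

```python
import torch

def sample_random_walk(model, c, n_frames=200, z_dim=32, step=0.05):
    """Random-walk through the latent space, decoding one envelope per frame."""
    z = torch.randn(z_dim)                     # start from the prior
    frames = []
    with torch.no_grad():
        for _ in range(n_frames):
            z = z + step * torch.randn(z_dim)  # small coherent step
            x = model.dec(torch.cat([z, c]))   # decode CCs for this frame
            frames.append(x)
    return torch.stack(frames)
```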
Generation

▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
▶ On listening, the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14).
▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15); a sketch of a vibrato pitch contour follows below.
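A small sketch of a vibrato pitch contour for the generated note; the 6 Hz rate, 0.5% depth, and frame rate are illustrative values, not taken from the slides.

```python
import numpy as np

def vibrato_f0(f0, duration, frame_rate, rate_hz=6.0, depth=0.005):
    """Per-frame f0 values with sinusoidal vibrato around a centre pitch."""
    t = np.arange(int(duration * frame_rate)) / frame_rate
    return f0 * (1.0 + depth * np.sin(2 * np.pi * rate_hz * t))

f0_contour = vibrato_f0(f0=349.23, duration=2.0, frame_rate=172)  # MIDI 65
```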
Putting it all together

▶ We explored autoencoder frameworks in generative models for the audio synthesis of instrumental tones.
▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
▶ Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously.
Putting it all together

▶ Our model is still far from perfect, however, and we have to improve it further:
  1. An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural.
  2. We have not taken dynamics into account; a more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
  3. Conducting more formal listening tests, involving the synthesis of larger pitch movements or melodic elements from Indian Classical music.
▶ Our contributions:
  1. To the best of our knowledge, no prior work uses a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
References

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description

1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
Network Architecture
I Main hyperparameters -
1 β - Controls the relative weighting between reconstruction and prior enforcement:
L ∝ MSE + β · KLD
Figure: CVAE reconstruction MSE vs latent space dimensionality, for β = 0.01, 0.1, 1
I Tradeoff between both terms; we choose β = 0.1 (see the loss sketch below)
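To make the weighting concrete, here is a minimal PyTorch sketch of the β-weighted objective, assuming a Gaussian posterior parameterized by (mu, logvar); the function and variable names are illustrative, not taken from our released code:

    import torch
    import torch.nn.functional as F

    def cvae_loss(x, x_recon, mu, logvar, beta=0.1):
        # Reconstruction term: MSE between input and reconstructed CCs
        mse = F.mse_loss(x_recon, x, reduction="mean")
        # KL divergence of N(mu, sigma^2) from the standard normal prior
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return mse + beta * kld  # L ∝ MSE + β·KLD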
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - governs the network's reconstruction ability
Figure: Reconstruction MSE vs latent space dimensionality, CVAE (β = 0.1) vs AE
I Steep fall initially, flatter later; we choose dimensionality = 32
Network Architecture
I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³
I Training for 2000 epochs with batch size 512 (a model sketch follows below)
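The following PyTorch sketch shows one plausible realization of this architecture, assuming the pitch is supplied as a one-dimensional conditioning input to both encoder and decoder; the exact conditioning format in our model may differ:

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        # Layer widths follow the [91, 91, 32, 91, 91] scheme from the slides.
        def __init__(self, cc_dim=91, latent_dim=32, cond_dim=1):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, 91), nn.LeakyReLU())
            self.mu = nn.Linear(91, latent_dim)
            self.logvar = nn.Linear(91, latent_dim)
            self.dec = nn.Sequential(
                nn.Linear(latent_dim + cond_dim, 91), nn.LeakyReLU(),
                nn.Linear(91, cc_dim),
            )

        def forward(self, x, f0):
            h = self.enc(torch.cat([x, f0], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
            return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

    model = CVAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # as in the slides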
Experiments
I Two kinds of experiments to demonstrate the network's capabilities:
1 Reconstruction - Omit pitch instances during training, and see how well the model reconstructs notes of the omitted target pitch
2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches
Reconstruction
I Two training contexts -
1 T (×) is the target, X marks training instances:
MIDI  T−3  T−2  T−1   T   T+1  T+2  T+3
Kept   X    X    X    ×    X    X    X
2 Octave endpoints:
MIDI  60  61  62  63  64  65  66  67  68  69  70  71
Kept   X   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   X
I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
Reconstruction
Figure: Reconstruction MSE vs target pitch (MIDI 60-72), AE vs CVAE
I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6-8)
I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11)
I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
Figure: 'Conditional' AE (cAE) - input [CCs, f0] → cAE → reconstructed [CCs′, f0′]
I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
I The cAE is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch follows below)
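As an illustration, a minimal cAE could be a plain autoencoder over the 92-dimensional appended vector [CCs, f0]; mirroring the CVAE's layer widths here is our assumption:

    import torch.nn as nn

    # Plain autoencoder over the appended input [CCs, f0] (91 + 1 = 92 dims);
    # the bottleneck of 32 mirrors the CVAE latent dimensionality.
    cae = nn.Sequential(
        nn.Linear(92, 91), nn.LeakyReLU(),
        nn.Linear(91, 32), nn.LeakyReLU(),  # latent bottleneck
        nn.Linear(32, 91), nn.LeakyReLU(),
        nn.Linear(91, 92),                  # reconstructs [CCs', f0']
    )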
Reconstruction
Figure: Reconstruction MSE vs target pitch (MIDI 60-72), AE vs CVAE vs cAE
I We train the cAE on the octave endpoints and evaluate performance as above
I Appending f0 does not improve performance; it is in fact worse than the plain AE
Generation
I We are interested in 'synthesizing' audio
I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)
Figure: Sampling from the network - latent Z + f0 → Decoder
I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (a sketch follows below)
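A minimal sketch of such coherent sampling, assuming small Gaussian steps around the prior (the step size is our choice, not a value taken from [Blaauw and Bonada 2016]):

    import torch

    def latent_random_walk(n_frames, latent_dim=32, step=0.05):
        # Slowly varying latent trajectory: each frame takes a small
        # Gaussian step, so consecutive spectral envelopes stay coherent.
        z = torch.randn(latent_dim)
        frames = []
        for _ in range(n_frames):
            z = z + step * torch.randn(latent_dim)
            frames.append(z.clone())
        return torch.stack(frames)  # (n_frames, latent_dim)

Each sampled frame is then decoded together with the desired f0 to produce the envelope sequence.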
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12-14)
I For a more realistic synthesis, we also generate a violin note with vibrato (audio example 15)
Putting it all together
I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously
Putting it all together
I Our model is still far from perfect, however, and we have to improve it further:
1 An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural
2 We have not taken dynamics into account; a more complete system will capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music
I Our contributions:
1 To the best of our knowledge, we have not come across any prior work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an rnn for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1 Input MIDI 63 to the Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to the Parametric Model
5 Parametric reconstruction of the input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of the input
8 Parametric CVAE reconstruction of the input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of the input (endpoint-trained model)
11 Parametric CVAE reconstruction of the input (endpoint-trained model)
12 CVAE generated MIDI 65 violin note
13 Similar MIDI 65 violin note from the dataset
Audio examples description II
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE generated MIDI 65 violin note with vibrato
Our Nearest Neighbours
I A major issue with the previous methods was the frame-wise analysis-synthesis based reconstruction, which fails to model temporal evolution
I [Engel et al 2017], inspired by Wavenet's [Oord et al 2016] autoregressive modeling capabilities for speech, extended it to musical instrument synthesis
✓ Generation of realistic and creative sounds
× Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio, albeit by conditioning the waveform samples on additional parameters like pitch, velocity (loudness) and instrument class
✓ Was able to achieve better control over generation by conditioning
× Unable to generalize to untrained pitches
Why Parametric?
I Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesizing a given instrument's sound with flexible control over the pitch and loudness dynamics
I Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet 2005]
I A powerful parametric representation, rather than the raw waveform or spectrogram, has the potential to achieve high quality with less training data
1 [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results
Dataset
I Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments
I We work with the 'Violin', played at mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)
Why we chose the Violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch
Figure: Instances per note in the overall dataset (between 26 and 34 instances per MIDI note)
I The average note duration for the chosen octave is about 4.5 s per note
Dataset
I We split the data into train (80%) and test (20%) instances across the MIDI note labels
I Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances
Non-Parametric Reconstruction
I The framewise magnitude spectra reconstruction procedure of [Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - training including and excluding MIDI 63 along with its 3 neighbouring pitches on either side
Figure: AE pipeline - sustain segment → FFT → Encoder → Z → Decoder
I Repeat the above across all input spectral frames, and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984] (a sketch follows below)
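A minimal inversion sketch using librosa's Griffin-Lim implementation; the FFT size, hop length and iteration count here are illustrative assumptions, not our exact settings:

    import librosa

    def invert_magnitude(mag, n_fft=1024, hop_length=256, n_iter=60):
        # mag: reconstructed magnitude spectrogram, shape (1 + n_fft//2, frames).
        # Griffin-Lim iteratively estimates a phase that is consistent with
        # the modified magnitude STFT [Griffin and Lim 1984].
        return librosa.griffinlim(mag, n_iter=n_iter,
                                  hop_length=hop_length, win_length=n_fft)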
Non-Parametric Reconstruction
Figure: Input MIDI 63 (audio example 1); reconstruction when trained including MIDI 63 (audio example 2) and excluding MIDI 63 (audio example 3)
Parametric Model
1 Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)
Figure: Sustain segment → FFT → HpR → f0 + harmonic magnitudes
I Output of the HpR block ⇒ log-dB magnitudes + harmonic frequencies
Parametric Model
2 log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]
I TAE ⇒ iterative cepstral liftering:
A_0(k) = log |X(k)|,  V_0 = −∞
while (A_i − A_0 < ∆):
    A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
    V_i = FFT(lifter(A_i))
Figure: f0 + magnitudes → TAE → CCs + f0 (a Python sketch follows below)
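Since there is no public Python implementation of the TAE, here is a self-contained sketch of the iterative cepstral liftering under our own reading of the loop above; the stopping tolerance and the exact liftering indices are assumptions:

    import numpy as np

    def tae(log_mag, kcc=91, delta=0.1, max_iter=200):
        """True Amplitude Envelope sketch via iterative cepstral liftering.
        log_mag: log-magnitude spectrum on the positive-frequency FFT grid
        (length n_fft//2 + 1); kcc: number of cepstral coefficients kept."""
        n = 2 * (len(log_mag) - 1)            # implied FFT size
        A = log_mag.copy()                    # A_0(k) = log|X(k)|
        V = np.full_like(A, -np.inf)          # V_0 = -inf
        for _ in range(max_iter):
            A = np.maximum(A, V)              # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
            ceps = np.fft.irfft(A, n=n)       # real cepstrum of the current curve
            ceps[kcc:n - kcc + 1] = 0.0       # lifter: zero high-quefrency bins
            V = np.fft.rfft(ceps).real        # V_i = FFT(lifter(A_i))
            if np.max(A - V) < delta:         # envelope now covers all peaks
                break
        return ceps[:kcc], V                  # CCs fed to the network, and the envelope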
Parametric Model
I No open-source implementation is available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]
I The figure shows a TAE snapshot from [Caetano and Rodet 2012], similar to the results we obtain (audio examples 4 and 5)
Parametric Model
Figure: TAE spectral envelopes (magnitude in dB vs frequency, 0-5 kHz) for MIDI 60, 65 and 71
I The spectral envelope shape varies across pitch:
1 Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
2 Variation due to the TAE algorithm itself
I The TAE → a smooth function to estimate the harmonic amplitudes
Parametric Model
I The number of CCs (cepstral coefficients, Kcc) depends on the sampling rate (Fs) and the pitch (f0):
Kcc ≤ Fs / (2 f0)
I For our network, we choose the maximum Kcc (lowest pitch) = 91, and zero-pad the higher-pitch CC vectors to 91 dimensions (see the sketch below)
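For instance, at Fs = 48 kHz (an assumption about the recordings' sampling rate) and f0 ≈ 262 Hz (MIDI 60), the bound gives Kcc ≈ 91, matching the input dimension above; a small helper:

    import numpy as np

    def num_ccs(fs, f0):
        # Kcc <= Fs / (2 * f0)
        return int(fs / (2 * f0))

    def pad_ccs(ccs, target=91):
        # Zero-pad shorter (higher-pitch) CC vectors to the fixed network size
        out = np.zeros(target)
        out[:len(ccs)] = ccs[:target]
        return out

    print(num_ccs(48000, 261.63))  # -> 91 for MIDI 60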
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
[Bar chart: instances per MIDI note 60-71, ranging from 26 to 34 instances each]
Figure: Instances per note in the overall dataset
▶ The average note duration for the chosen octave is about 4.5 s per note
Dataset
▶ We split the data into train (80%) and test (20%) instances across MIDI note labels
▶ Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances
Non-Parametric Reconstruction
▶ The frame-wise magnitude spectra reconstruction procedure in [Roche et al., 2018] cannot generalize to untrained pitches
▶ We consider two cases: train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side
[Block diagram: Sustain segment → FFT → Encoder → Z → Decoder (AE)]
▶ Repeat the above across all input spectral frames, and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim, 1984]
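As a concrete illustration of this inversion step, here is a minimal sketch using librosa's Griffin-Lim routine; the hop and window sizes below are assumptions for illustration, not settings from this work.

```python
import librosa

def invert_magnitude_spectrogram(mag, n_iter=60, hop_length=256, win_length=1024):
    """Invert a magnitude spectrogram (n_fft/2 + 1, n_frames) back to a
    waveform via Griffin-Lim phase estimation [Griffin and Lim, 1984]."""
    return librosa.griffinlim(mag, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)
```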
Non-Parametric Reconstruction
Figure: Input MIDI 63 (audio example 1)
Figure: Including MIDI 63 (audio example 2)
Figure: Excluding MIDI 63 (audio example 3)
Parametric Model
1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al., 1997] (currently we neglect the residual)
[Block diagram: Sustain segment → FFT → HpR → f0, harmonic magnitudes]
▶ Output of the HpR block ⇒ log-dB magnitudes + harmonics
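To make the block's output concrete, here is a simplified stand-in for the harmonic analysis, assuming harmonic magnitudes are read off near integer multiples of f0; the actual HpR model [Serra et al., 1997] uses sinusoidal peak picking, so this is only a rough sketch.

```python
import numpy as np

def harmonic_log_magnitudes(mag_spec, f0, fs, n_fft, n_harmonics=40):
    """Rough sketch: sample |X(k)| at the bins nearest to multiples of f0
    and return log-dB magnitudes (real HpR picks spectral peaks instead)."""
    bins = np.round(np.arange(1, n_harmonics + 1) * f0 * n_fft / fs).astype(int)
    bins = bins[bins < len(mag_spec)]              # discard harmonics above Nyquist
    return 20 * np.log10(mag_spec[bins] + 1e-12)   # log-dB harmonic magnitudes
```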
Parametric Model
2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet, 2005; Imai, 1979]
▶ TAE ⇒ iterative cepstral liftering:

A_0(k) = log|X(k)|, V_0 = −∞
while (A_i − A_0 < ∆):
    A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
    V_i = FFT(lifter(A_i))
[Block diagram: f0, magnitudes → TAE → CCs, f0]
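A minimal NumPy sketch of this liftering loop is given below; the function name, the convergence test, and the iteration cap are our own illustrative choices rather than details from the slide.

```python
import numpy as np

def true_amplitude_envelope(log_mag, kcc, delta=0.1, max_iter=200):
    """Iterative cepstral liftering sketch: log_mag is the log-magnitude
    spectrum at the rFFT bins, kcc the cepstral cutoff order K_cc."""
    n = 2 * (len(log_mag) - 1)           # underlying FFT size
    A = log_mag.copy()                   # A_0(k) = log|X(k)|
    V = np.full_like(A, -np.inf)         # V_0 = -infinity
    for _ in range(max_iter):
        A = np.maximum(A, V)             # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        c = np.fft.irfft(A, n=n)         # cepstrum of the current estimate
        c[kcc + 1 : n - kcc] = 0.0       # lifter: keep only the first kcc coefficients
        V = np.fft.rfft(c).real          # V_i = FFT(lifter(A_i))
        if np.max(A - V) < delta:        # envelope now hugs the harmonic peaks
            break
    return V                             # smooth log-magnitude spectral envelope
```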
Parametric Model
▶ No open-source implementation was available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet, 2005; Caetano and Rodet, 2012]
▶ The figure shows a TAE snapshot from [Caetano and Rodet, 2012]
▶ Similar to the results we get (audio examples 4, 5)
Parametric Model
[Plot: TAE spectral envelopes, magnitude (dB, −80 to −20) vs frequency (0-5 kHz), for MIDI 60, 65 and 71]
▶ The spectral envelope shape varies across pitch:
1. Dependence of the envelope on pitch [Slawson, 1981; Caetano and Rodet, 2012]
2. Variation due to the TAE algorithm
▶ TAE → smooth function to estimate harmonic amplitudes
Parametric Model
▶ The number of CCs (cepstral coefficients, K_cc) depends on the sampling rate (F_s) and pitch (f_0):

K_cc ≤ F_s / (2 f_0)

▶ For our network, we choose the maximum K_cc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions
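For instance, the cutoff and padding can be computed as below; the 48 kHz sampling rate is an assumption chosen because it reproduces K_cc = 91 at MIDI 60, and `tae_cepstral_coeffs` is a hypothetical helper standing in for the TAE step above.

```python
import numpy as np

FS = 48000     # assumed sampling rate: 48000/(2 * 261.6) gives K_cc = 91 at MIDI 60
KCC_MAX = 91   # fixed network input dimension (K_cc of the lowest pitch)

def midi_to_hz(midi_note):
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

def padded_ccs(log_mag, midi_note):
    """Compute the pitch-dependent K_cc <= F_s / (2 f_0) and zero-pad
    the resulting cepstral coefficients to the fixed 91-dim input."""
    kcc = int(FS / (2.0 * midi_to_hz(midi_note)))   # cepstral cutoff for this pitch
    ccs = tae_cepstral_coeffs(log_mag, kcc)         # hypothetical helper (see TAE sketch)
    out = np.zeros(KCC_MAX)
    out[:min(len(ccs), KCC_MAX)] = ccs[:KCC_MAX]    # zero-pad higher pitches to 91 dims
    return out
```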
Generative Models
▶ Autoencoders [Hinton and Salakhutdinov, 2006]: minimize the MSE (mean squared error) between the input and the network reconstruction
• Simple to train, and perform good reconstruction (since they minimize MSE)
• Not truly a generative model, as you cannot generate new data
▶ Variational Autoencoders [Kingma and Welling, 2013]: inspired by variational inference, they enforce a prior on the latent space
• Truly generative, as we can 'generate' new data by sampling from the prior
▶ Conditional Variational Autoencoders [Doersch, 2016; Sohn et al., 2015]: same principle as a VAE, however they learn the distribution conditioned on an additional conditioning variable
Generative Models
Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio)
Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control
Network Architecture
[Block diagram: CCs + f0 → Encoder → Z → Decoder (CVAE)]
▶ The network input is CCs → the MSE represents a perceptually relevant distance, in terms of the squared error between the input and reconstructed log-magnitude spectral envelopes
▶ We train the network on frames from the train instances
▶ For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames
Network Architecture
▶ Main hyperparameters:
1. β: controls the relative weighting between reconstruction and prior enforcement,

L ∝ MSE + β · KLD
[Plot: MSE (·10⁻⁵) vs latent space dimensionality (0-70), for β = 0.01, 0.1, 1]
Figure: CVAE, varying β
▶ Trade-off between both terms; we choose β = 0.1
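Written out as code, the weighted objective looks roughly as follows; this is a PyTorch-style sketch, and the reduction and exact scaling are our assumptions.

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    """L proportional to MSE + beta * KLD, with the chosen beta = 0.1;
    (mu, logvar) parameterize the approximate posterior over the latent z."""
    mse = F.mse_loss(x_hat, x)
    # KL divergence from N(mu, sigma^2) to the standard normal prior
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld
```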
Network Architecture
▶ Main hyperparameters:
2. Dimensionality of the latent space: determines the network's reconstruction ability
[Plot: MSE (·10⁻⁵) vs latent space dimensionality (0-70), CVAE vs AE]
Figure: CVAE (β = 0.1) vs AE
▶ Steep fall initially, flatter later; we choose dimensionality = 32
Network Architecture
▶ Network size: [91, 91, 32, 91, 91]
▶ 91 is the dimension of the input CCs; 32 is the latent space dimensionality
▶ Linear fully connected layers + Leaky ReLU activations
▶ ADAM [Kingma and Ba, 2014] with initial lr = 10⁻³
▶ Training for 2000 epochs with batch size 512
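A sketch of this architecture in PyTorch is given below, assuming the pitch f0 is concatenated to both the encoder and decoder inputs; details beyond the slide (activation slope, exactly where f0 enters) are guesses.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """[91, 91, 32, 91, 91] network sketch: linear layers + Leaky ReLU."""
    def __init__(self, x_dim=91, hidden=91, z_dim=32, cond_dim=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + cond_dim, hidden), nn.LeakyReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x, f0):
        h = self.enc(torch.cat([x, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # ADAM, initial lr = 1e-3
```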
Experiments
▶ Two kinds of experiments to demonstrate the network's capabilities:
1. Reconstruction: omit a pitch's instances during training and see how well the model reconstructs notes of the omitted target pitch
2. Generation: how well the model 'synthesizes' note instances with new, unseen pitches
Reconstruction
▶ Two training contexts:
1. T (✗) is the target, ✓ marks training instances:
MIDI: T−3 T−2 T−1 T T+1 T+2 T+3
Kept: ✓ ✓ ✓ ✗ ✓ ✓ ✓
2. Octave endpoints:
MIDI: 60 61 62 63 64 65 66 67 68 69 70 71
Kept: ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓
▶ In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
Reconstruction
[Plot: MSE (·10⁻²) vs pitch (MIDI 60-72), AE vs CVAE]
▶ Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
▶ The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction
▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
[Block diagram: (CCs, f0) → cAE → (CCs′, f0′)]
Figure: 'Conditional' AE (cAE)
▶ [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model
▶ The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
Reconstruction
[Plot: MSE (·10⁻²) vs pitch (MIDI 60-72), AE vs CVAE vs cAE]
▶ We train the cAE on the octave endpoints and evaluate performance as shown above
▶ Appending f0 does not improve performance; it is in fact worse than the AE
Generation
▶ We are interested in 'synthesizing' audio
▶ We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)
[Block diagram: Z + f0 → Decoder]
Figure: Sampling from the network
▶ We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space
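A minimal sketch of such coherent sampling is shown below; the decoder interface and the step size are assumptions.

```python
import torch

@torch.no_grad()
def generate_frames(decoder, f0, n_frames, z_dim=32, step=0.05):
    """Random-walk latent sampling [Blaauw and Bonada, 2016]: successive
    frames use nearby z's, so the decoded envelopes evolve coherently."""
    z = torch.zeros(z_dim)                   # start at the prior mean
    cond = torch.tensor([f0])
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(z_dim)    # small coherent drift in latent space
        frames.append(decoder(torch.cat([z, cond])))
    return torch.stack(frames)               # frame-wise CCs for the desired pitch
```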
Generation
▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
▶ On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)
▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
Putting it all together
▶ We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones
▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model the inter-dependencies
▶ Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously
Putting it all together
▶ Our model is still far from perfect, and we have to improve it further:
1. An immediate task is to model the residual of the input audio; this will help make the generated audio sound more natural
2. We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3. Conducting more formal listening tests, involving the synthesis of larger pitch movements or melodic elements from Indian Classical music
▶ Our contributions:
1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2. The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR research community
References I
[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1. Input MIDI 63 to the spectral model
2. Spectral model reconstruction (trained on MIDI 63)
3. Spectral model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the parametric model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
536
Our Nearest Neighbours
I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution
I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis
X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from
I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class
X Was able to achieve better control over generation byconditioning
times Unable to generalize to untrained pitches
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
References

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description

1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
Why Parametric?

- Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesizing a given instrument's sound with flexible control over the pitch and loudness dynamics.
- Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet, 2005].
- A powerful parametric representation, over a raw waveform or spectrogram, has the potential to achieve high quality with less training data.
  1. [Blaauw and Bonada, 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results.
Dataset

- Good-sounds dataset [Romani Picas et al., 2015]: individual note and scale recordings for 12 different instruments.
- We work with the 'Violin', played at mezzo-forte loudness, and choose the 4th octave (MIDI 60–71).
- Why we chose the violin: popular in Indian music, a human voice-like timbre, and the ability to produce continuous pitch.

[Figure: Instances per note in the overall dataset — between 26 and 34 instances per MIDI note]

- The average note duration for the chosen octave is about 4.5 s per note.
Dataset

- We split the data into train (80%) and test (20%) instances across MIDI note labels.
- Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances.
Non-Parametric Reconstruction

- The frame-wise magnitude spectra reconstruction procedure of [Roche et al., 2018] cannot generalize to untrained pitches.
- We consider two cases: training including and excluding MIDI 63 along with its 3 neighbouring pitches on either side.

[Diagram: sustain segment → FFT → Encoder → Z → Decoder (AE)]

- The above is repeated across all input spectral frames, and the resulting magnitude spectrogram is inverted using Griffin-Lim [Griffin and Lim, 1984], as sketched below.
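As a concrete illustration of this baseline, a minimal sketch of the analysis/inversion round trip, assuming librosa is available (the file name, FFT size and hop length are illustrative, not necessarily the values used in the work):

import numpy as np
import librosa

# Load a note recording (hypothetical file); keep the native sampling rate.
y, fs = librosa.load("violin_midi63.wav", sr=None)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# An AE would encode/decode each magnitude frame here; S is passed through
# unchanged as a stand-in for the decoder output.
S_hat = S

# Griffin-Lim iteratively estimates the missing phase and inverts the STFT.
y_hat = librosa.griffinlim(S_hat, n_iter=60, hop_length=256)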
Non-Parametric Reconstruction

[Figure: Input MIDI 63 (audio example 1)]
[Figure: Reconstruction with MIDI 63 included in training (audio example 2)] [Figure: Reconstruction with MIDI 63 excluded from training (audio example 3)]
Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al., 1997] (currently we neglect the residual).

[Diagram: sustain segment → FFT → HpR → f0, harmonic magnitudes]

- Output of the HpR block ⇒ log-dB magnitudes + harmonics; a sketch of this sampling step follows.
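A minimal sketch of the harmonic-sampling step (nearest-bin lookup stands in for the peak interpolation of a full HpR analysis; the function name and harmonic count are illustrative assumptions):

import numpy as np

def harmonic_log_magnitudes(mag_frame, f0, fs, n_fft, n_harm=40):
    # Sample the FFT magnitude frame at integer multiples of f0,
    # keeping only harmonics below the Nyquist frequency.
    freqs = np.arange(1, n_harm + 1) * f0
    freqs = freqs[freqs < fs / 2]
    bins = np.round(freqs * n_fft / fs).astype(int)
    mags_db = 20 * np.log10(np.maximum(mag_frame[bins], 1e-10))
    return freqs, mags_db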
Parametric Model

2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet, 2005, Imai, 1979].
- TAE ⇒ iterative cepstral liftering:

  A_0(k) = log |X(k)|,  V_0 = −∞
  while (A_i − A_0 < Δ):
      A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
      V_i = FFT(lifter(A_i))

[Diagram: f0, harmonic magnitudes → TAE → CCs, f0]
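A minimal NumPy sketch of the iterative cepstral liftering above (the spectrum mirroring, the stopping rule and the Δ value are assumptions filled in around the slide's pseudo-code):

import numpy as np

def cepstral_smooth(A_half, n_ccs):
    # Mirror the half log-spectrum, low-pass lifter its cepstrum,
    # and return the smoothed half spectrum V.
    A_full = np.concatenate([A_half, A_half[-2:0:-1]])
    c = np.fft.ifft(A_full).real
    c[n_ccs:len(c) - n_ccs + 1] = 0.0   # keep low-quefrency coefficients
    return np.fft.fft(c).real[: len(A_half)]

def true_amplitude_envelope(A0, n_ccs, delta=0.1, max_iter=200):
    # A_i = max(A_{i-1}, V_{i-1}); V_i = cepstral smoothing of A_i.
    # Iterate until the envelope lies within delta of the spectral peaks.
    A, V = A0.copy(), np.full_like(A0, -np.inf)
    for _ in range(max_iter):
        A = np.maximum(A, V)
        V = cepstral_smooth(A, n_ccs)
        if np.all(V >= A0 - delta):
            break
    return V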
Parametric Model

- No open-source implementation is available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet, 2005, Caetano and Rodet, 2012].

[Figure: TAE snapshot from [Caetano and Rodet, 2012]]

- Similar to the results we get (audio examples 4, 5).
Parametric Model

[Figure: TAE spectral envelopes for MIDI 60, 65 and 71 — magnitude (dB) versus frequency (kHz)]

- The spectral envelope shape varies across pitch:
  1. Dependence of the envelope on pitch [Slawson, 1981, Caetano and Rodet, 2012].
  2. Variation due to the TAE algorithm.
- The TAE → a smooth function to estimate the harmonic amplitudes.
Parametric Model

- The number of CCs (cepstral coefficients, K_cc) depends on the sampling rate (F_s) and the pitch (f_0):

  K_cc ≤ F_s / (2 f_0)

- For our network we choose the maximum K_cc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91-dimensional vectors.
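In code, assuming a 48 kHz sampling rate (an assumption consistent with K_cc = 91 at MIDI 60, though the slides do not state F_s):

import numpy as np

fs = 48000                              # assumed sampling rate
f0_low = 440.0 * 2 ** ((60 - 69) / 12)  # MIDI 60 ≈ 261.63 Hz, lowest pitch used
K_max = int(fs / (2 * f0_low))          # K_cc <= F_s / (2 f_0)  ->  91

def pad_ccs(ccs, k_max=91):
    # Zero-pad the shorter CC vectors of higher pitches to the fixed length.
    out = np.zeros(k_max)
    out[: len(ccs)] = ccs
    return out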
Generative Models

- Autoencoders [Hinton and Salakhutdinov, 2006]: minimize the MSE (mean squared error) between the input and the network's reconstruction.
  - Simple to train, and perform good reconstruction (since they minimize the MSE).
  - Not truly generative, as you cannot generate new data.
- Variational Autoencoders [Kingma and Welling, 2013]: inspired by variational inference; enforce a prior on the latent space.
  - Truly generative, as we can 'generate' new data by sampling from the prior.
- Conditional Variational Autoencoders [Doersch, 2016, Sohn et al., 2015]: same principle as a VAE, but learn the distribution conditioned on an additional conditioning variable.
Generative Models

Why a VAE over an AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why a CVAE over a VAE?
Conditioning on pitch ⇒ the network captures the dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control.
Network Architecture

[Diagram: CCs, f0 → Encoder → Z → Decoder (CVAE)]

- The network input is the CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log magnitude spectral envelopes.
- We train the network on frames from the train instances.
- For evaluation, the MSE values reported ahead are the average reconstruction error across all test-instance frames.
Network Architecture

- Main hyperparameters:
  1. β controls the relative weighting between reconstruction and prior enforcement:

     L ∝ MSE + β · KLD

[Figure: MSE (·10^−5) versus latent space dimensionality for β = 0.01, 0.1, 1 (CVAE)]

- There is a tradeoff between the two terms; we choose β = 0.1 (see the sketch below).
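A sketch of this loss in PyTorch, assuming the usual Gaussian-posterior outputs mu and logvar (the mean reduction is an assumption):

import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    # L ∝ MSE + β·KLD, with the closed-form KL divergence between
    # N(mu, diag(exp(logvar))) and the standard normal prior.
    mse = F.mse_loss(x_hat, x)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld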
Network Architecture

- Main hyperparameters:
  2. Dimensionality of the latent space governs the network's reconstruction ability.

[Figure: MSE (·10^−5) versus latent space dimensionality, CVAE (β = 0.1) versus AE]

- Steep fall initially, flatter later; we choose dimensionality = 32.
Network Architecture

- Network size: [91, 91, 32, 91, 91].
  - 91 is the dimension of the input CCs; 32 is the latent space dimensionality.
- Linear fully connected layers + Leaky ReLU activations.
- ADAM [Kingma and Ba, 2014] with initial learning rate 10^−3.
- Training for 2000 epochs with batch size 512; a sketch follows.
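Assembling the stated sizes, a hedged PyTorch sketch of the CVAE (how f0 is encoded for conditioning is not specified on the slides; a scalar conditioning input is assumed here):

import torch
import torch.nn as nn

class CVAE(nn.Module):
    # [91 -> 91 -> 32 -> 91 -> 91], linear layers + Leaky ReLU,
    # with the pitch f0 appended to both encoder and decoder inputs.
    def __init__(self, n_in=91, n_hid=91, n_lat=32, n_cond=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in + n_cond, n_hid), nn.LeakyReLU())
        self.mu = nn.Linear(n_hid, n_lat)
        self.logvar = nn.Linear(n_hid, n_lat)
        self.dec = nn.Sequential(
            nn.Linear(n_lat + n_cond, n_hid),
            nn.LeakyReLU(),
            nn.Linear(n_hid, n_in),
        )

    def forward(self, x, f0):
        h = self.enc(torch.cat([x, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr 10^-3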
Experiments

- Two kinds of experiments demonstrate the network's capabilities:
  1. Reconstruction: omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch.
  2. Generation: how well the model 'synthesizes' note instances with new, unseen pitches.
Reconstruction

- Two training contexts:

  1. T (×) is the target; ✓ marks the training instances:

     MIDI:  T−3  T−2  T−1  T  T+1  T+2  T+3
     Kept:   ✓    ✓    ✓   ×   ✓    ✓    ✓

  2. Octave endpoints:

     MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
     Kept:   ✓   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   ✓

- In each case we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances.
Reconstruction

[Figure: MSE (·10^−2) versus pitch for the AE and the CVAE, octave-endpoint training]

- Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6–8).
- The CVAE produces a better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9–11).
- Conditioning helps capture the pitch dependency of the spectral envelope more accurately.
Reconstruction

- To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as sketched below.

[Diagram: (CCs, f0) → cAE → (CCs′, f0′) — the 'conditional' AE (cAE)]

- [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model.
- The cAE is comparable to our proposed CVAE in that the network might potentially learn something from f0.
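A sketch of this baseline under the same assumptions (a plain autoencoder over the 92-dimensional appended vector; the layer sizes mirror the CVAE sketch and are assumptions):

import torch
import torch.nn as nn

class CAE(nn.Module):
    # Plain AE over [CCs, f0]: the appended input is compressed and
    # reconstructed, with no explicit conditioning mechanism.
    def __init__(self, n_in=92, n_hid=91, n_lat=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, n_hid), nn.LeakyReLU(),
                                 nn.Linear(n_hid, n_lat))
        self.dec = nn.Sequential(nn.Linear(n_lat, n_hid), nn.LeakyReLU(),
                                 nn.Linear(n_hid, n_in))

    def forward(self, x_and_f0):
        return self.dec(self.enc(x_and_f0))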
Reconstruction

[Figure: MSE (·10^−2) versus pitch for the AE, the CVAE and the cAE]

- We train the cAE on the octave endpoints and evaluate its performance as above.
- Appending f0 does not improve performance; it is in fact worse than the plain AE.
Generation

- We are interested in 'synthesizing' audio.
- We examine how well the network can generate an instance of a desired pitch (one the network has not been trained on).

[Figure: Sampling from the network — Z → Decoder, conditioned on f0]

- We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space (see the sketch below).
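A sketch of that latent random walk (the step size is an assumption; the trained decoder from the CVAE sketch would consume the resulting latents together with the target f0):

import torch

def random_walk_latents(n_frames, n_lat=32, step=0.05):
    # Start from a prior sample and take small Gaussian steps so that
    # consecutive frames decode to coherent spectral envelopes.
    z = torch.zeros(n_frames, n_lat)
    z[0] = torch.randn(n_lat)
    for t in range(1, n_frames):
        z[t] = z[t - 1] + step * torch.randn(n_lat)
    return z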
Generation

- We train on instances across the entire octave except MIDI 65, and then generate MIDI 65.
- On listening, the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12–14).
- For a more realistic synthesis, we also generate a violin note with vibrato (audio example 15).
Putting it all together

- We explored autoencoder frameworks in generative models for the audio synthesis of instrumental tones.
- Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
- Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously.
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
636
Why Parametric
I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics
I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]
I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data
1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
Putting it all together

• We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones.
• Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
• Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously (the synthesis step is sketched below).
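• As an illustration of that final step, a minimal sketch that samples a decoded envelope at harmonics of f0 and renders one frame additively; the frame length and harmonic count are our choices, and the residual is omitted, as in the current model:

import numpy as np

def synth_frame(envelope_db, f0, fs=44100, frame_len=1024, n_harm=40):
    # Sample the decoded log-magnitude envelope at harmonics of f0,
    # convert to linear amplitudes, and sum sinusoids for one frame.
    freqs = f0 * np.arange(1, n_harm + 1)
    freqs = freqs[freqs < fs / 2.0]                    # keep below Nyquist
    bin_freqs = np.linspace(0.0, fs / 2.0, len(envelope_db))
    amps = 10.0 ** (np.interp(freqs, bin_freqs, envelope_db) / 20.0)
    t = np.arange(frame_len) / fs
    return sum(a * np.cos(2.0 * np.pi * f * t) for a, f in zip(amps, freqs))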
Putting it all together

• Our model is still far from perfect, however, and we have to improve it further:
  1. An immediate step is to model the residual of the input audio; this will make the generated audio sound more natural.
  2. We have not taken dynamics into account. A more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
  3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.

• Our contributions:
  1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research (a minimal sketch follows below).
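• For reference, a minimal NumPy sketch of the iterative cepstral liftering (TAE) loop described earlier in the deck, with a fixed iteration count standing in for the convergence threshold ∆:

import numpy as np

def true_amplitude_envelope(log_mag, k_cc, n_iter=50):
    # Iterative cepstral liftering: grow a smooth envelope V that lies on
    # or above the harmonic peaks of the frame's log-magnitude spectrum.
    # log_mag: half-spectrum log magnitudes; k_cc: cepstral order (e.g. 91).
    A = log_mag.copy()
    V = np.full_like(A, -np.inf)
    for _ in range(n_iter):
        A = np.maximum(A, V)          # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        ceps = np.fft.irfft(A)        # cepstrum of the current curve
        ceps[k_cc:-k_cc] = 0.0        # lifter: zero high-quefrency bins
        V = np.fft.rfft(ceps).real    # V_i = FFT(lifter(A_i))
    return V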
References

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.
[Imai 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description

1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
736
Dataset
I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments
I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)
Why we chose ViolinPopular in Indian Music Human voice-like timbre
Ability to produce continuous pitch
0 5 10 15 20 25 30 35
606162636465666768697071
282626262626
343030
262626
Instances
MID
I
Figure Instances per note in the overall dataset
I Average note duration for the chosen octave is about 45s pernote
836
Dataset
I We split the data to train(80) and test(20) instances acrossMIDI note labels
I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
Reconstruction
I Two training contexts (see the sketch below):
1 Immediate neighbours — T (×) is the target, X marks training instances:
MIDI: T−3 T−2 T−1 T T+1 T+2 T+3
Kept:  X   X   X  ×  X   X   X
2 Octave endpoints:
MIDI: 60 61 62 63 64 65 66 67 68 69 70 71
Kept:  X  ×  ×  ×  ×  ×  ×  ×  ×  ×  ×  X
I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances.
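A small sketch of how these two train/target splits over the octave might be constructed (the helpers are illustrative, not from the paper's code):

```python
def neighbour_context(target, k=3):
    """Train on the k pitches on either side of the target; hold the target out."""
    train = [p for p in range(target - k, target + k + 1) if p != target]
    return train, [target]

def endpoint_context(lo=60, hi=71):
    """Train only on the octave endpoints; all interior pitches are targets."""
    return [lo, hi], list(range(lo + 1, hi))
```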
Reconstruction

Figure: AE vs. CVAE — reconstruction MSE (×10⁻²) per target pitch (MIDI 60–72).
I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8).
I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data, as in the plot above (audio examples 9, 10, 11).
I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately.
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below:

[f0, CCs] → cAE → [f0′, CCs′]

Figure: 'Conditional' AE (cAE).
I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model.
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0.
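A one-line sketch of this input construction, assuming `ccs` is a batch of 91-dimensional CC vectors and `f0` a batch of scalar pitches (names are illustrative):

```python
import torch

def cae_io(ccs, f0):
    """cAE input and target: the pitch appended to the CC vector (91 + 1 dims)."""
    x = torch.cat([ccs, f0.unsqueeze(-1)], dim=-1)  # shape: (batch, 92)
    return x, x  # the cAE reconstructs the appended input itself
```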
Reconstruction

Figure: AE vs. CVAE vs. cAE — reconstruction MSE (×10⁻²) per target pitch (MIDI 60–72).
I We train the cAE on the octave endpoints and evaluate performance as shown above.
I Appending f0 does not improve performance; it is in fact worse than the plain AE.
Generation
I We are interested in 'synthesizing' audio.
I We examine how well the network can generate an instance of a desired pitch (one the network has not been trained on).

[Z, f0] → Decoder

Figure: Sampling from the network.
I We follow the procedure in [Blaauw and Bonada 2016]: a random walk to sample points coherently from the latent space.
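A minimal sketch of such coherent sampling, assuming a trained `decoder` that takes a latent point concatenated with f0 and returns one CC frame; the step size and names are illustrative assumptions:

```python
import torch

@torch.no_grad()
def random_walk_frames(decoder, f0_frames, n_latent=32, step=0.05):
    """Generate CC frames by walking the latent space in small coherent steps."""
    z = torch.zeros(n_latent)
    out = []
    for f0 in f0_frames:
        z = z + step * torch.randn(n_latent)  # small step from the previous point
        out.append(decoder(torch.cat([z, f0.view(1)]).unsqueeze(0)))
    return torch.cat(out)  # (n_frames, 91) cepstral coefficients
```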
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14).
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15).
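A sketch of the vibrato pitch contour that could drive the decoder frame by frame; the 6 Hz rate, 30-cent depth, and 10 ms hop are illustrative assumptions, not values from our experiments:

```python
import numpy as np

def vibrato_f0(f0_hz, dur_s, rate_hz=6.0, depth_cents=30.0, hop_s=0.01):
    """Per-frame f0 contour: sinusoidal vibrato around a centre pitch."""
    t = np.arange(0.0, dur_s, hop_s)
    cents = depth_cents * np.sin(2 * np.pi * rate_hz * t)
    return f0_hz * 2.0 ** (cents / 1200.0)  # one f0 value per analysis frame

f0_contour = vibrato_f0(349.23, dur_s=1.0)  # MIDI 65 is approximately 349.23 Hz
```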
Putting it all together
I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones.
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously.
Putting it all together
I Our model is still far from perfect, however, and we have to improve it further:
1 An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural.
2 We have not taken dynamics into account; a more complete system would capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.
I Our contributions:
1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset

Audio examples description II
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of the latent space - governs the network's reconstruction ability
Figure: CVAE (β = 0.1) vs AE — MSE (·10⁻⁵) vs latent space dimensionality (0–70)
I Steep fall initially, flatter later; choose dimensionality = 32
Network Architecture
I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³
I Training for 2000 epochs with batch size 512
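A minimal sketch of this architecture follows (assuming PyTorch; concatenating f0 to both the encoder input and the latent code is one common CVAE wiring, and an assumption on our part):

import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, cc_dim=91, hid=91, z_dim=32, cond_dim=1):
        super().__init__()
        # Encoder: (CCs, f0) -> 91 -> (mu, logvar) of the 32-dim latent
        self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, hid), nn.LeakyReLU())
        self.mu = nn.Linear(hid, z_dim)
        self.logvar = nn.Linear(hid, z_dim)
        # Decoder: (z, f0) -> 91 -> reconstructed CCs
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, hid), nn.LeakyReLU(),
                                 nn.Linear(hid, cc_dim))

    def forward(self, cc, f0):
        h = self.enc(torch.cat([cc, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # 2000 epochs, batch 512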
Experiments
I Two kinds of experiments to demonstrate the network's capabilities:
1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch
2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches
Reconstruction
I Two training contexts -
1 T (×) is the target; ✓ marks the training instances kept:
MIDI: T−3 T−2 T−1  T  T+1 T+2 T+3
Kept:  ✓   ✓   ✓   ×   ✓   ✓   ✓
2 Octave endpoints:
MIDI: 60 61 62 63 64 65 66 67 68 69 70 71
Kept:  ✓  ×  ×  ×  ×  ×  ×  ×  ×  ×  ×  ✓
I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
Reconstruction
Figure: MSE (·10⁻²) vs target pitch (MIDI 60–72) for AE and CVAE
I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above) (audio examples 9, 10, 11)
I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
Figure: 'Conditional' AE (cAE) — the input (CCs, f0) is autoencoded to (CCs′, f0′)
I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
Reconstruction
Figure: MSE (·10⁻²) vs target pitch (MIDI 60–72) for AE, CVAE and cAE
I We train the cAE on the octave endpoints and evaluate performance as shown above
I Appending f0 does not improve performance; it is in fact worse than the plain AE
Generation
I We are interested in 'synthesizing' audio
I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)
Figure: Sampling from the network — a latent point Z is passed through the Decoder, conditioned on f0
I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
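A sketch of that sampling loop, reusing the CVAE sketch from earlier (the step size is illustrative; [Blaauw and Bonada 2016] give the actual procedure):

import torch

@torch.no_grad()
def random_walk_generate(model, f0, n_frames=200, z_dim=32, step=0.05):
    z = torch.randn(z_dim)                     # start from a prior sample
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(z_dim)      # small steps -> coherent trajectory
        frames.append(model.dec(torch.cat([z, f0])))  # decode, conditioned on f0
    return torch.stack(frames)                 # frame-wise CCs for the new note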
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
Putting it all together
I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously
Putting it all together
I Our model is still far from perfect, however, and we have to improve it further:
1 An immediate thing to do will be to model the residual of the input audio; this will help in making the generated audio sound more natural
2 We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music
I Our contributions:
1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE
2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research
References
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description
1 Input MIDI 63 to Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from the dataset
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
Non-Parametric Reconstruction
I The framewise magnitude-spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side
Figure: AE over spectral frames — sustain-portion FFT magnitude frames pass through the Encoder to Z, and the Decoder reconstructs them
I Repeat the above across all input spectral frames, and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
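The inversion step might look like the following sketch (assuming librosa; the STFT parameters here are placeholders, not the ones used in the experiments):

import numpy as np
import librosa

def invert_magnitude_frames(mag_frames, n_fft=1024, hop_length=256):
    # Stack the reconstructed magnitude frames into a (bins, time) spectrogram
    S = np.abs(np.asarray(mag_frames)).T
    # Griffin-Lim [Griffin and Lim 1984]: iteratively estimate a phase
    # consistent with the magnitudes, then resynthesize the waveform
    return librosa.griffinlim(S, n_iter=60, hop_length=hop_length, win_length=n_fft)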
Non-Parametric Reconstruction
Figure: Input MIDI 63 (audio example 1)
Figure: Reconstruction including MIDI 63 in training (audio example 2); Figure: Reconstruction excluding MIDI 63 (audio example 3)
Parametric Model
1 Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)
Figure: Sustain-portion frames → FFT → HpR → f0 + harmonic magnitudes
I Output of the HpR block ⇒ log-dB magnitudes + harmonics
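The actual HpR analysis comes from the sinusoidal-modeling literature [Serra et al 1997]; purely as a toy sketch, reading log-dB magnitudes near the harmonic locations of one windowed frame could look like this (FFT size and sampling rate are assumptions; real HpR uses peak detection with interpolation, and keeps a residual):

import numpy as np

def harmonic_log_mags(frame, f0, fs=48000, n_fft=2048):
    spec = np.abs(np.fft.rfft(frame, n_fft)) + 1e-12   # magnitude spectrum
    n_harm = int((fs / 2) // f0)                       # harmonics below Nyquist
    bins = np.round(np.arange(1, n_harm + 1) * f0 * n_fft / fs).astype(int)
    return 20 * np.log10(spec[bins]), bins             # log-dB mags + bin indices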
Parametric Model
2 log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]
I TAE ⇒ iterative cepstral liftering:
A0(k) = log|X(k)|, V0 = −∞
while (Ai − A0 < ∆):
    Ai(k) = max(Ai−1(k), Vi−1(k))
    Vi = FFT(lifter(Ai))
Figure: f0 + magnitudes → TAE → CCs + f0
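In code, the loop above could be sketched as follows (NumPy; we stop after a fixed number of iterations instead of the ∆ test, and the input is one frame's symmetric log-magnitude spectrum):

import numpy as np

def tae(log_mag, k_cc, n_iter=100):
    A = np.asarray(log_mag, dtype=float).copy()  # A0(k) = log|X(k)|
    V = np.full_like(A, -np.inf)                 # V0 = -infinity
    for _ in range(n_iter):
        A = np.maximum(A, V)                     # Ai(k) = max(Ai-1(k), Vi-1(k))
        c = np.fft.ifft(A).real                  # cepstrum of the updated spectrum
        c[k_cc + 1: len(c) - k_cc] = 0.0         # lifter: keep low quefrencies only
        V = np.fft.fft(c).real                   # Vi = FFT(lifter(Ai))
    return V, c[:k_cc + 1]                       # smooth envelope + the CCs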
Parametric Model
I No open-source implementation is available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]
I The figure shows a TAE snapshot from [Caetano and Rodet 2012]
I It is similar to the results we get (audio examples 4, 5)
Parametric Model
Figure: TAE spectral envelopes — Magnitude (dB, −80 to −20) vs Frequency (kHz, 0–5) for MIDI pitches 60, 65 and 71
I The spectral envelope shape varies across pitch:
1 Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
2 Variation due to the TAE algorithm
I TAE → a smooth function to estimate harmonic amplitudes
Parametric Model
I The number of CCs (Cepstral Coefficients, Kcc) depends on the sampling rate (Fs) and the pitch (f0):
Kcc ≤ Fs / (2·f0)
I For our network, we choose the maximum Kcc (lowest pitch) = 91, and zero-pad the higher-pitch CC vectors to 91 dimensions
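As a concrete check (assuming Fs = 48 kHz, under which MIDI 60 at ≈261.6 Hz gives exactly Kcc = 91):

import numpy as np

FS = 48000                                  # assumed sampling rate
f0 = 440.0 * 2 ** ((60 - 69) / 12)          # MIDI 60 -> ~261.6 Hz
k_cc = int(FS // (2 * f0))                  # Kcc <= Fs / (2*f0) -> 91
cc = np.zeros(k_cc)                         # stand-in for the liftered CCs
cc_padded = np.pad(cc, (0, 91 - k_cc))      # higher pitches zero-padded to 91 dims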
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
936
Non-Parametric Reconstruction
I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches
I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side
Sustain
FFT
En
cod
er
Z
Dec
od
er
AE
I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
1036
Non-Parametric Reconstruction
Figure Input MIDI 63 1 1
Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
1728
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false
1728
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
Network Architecture
▶ Main hyperparameters:
2. Dimensionality of the latent space: governs the network's reconstruction ability.

Figure: Reconstruction MSE (of order 10⁻⁵) vs latent space dimensionality (0-70), CVAE (β = 0.1) vs AE.
▶ Steep fall initially, flatter later; we choose dimensionality = 32.
Network Architecture
▶ Network size: [91, 91, 32, 91, 91].
▶ 91 is the dimension of the input CCs; 32 is the latent space dimensionality.
▶ Linear fully connected layers + Leaky ReLU activations.
▶ ADAM [Kingma and Ba 2014] with initial learning rate 10⁻³.
▶ Training for 2000 epochs with batch size 512. A sketch of such a network follows.
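A sketch of such a CVAE in PyTorch, using the layer sizes above; conditioning by concatenating f0 to both the encoder and decoder inputs is our reading of the block diagram, and the exact wiring may differ from the authors' code:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, cc_dim=91, latent_dim=32, cond_dim=1):
        super().__init__()
        # Encoder: [91 CCs (+ f0)] -> 91 -> (mu, logvar) of the 32-d latent
        self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, 91), nn.LeakyReLU())
        self.mu = nn.Linear(91, latent_dim)
        self.logvar = nn.Linear(91, latent_dim)
        # Decoder: [32-d latent (+ f0)] -> 91 -> 91 reconstructed CCs
        self.dec = nn.Sequential(nn.Linear(latent_dim + cond_dim, 91),
                                 nn.LeakyReLU(), nn.Linear(91, cc_dim))

    def forward(self, ccs, f0):
        h = self.enc(torch.cat([ccs, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr = 1e-3
```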
Experiments
▶ Two kinds of experiments to demonstrate the network's capabilities:
1. Reconstruction: omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch.
2. Generation: how well the model 'synthesizes' note instances with new, unseen pitches.
Reconstruction
▶ Two training contexts:
1. Neighbouring pitches: T (✗) is the target, ✓ marks training instances:

   MIDI: T−3  T−2  T−1   T   T+1  T+2  T+3
   Kept:  ✓    ✓    ✓    ✗    ✓    ✓    ✓

2. Octave endpoints:

   MIDI: 60  61  62  63  64  65
   Kept:  ✓   ✗   ✗   ✗   ✗   ✗
   MIDI: 66  67  68  69  70  71
   Kept:  ✗   ✗   ✗   ✗   ✗   ✓

▶ In each case we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances. (A toy split helper is sketched below.)
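As an illustration, a toy helper that picks the training pitches for each context (function and variable names are ours, not from the paper):

```python
def training_pitches(context, target=63, low=60, high=71):
    """Return the MIDI pitches kept for training in the two contexts above."""
    pitches = list(range(low, high + 1))
    if context == "neighbours":   # keep T±1..T±3, omit the target T itself
        return [p for p in pitches if p != target and abs(p - target) <= 3]
    if context == "endpoints":    # keep only the octave endpoints
        return [low, high]
    raise ValueError(context)
```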
Reconstruction
Figure: Reconstruction MSE (of order 10⁻²) vs target pitch (MIDI 60-72) for AE and CVAE.
▶ Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6-8).
▶ The CVAE produces better reconstructions, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11).
▶ Conditioning helps capture the pitch dependency of the spectral envelope more accurately.
Reconstruction
▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below (and sketched in code after this slide).
Figure: 'Conditional' AE (cAE). The concatenated input [CCs, f0] is autoencoded to [CCs′, f0′].
▶ [Wyse 2018] followed a similar approach of appending the conditioning variables to the input of his model.
▶ The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0.
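A minimal sketch of the cAE idea (layer sizes assumed to mirror the AE; not the authors' exact code):

```python
import torch
import torch.nn as nn

class ConditionalAE(nn.Module):
    """Plain autoencoder over the concatenated vector [CCs, f0]."""
    def __init__(self, cc_dim=91, latent_dim=32):
        super().__init__()
        d = cc_dim + 1  # CCs plus the appended pitch value
        self.enc = nn.Sequential(nn.Linear(d, d), nn.LeakyReLU(),
                                 nn.Linear(d, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, d), nn.LeakyReLU(),
                                 nn.Linear(d, d))

    def forward(self, ccs, f0):
        x = torch.cat([ccs, f0], dim=-1)  # append f0 to the input CCs
        return self.dec(self.enc(x))      # reconstruct [CCs', f0']
```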
Reconstruction
Figure: Reconstruction MSE (of order 10⁻²) vs target pitch (MIDI 60-72) for AE, CVAE and cAE.
▶ We train the cAE on the octave endpoints and evaluate performance as shown above.
▶ Appending f0 does not improve performance; the cAE is in fact worse than the plain AE.
Generation
▶ We are interested in 'synthesizing' audio.
▶ We test how well the network can generate an instance of a desired pitch (one the network has not been trained on).
Figure: Sampling from the network. A latent point Z, together with the conditioning pitch f0, is fed to the Decoder.
▶ We follow the procedure in [Blaauw and Bonada 2016]: a random walk to sample points coherently from the latent space (sketched below).
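One plausible reading of that sampling procedure as code (the step size, starting point and decoder interface are assumptions layered on the CVAE sketch above):

```python
import torch

def random_walk_generate(model, f0_frames, latent_dim=32, step=0.05):
    """Decode a sequence of latent points that drift coherently over frames."""
    z = torch.zeros(latent_dim)                 # start at the prior mean
    frames = []
    for f0 in f0_frames:                        # one conditioning pitch per frame
        z = z + step * torch.randn(latent_dim)  # small random-walk step
        ccs = model.dec(torch.cat([z, f0], dim=-1))
        frames.append(ccs)
    return torch.stack(frames)                  # frame-wise envelope CCs
```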
Generation
▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
▶ On listening, the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12-14).
▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15).
Putting it all together
▶ We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones.
▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
▶ Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously.
Putting it all together
▶ Our model is still far from perfect, however, and we have to improve it further:
1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural.
2. We have not taken dynamics into account; a more complete system would capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
3. We plan to conduct more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.
▶ Our contributions:
1. To the best of our knowledge, we have not come across any prior work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1. Input MIDI 63 to the spectral model
2. Spectral model reconstruction (trained on MIDI 63)
3. Spectral model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the parametric model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset

Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
Non-Parametric Reconstruction
Figure: Input MIDI 63 (audio example 1).
Figure: Reconstruction when training includes MIDI 63 (audio example 2). Figure: Reconstruction when training excludes MIDI 63 (audio example 3).
Parametric Model
1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual).
Figure: Sustain segment → FFT → HpR → f0 + harmonic magnitudes.
▶ Output of the HpR block ⇒ log-dB magnitudes + harmonics.
Parametric Model
2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979].
▶ TAE ⇒ iterative cepstral liftering:

   A₀(k) = log(|X(k)|), V₀ = −∞
   while (Aᵢ − A₀ < Δ):
       Aᵢ(k) = max(Aᵢ₋₁(k), Vᵢ₋₁(k))
       Vᵢ = FFT(lifter(Aᵢ))

Figure: f0 + magnitudes → TAE → CCs and f0. (A Python sketch follows.)
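Since no Python implementation of the TAE was available (we open-source ours), a minimal NumPy sketch of the iteration above might look like this; the lifter cutoff, iteration cap and stopping rule are our assumptions from the pseudocode:

```python
import numpy as np

def true_amplitude_envelope(log_mag, k_cc, max_iter=100, delta=0.1):
    """Iterative cepstral liftering (TAE) on a log-magnitude spectrum A0(k)."""
    A0 = np.asarray(log_mag, dtype=float)
    A = A0.copy()
    V = np.full_like(A0, -np.inf)
    ceps = None
    for _ in range(max_iter):
        A = np.maximum(A, V)               # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        ceps = np.fft.irfft(A)             # real cepstrum of the current estimate
        ceps[k_cc:ceps.size - k_cc] = 0.0  # lifter: keep only low quefrencies
        V = np.fft.rfft(ceps).real         # V_i = FFT(lifter(A_i))
        if np.max(A0 - V) < delta:         # stop once the envelope covers the peaks
            break
    return V, ceps[:k_cc]                  # smooth envelope and its CCs
```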
Parametric Model
▶ No open-source implementation was available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012].
▶ The figure shows a TAE snapshot from [Caetano and Rodet 2012].
▶ It is similar to the results we get (audio examples 4-5).
Parametric Model
Figure: TAE spectral envelopes, magnitude (dB, −80 to −20) vs frequency (0-5 kHz), for MIDI pitches 60, 65 and 71.
▶ The spectral envelope shape varies across pitch:
1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012].
2. Variation due to the TAE algorithm.
▶ TAE → a smooth function to estimate harmonic amplitudes.
Parametric Model
▶ The number of CCs (cepstral coefficients, Kcc) depends on the sampling rate (Fs) and the pitch (f0):

   Kcc ≤ Fs / (2 f0)

▶ For our network we choose the maximum Kcc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions, as sketched below.
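A quick illustration of this choice (the sampling rate and pitch in the comment are example values, not from the deck):

```python
import numpy as np

def padded_ccs(ccs, fs, f0, max_dim=91):
    """Truncate CCs to the pitch-dependent cutoff Kcc = Fs/(2*f0), then zero-pad."""
    k_cc = int(fs // (2 * f0))    # e.g. fs=44100, f0=261.6 Hz -> Kcc = 84
    ccs = np.asarray(ccs)[:k_cc]
    return np.pad(ccs, (0, max_dim - len(ccs)))  # fixed 91-d network input
```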
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - governs the network's reconstruction ability
[Figure: reconstruction MSE (·10⁻⁵) vs latent space dimensionality, CVAE (β = 0.1) vs AE]
I Steep fall initially, flatter later; we choose dimensionality = 32
Network Architecture
I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³
I Training for 2000 epochs with batch size 512; a training-loop sketch follows
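Putting those settings together, a minimal training-loop sketch; `train_ccs` and `train_f0` are hypothetical tensors of frame-wise CCs and pitch labels assumed to be prepared elsewhere:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = CVAE()  # from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(train_ccs, train_f0),  # hypothetical prepared tensors
                    batch_size=512, shuffle=True)

for epoch in range(2000):
    for ccs, f0 in loader:
        x_hat, mu, logvar = model(ccs, f0)
        loss = cvae_loss(ccs, x_hat, mu, logvar, beta=0.1)
        opt.zero_grad()
        loss.backward()
        opt.step()
```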
Experiments
I Two kinds of experiments to demonstrate the network's capabilities:
1 Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch
2 Generation - how well the model 'synthesizes' note instances with new, unseen pitches
Reconstruction
I Two training contexts (sketched in code below) -
1 Neighbouring pitches: T is the target (✗ = omitted, ✓ = kept as training instances)
MIDI: T−3  T−2  T−1  T   T+1  T+2  T+3
Kept:  ✓    ✓    ✓   ✗    ✓    ✓    ✓
2 Octave endpoints:
MIDI: 60  61  62  63  64  65  66  67  68  69  70  71
Kept:  ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓
I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
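A small sketch of the two splits; the helper names are hypothetical:

```python
def neighbour_context(target):
    """Train on the six neighbours T±1..3; the target pitch T is held out."""
    return [target + d for d in (-3, -2, -1, 1, 2, 3)]

def endpoint_context(octave=range(60, 72)):
    """Train only on the octave endpoints, MIDI 60 and 71."""
    return [m for m in octave if m in (60, 71)]
```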
Reconstruction
[Figure: reconstruction MSE (·10⁻²) vs target pitch (MIDI 60-72), AE vs CVAE]
I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
[Diagram: (CCs, f0) → cAE → (CCs′, f0′)]
Figure: 'Conditional' AE (cAE)
I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0; a sketch follows
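A hedged sketch of this baseline, assuming PyTorch; the sizes mirror the CVAE sketch above:

```python
import torch
import torch.nn as nn

class ConditionalAE(nn.Module):
    """Plain AE that reconstructs the CC frame with f0 appended to it."""
    def __init__(self, n_ccs=91, latent=32):
        super().__init__()
        d = n_ccs + 1  # CCs plus the appended pitch
        self.enc = nn.Sequential(nn.Linear(d, 91), nn.LeakyReLU(),
                                 nn.Linear(91, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 91), nn.LeakyReLU(),
                                 nn.Linear(91, d))

    def forward(self, ccs, f0):
        x = torch.cat([ccs, f0], dim=-1)  # append pitch to the input
        return self.dec(self.enc(x)), x   # reconstruct [CCs, f0] jointly
```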
Reconstruction
[Figure: reconstruction MSE (·10⁻²) vs target pitch (MIDI 60-72), AE vs CVAE vs cAE]
I We train the cAE on the octave endpoints and evaluate performance as shown above
I Appending f0 does not improve performance; it is in fact worse than the plain AE
Generation
I We are interested in 'synthesizing' audio
I We see how well the network can generate an instance of a desired pitch (one the network has not been trained on)
[Diagram: (Z, f0) → Decoder]
Figure: Sampling from the network
I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space, as sketched below
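A minimal sketch of that sampling procedure, reusing the CVAE sketch above; the step size is an assumed free parameter:

```python
import torch

@torch.no_grad()
def generate_frames(model, f0_frames, step=0.05, latent=32):
    """Decode a coherent frame sequence by random-walking the latent space."""
    z = torch.randn(latent)                  # start from the prior
    frames = []
    for f0 in f0_frames:                     # one latent point per frame
        z = z + step * torch.randn(latent)   # small drift keeps frames coherent
        frames.append(model.dec(torch.cat([z, f0.view(1)], dim=-1)))
    return torch.stack(frames)               # frame-wise CCs for synthesis
```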
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15); a pitch-contour sketch follows
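One way to drive such an example is a sinusoidal f0 contour around the target pitch; the vibrato rate and depth here are illustrative assumptions (MIDI 65 ≈ 349.23 Hz):

```python
import numpy as np

def vibrato_f0(f0_hz=349.23, rate_hz=6.0, depth_cents=30.0,
               n_frames=200, frame_rate=100.0):
    """Frame-wise f0 contour with sinusoidal vibrato around f0_hz."""
    t = np.arange(n_frames) / frame_rate
    cents = depth_cents * np.sin(2 * np.pi * rate_hz * t)
    return f0_hz * 2.0 ** (cents / 1200.0)  # cents deviation to Hz
```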
Putting it all together
I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously
Putting it all together
I Our model is still far from perfect, however, and we have to improve it further:
1 An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural
2 We have not taken dynamics into account; a more complete system will capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3 We should conduct more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music
I Our contributions:
1 To the best of our knowledge, no prior work has used a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE
2 The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR research community
References

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.
[Imai 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description
1 Input MIDI 63 to spectral model
2 Spectral model reconstruction (trained on MIDI 63)
3 Spectral model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to parametric model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1136
Parametric Model
1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)
Sustain
FFTHpR
f0Magnitudes
I Output of HpR block =rArr log-dB magnitudes + harmonics
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
Reconstruction
Figure: MSE (·10⁻²) vs target pitch (MIDI 60–72) for AE and CVAE
I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
I The CVAE produces a better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
I Conditioning helps capture the pitch dependency of the spectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
Figure: 'Conditional' AE (cAE): (CCs, f0) → cAE → (CCs′, f0′)
I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
I The cAE is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch follows below)
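A one-function sketch of how the appended input/target pair could be formed (names are ours):

```python
import torch

def make_cae_pair(ccs, f0):
    # Append f0 to the 91-d CC vector; the cAE autoencodes the whole
    # 92-d vector: [CCs, f0] -> [CCs', f0']
    x = torch.cat([ccs, f0.unsqueeze(-1)], dim=-1)
    return x, x  # input and reconstruction target are identical
```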
Reconstruction
Figure: MSE (·10⁻²) vs target pitch (MIDI 60–72) for AE, CVAE, and cAE
I We train the cAE on the octave endpoints and evaluate performance as shown above
I Appending f0 does not improve performance; in fact, the cAE is worse than the plain AE
Generation
I We are interested in 'synthesizing' audio
I We check how well the network can generate an instance of a desired pitch (one the network has not been trained on)
Figure: Sampling from the network: Z → Decoder (conditioned on f0)
I We follow the procedure in [Blaauw and Bonada 2016]: a random walk to sample points coherently from the latent space, as sketched below
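A sketch of such a latent random walk (the step size is our assumption; [Blaauw and Bonada 2016] describe the general procedure):

```python
import torch

def random_walk_latents(n_frames, z_dim=32, step=0.05):
    # Small Gaussian steps keep consecutive latent points, and hence
    # consecutive synthesized envelopes, coherent in time
    z = torch.empty(n_frames, z_dim)
    z[0] = torch.randn(z_dim)
    for t in range(1, n_frames):
        z[t] = z[t - 1] + step * torch.randn(z_dim)
    return z

# e.g. decode with a fixed pitch condition, using the CVAE sketched earlier:
# f0 = torch.full((n_frames, 1), 65.0)
# envelopes = model.dec(torch.cat([random_walk_latents(n_frames), f0], dim=-1))
```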
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
Putting it all together
I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously
Putting it all together
I Our model is still far from perfect, and we have to improve it further:
1 An immediate next step is to model the residual of the input audio; this will make the generated audio sound more natural
2 We have not taken dynamics into account; a more complete system would capture spectral-envelope (timbral) dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music
I Our contributions:
1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2 The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR research community
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1 Input MIDI 63 to spectral model
2 Spectral-model reconstruction (trained on MIDI 63)
3 Spectral-model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to parametric model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset
Audio examples description II
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
Parametric Model
2 log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]
I TAE ⇒ iterative cepstral liftering (sketched in code below):
A_0(k) = log|X(k)|,  V_0 = −∞
while (A_i − A_0 < ∆):
    A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
    V_i = FFT(lifter(A_i))
Figure: (f0, magnitudes) → TAE → (CCs, f0)
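A minimal NumPy sketch of this loop (the stopping tolerance, lifter layout, and function name are our own choices; the slides give only the update equations):

```python
import numpy as np

def tae_envelope(mag_db, k_cc, tol_db=1.0, max_iter=100):
    """mag_db: one frame of log-magnitude spectrum (length N/2 + 1).
    k_cc: number of cepstral coefficients kept by the lifter (k_cc > 1)."""
    A0 = np.asarray(mag_db, dtype=float)
    A = A0.copy()
    V = np.full_like(A, -np.inf)               # V_0 = -inf
    cc = np.zeros(k_cc)
    for _ in range(max_iter):
        A = np.maximum(A, V)                   # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        cep = np.fft.irfft(A)                  # real cepstrum of the current A_i
        lifter = np.zeros_like(cep)
        lifter[:k_cc] = 1.0                    # keep low quefrencies...
        lifter[-(k_cc - 1):] = 1.0             # ...and their symmetric counterpart
        cc = cep[:k_cc]
        V = np.fft.rfft(cep * lifter).real     # V_i = FFT(lifter(A_i))
        if np.max(A0 - V) < tol_db:            # stop once V covers all peaks
            break
    return V, cc                               # smooth envelope and its CCs
```

The per-frame CCs returned by such a loop, zero-padded as described later, would form the network input.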
Parametric Model
I No open-source implementation was available for the TAE, so we implemented it following the procedure described in [Roebel and Rodet 2005, Caetano and Rodet 2012]
I The figure shows a TAE snapshot from [Caetano and Rodet 2012]
I It is similar to the results we obtain (audio examples 4, 5)
Parametric Model
Figure: TAE spectral envelopes, magnitude (dB) vs frequency (kHz), for MIDI pitches 60, 65, and 71
I The spectral-envelope shape varies across pitch:
1 Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
2 Variation due to the TAE algorithm
I TAE → a smooth function for estimating harmonic amplitudes
Parametric Model
I The number of CCs (cepstral coefficients, K_cc) depends on the sampling rate (F_s) and the pitch (f_0):
K_cc ≤ F_s / (2·f_0)
I For our network we choose the maximum K_cc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions (see the sketch below)
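A small sketch of the coefficient count and the zero-padding (function names are ours; the bound is the one above):

```python
import numpy as np

def num_ccs(fs, f0):
    # K_cc <= Fs / (2 * f0): roughly one coefficient per harmonic up to Nyquist
    return int(fs // (2 * f0))

def pad_ccs(cc, target_dim=91):
    # Zero-pad shorter (higher-pitch) CC vectors to the fixed network input size
    out = np.zeros(target_dim)
    n = min(len(cc), target_dim)
    out[:n] = cc[:n]
    return out
```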
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006]: minimize the MSE (mean squared error) between the input and the network's reconstruction
• Simple to train, with good reconstruction (since they minimize the MSE)
• Not truly generative, as they cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013]: inspired by variational inference; enforce a prior on the latent space
• Truly generative, as we can 'generate' new data by sampling from the prior
I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015]: same principle as a VAE, but learn distributions conditioned on an additional conditioning variable
Generative Models
Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio)
Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control
Network Architecture
Figure: CVAE: (CCs, f0) → Encoder → Z → Decoder (conditioned on f0)
I The network input is CCs → the MSE is a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes
I We train the network on frames from the training instances
I For evaluation, the MSE values reported with the hyperparameter plots are the average reconstruction error across all test-instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1236
Parametric Model
2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]
I TAE =rArr Iterative Cepstral Liftering
A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))
Vi = FFT (lifter(Ai ))
f0Magnitudes
TAE
CCs
f0
1336
Parametric Model
I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]
I Figure shows a TAE snap from [Caetano and Rodet 2012]
I Similar to results we get 1 4 2 5
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false
3216
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false
2256
1436
Parametric Model
0 1 2 3 4 5minus80
minus60
minus40
minus20
Frequency(kHz)Magnitude(dB)
606571
I Spectral envelope shape varies across pitch
1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]
2 Variation due the TAE algorithm
I TAE rarr smooth function to estimate harmonic amplitudes
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
Putting it all together

- We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones.
- Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
- Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously.
Putting it all together

- Our model is still far from perfect, however, and we have to improve it further:
  1. An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural.
  2. We have not taken dynamics into account; a more complete system will capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
  3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.
- Our contributions:
  1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.
References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset
Audio examples description II

14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
Parametric Model

- log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, Imai 1979]
- TAE ⇒ iterative cepstral liftering (see the sketch below):

    A_0(k) = log|X(k)|,  V_0 = −∞
    while (A_i − A_0 < Δ):
        A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
        V_i = FFT(lifter(A_i))

  [Figure: f0 and magnitudes → TAE → CCs and f0]
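A minimal NumPy sketch of this loop, operating on a single magnitude-spectrum frame; the threshold value and function name are illustrative assumptions:

import numpy as np

def true_amplitude_envelope(X_mag, kcc, delta=0.1, max_iter=100):
    """Iterative cepstral liftering (TAE) as on the slide above.

    X_mag : magnitude spectrum |X(k)| of one frame (full FFT length)
    kcc   : number of cepstral coefficients kept by the lifter
    delta : stopping threshold in log magnitude (value assumed)
    """
    A0 = np.log(X_mag + 1e-12)            # A_0(k) = log|X(k)|
    A = A0.copy()
    V = np.full_like(A0, -np.inf)         # V_0 = -infinity
    for _ in range(max_iter):
        A = np.maximum(A, V)              # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        c = np.fft.ifft(A).real           # cepstrum of the current envelope
        c[kcc:-kcc] = 0.0                 # lifter: keep only low-quefrency bins
        V = np.fft.fft(c).real            # V_i = FFT(lifter(A_i))
        if np.max(A - V) < delta:         # stop once the envelope covers the peaks
            break
    return V                              # smooth log-magnitude envelope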
Parametric Model

- No open-source implementation was available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012].
- The figure shows a TAE snapshot from [Caetano and Rodet 2012], similar to the results we get (audio examples 4, 5).
Parametric Model

[Figure: TAE spectral envelopes, magnitude (dB, −80 to −20) vs. frequency (0-5 kHz), for MIDI pitches 60, 65, and 71]

- Spectral envelope shape varies across pitch:
  1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
  2. Variation due to the TAE algorithm
- TAE → a smooth function to estimate harmonic amplitudes
Parametric Model

- The number of CCs (cepstral coefficients, K_cc) depends on the sampling rate (F_s) and the pitch (f_0):

    K_cc ≤ F_s / (2 f_0)

- For our network, we choose the maximum K_cc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions, as in the sketch below.
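A small sketch of this bookkeeping, assuming NumPy; the 44.1 kHz sampling rate in the example is an assumption, not a detail from our setup:

import numpy as np

def num_ccs(fs, f0):
    # K_cc <= F_s / (2 * f_0): coefficients up to the limit set by harmonic spacing
    return int(fs // (2 * f0))

def pad_ccs(ccs, target_dim=91):
    # zero-pad a higher-pitch CC vector to the fixed 91-dim network input
    return np.pad(np.asarray(ccs, dtype=float), (0, target_dim - len(ccs)))

# e.g. MIDI 60 (~261.63 Hz) at an assumed Fs of 44.1 kHz:
print(num_ccs(44100, 261.63))   # -> 84, which pad_ccs() extends to 91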
Generative Models

- Autoencoders [Hinton and Salakhutdinov 2006]: minimize the MSE (mean squared error) between the input and the network reconstruction.
  • Simple to train, and performs good reconstruction (since they minimize MSE)
  • Not truly a generative model, as you cannot generate new data
- Variational Autoencoders [Kingma and Welling 2013]: inspired by variational inference; enforce a prior on the latent space.
  • Truly generative, as we can 'generate' new data by sampling from the prior
- Conditional Variational Autoencoders [Doersch 2016, Sohn et al. 2015]: same principle as a VAE, but learns the conditional distribution given an additional conditioning variable.
Generative Models

Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control.
Network Architecture

[Figure: CVAE block diagram: CCs and f0 → Encoder → Z; Z and f0 → Decoder]

- The network input is CCs → the MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log-magnitude spectral envelopes.
- We train the network on frames from the training instances; a model sketch follows below.
- For evaluation, the MSE values reported ahead are the average reconstruction error across all test-instance frames.
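A minimal PyTorch sketch of this architecture, using the 91-dimensional CCs and a scalar f0 condition; the exact hidden layout is an assumption, not our released code:

import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Minimal CVAE sketch; the 91/32 sizes follow the slides, while the
    precise layer arrangement here is illustrative."""
    def __init__(self, x_dim=91, z_dim=32, h_dim=91):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + 1, h_dim), nn.LeakyReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + 1, h_dim), nn.LeakyReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, ccs, f0):
        h = self.enc(torch.cat([ccs, f0], dim=-1))        # encoder sees CCs + f0
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        x_hat = self.dec(torch.cat([z, f0], dim=-1))      # decoder sees z + f0
        return x_hat, mu, logvar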
Network Architecture

- Main hyperparameters:
  1. β: controls the relative weighting between reconstruction and prior enforcement:

     L ∝ MSE + β·KLD

[Figure: CVAE MSE (scale ×10⁻⁵) vs. latent space dimensionality (0-70) for β = 0.01, 0.1, 1]

- Tradeoff between both terms; we choose β = 0.1. A loss sketch follows below.
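A sketch of this weighted loss, assuming the mu/logvar outputs of the CVAE sketch above:

import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    # L ∝ MSE + β·KLD, with β trading reconstruction against prior enforcement
    mse = F.mse_loss(x_hat, x)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld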
Network Architecture

- Main hyperparameters:
  2. Dimensionality of the latent space: governs the network's reconstruction ability.

[Figure: MSE (scale ×10⁻⁵) vs. latent space dimensionality (0-70), CVAE (β = 0.1) vs. AE]

- Steep fall initially, flatter later; we choose dimensionality = 32.
Network Architecture

- Network size: [91, 91, 32, 91, 91]
  (91 is the dimension of the input CCs; 32 is the latent space dimensionality)
- Linear fully connected layers + Leaky ReLU activations
- Adam [Kingma and Ba 2014] with initial lr = 10⁻³
- Training for 2000 epochs with batch size 512; a training-loop sketch follows below.
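A sketch of this training setup, reusing the CVAE and loss sketches above; `train_frames` is a hypothetical dataset yielding (CCs, f0) frame pairs:

import torch

model = CVAE()                                            # sketch class above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)       # Adam, initial lr = 10^-3
loader = torch.utils.data.DataLoader(train_frames, batch_size=512, shuffle=True)

for epoch in range(2000):                                 # 2000 epochs, batch size 512
    for ccs, f0 in loader:
        x_hat, mu, logvar = model(ccs, f0)
        loss = cvae_loss(x_hat, ccs, mu, logvar, beta=0.1)
        opt.zero_grad()
        loss.backward()
        opt.step()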
Experiments

- Two kinds of experiments to demonstrate the network's capabilities:
  1. Reconstruction: omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch.
  2. Generation: how well the model 'synthesizes' note instances with new, unseen pitches.
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1436
Parametric Model
Figure: Spectral envelopes, magnitude (dB) vs frequency (kHz), for MIDI pitches 60, 65 and 71
I The spectral envelope shape varies across pitch:
1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
2. Variation due to the TAE algorithm
I TAE (True Amplitude Envelope) → a smooth function used to estimate the harmonic amplitudes
Parametric Model
I The number of cepstral coefficients (CCs), Kcc, depends on the sampling rate (Fs) and the pitch (f0):

Kcc ≤ Fs / (2 · f0)

I For our network, we choose the maximum Kcc (at the lowest pitch) = 91, and zero-pad the higher-pitch CC vectors to 91 dimensions (a preprocessing sketch follows)
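For illustration, a minimal sketch of this preprocessing (our own code, not from the original implementation; function and variable names are hypothetical):

```python
import numpy as np

def num_ccs(fs, f0):
    # Kcc <= Fs / (2 * f0): lower pitches admit (and need) more coefficients
    return int(fs // (2 * f0))

def pad_ccs(ccs, target_dim=91):
    # Zero-pad a higher-pitch CC vector to the fixed 91-dim network input
    out = np.zeros(target_dim)
    out[:len(ccs)] = ccs
    return out
```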
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (mean squared error) between the input and the network's reconstruction
• Simple to train, and perform good reconstruction (since they directly minimize MSE)
• Not truly a generative model, as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] - inspired by variational inference; enforce a prior on the latent space
• Truly generative, as we can 'generate' new data by sampling from the prior
I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, but learn the conditional distribution given an additional conditioning variable
Generative Models
Why VAE over AE?
Continuous latent space from which we can sample points (and synthesize the corresponding audio).
Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control.
Network Architecture
Figure: CVAE architecture: the input CCs and the conditioning pitch f0 pass through the Encoder to the latent variable Z, and the Decoder reconstructs the CCs from Z and f0
I The network input is CCs → the MSE represents a perceptually relevant distance, in terms of the squared error between the input and reconstructed log magnitude spectral envelopes
I We train the network on frames from the training instances
I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test-instance frames
Network Architecture
I Main hyperparameters:
1. β - controls the relative weighting between reconstruction and prior enforcement:
L ∝ MSE + β · KLD
Figure: CVAE with varying β: reconstruction MSE (x10^-5) vs latent space dimensionality, for β = 0.01, 0.1, 1
I There is a tradeoff between the two terms; we choose β = 0.1
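As a concrete sketch, the weighted loss could look as follows in PyTorch (assuming the encoder outputs a Gaussian posterior via mu and logvar; this is our illustration, not the original code):

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    # Reconstruction term: MSE between input and reconstructed CCs
    mse = F.mse_loss(x_hat, x)
    # KL divergence between the posterior N(mu, sigma^2) and the prior N(0, I)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # beta trades reconstruction quality against prior enforcement
    return mse + beta * kld
```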
Network Architecture
I Main hyperparameters:
2. Dimensionality of the latent space - governs the network's reconstruction ability
Figure: CVAE (β = 0.1) vs AE: reconstruction MSE (x10^-5) vs latent space dimensionality
I Steep fall initially, flatter later; we choose a dimensionality of 32
Network Architecture
I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10^-3
I Training for 2000 epochs with batch size 512
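A minimal sketch of such a network in PyTorch, using the sizes above (the exact way f0 enters the encoder and decoder, and the reparameterization details, are our assumptions, not the original code):

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Sketch of a [91, 91, 32, 91, 91] CVAE conditioned on pitch f0."""
    def __init__(self, cc_dim=91, latent_dim=32):
        super().__init__()
        # Encoder: CCs concatenated with f0 -> hidden -> (mu, logvar)
        self.enc = nn.Sequential(nn.Linear(cc_dim + 1, 91), nn.LeakyReLU())
        self.mu = nn.Linear(91, latent_dim)
        self.logvar = nn.Linear(91, latent_dim)
        # Decoder: latent z concatenated with f0 -> hidden -> CCs
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + 1, 91), nn.LeakyReLU(),
            nn.Linear(91, cc_dim))

    def forward(self, ccs, f0):
        # ccs: (batch, 91) cepstral coefficients; f0: (batch, 1) pitch
        h = self.enc(torch.cat([ccs, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.dec(torch.cat([z, f0], dim=-1))
        return x_hat, mu, logvar

model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr = 10^-3
```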
Experiments
I Two kinds of experiments to demonstrate the network's capabilities:
1. Reconstruction - omit pitch instances during training, and see how well the model reconstructs notes of the omitted target pitch
2. Generation - see how well the model 'synthesizes' note instances with new, unseen pitches
Reconstruction
I Two training contexts (a split-construction sketch follows the list):
1. T (×) is the target; ✓ marks the training instances:

   MIDI | T-3 | T-2 | T-1 |  T  | T+1 | T+2 | T+3
   Kept |  ✓  |  ✓  |  ✓  |  ×  |  ✓  |  ✓  |  ✓

2. Octave endpoints:

   MIDI | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71
   Kept |  ✓ |  × |  × |  × |  × |  × |  × |  × |  × |  × |  × |  ✓

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
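A small sketch of how these splits might be constructed (hypothetical helpers, purely illustrative):

```python
# Octave-endpoints context: train on MIDI 60 and 71 only,
# hold out MIDI 61..70 as reconstruction targets
all_pitches = list(range(60, 72))
train_pitches = [60, 71]
target_pitches = [p for p in all_pitches if p not in train_pitches]

# 'T +/- 3' context: train on the three neighbours on either side of target T
def neighbour_split(T):
    train = [T + d for d in (-3, -2, -1, 1, 2, 3)]
    return train, [T]
```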
Reconstruction
Figure: AE vs CVAE: reconstruction MSE (x10^-2) vs target pitch (MIDI 60-72)
I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
Figure: 'Conditional' AE (cAE): the appended input (f0, CCs) is reconstructed as (f0', CCs')
I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch of the input construction follows)
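A sketch of the cAE input construction under this description (hypothetical names; the cAE itself is a plain autoencoder over the appended vector):

```python
import torch

def cae_input(ccs, f0):
    # Append the pitch to the CCs: the cAE autoencodes the
    # 92-dim vector [f0, CCs] and must reconstruct both parts
    return torch.cat([f0, ccs], dim=-1)

def split_cae_output(y):
    # Recover (f0', CCs') from the reconstructed vector
    return y[..., :1], y[..., 1:]
```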
Reconstruction
Figure: AE vs CVAE vs cAE: reconstruction MSE (x10^-2) vs target pitch (MIDI 60-72)
I We train the cAE on the octave endpoints and evaluate its performance as shown above
I Appending f0 does not improve performance; it is in fact worse than the plain AE
Generation
I We are interested in 'synthesizing' audio
I We examine how well the network can generate an instance of a desired pitch (one the network has not been trained on)
Figure: Sampling from the network: a latent point Z is passed through the Decoder along with the desired f0
I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space, as sketched below
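A minimal sketch of this sampling procedure (a Gaussian random walk in the latent space, decoded frame by frame with the desired f0; the step size and names are our assumptions):

```python
import torch

@torch.no_grad()
def generate_envelopes(decoder, f0, n_frames, latent_dim=32, step=0.1):
    # f0: scalar tensor holding the desired pitch.
    # Random walk: consecutive latent points stay close together,
    # so the decoded envelopes vary smoothly from frame to frame.
    z = torch.randn(latent_dim)
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(latent_dim)
        frames.append(decoder(torch.cat([z, f0.view(1)])))
    return torch.stack(frames)
```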
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
Putting it all together
I We explored autoencoder frameworks within generative models for the audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning lets us generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously
Putting it all together
I Our model is still far from perfect, however, and we have to improve it further:
1. An immediate next step is to model the residual of the input audio; this will make the generated audio sound more natural
2. We have not taken dynamics into account; a more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music
I Our contributions:
1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE
2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1. Input MIDI 63 note to the Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the Parametric Model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
1536
Parametric Model
I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)
Kcc leFs2f0
I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (Mean Squared Error) between the input and the network reconstruction
• Simple to train, and give good reconstruction (since they directly minimize the MSE)
• Not truly generative models, as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] - inspired by Variational Inference; enforce a prior on the latent space
• Truly generative, as we can 'generate' new data by sampling from the prior (sketched below)
I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, but learn the distribution conditioned on an additional conditioning variable
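To make 'sampling from the prior' concrete, here is a minimal sketch (not the thesis code) of the VAE reparameterization used during training and the prior sampling used at generation time; the 32-dimensional latent size anticipates the choice made later in the talk.

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, with eps ~ N(0, I): differentiable sampling."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

# At generation time the encoder is bypassed entirely:
latent_dim = 32                  # the dimensionality chosen later in the talk
z = torch.randn(1, latent_dim)   # draw straight from the prior N(0, I)
# x_hat = decoder(z)             # a trained decoder would map z to new data
```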
Generative Models
Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio).
Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control.
Network Architecture
[Figure: CVAE block diagram - the input CCs and the pitch condition f0 feed the Encoder, which produces the latent Z; the Decoder reconstructs the CCs from Z and f0]
I Network input is CCs → the MSE represents a perceptually relevant distance, in terms of the squared error between the input and reconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames (a sketch of the model follows)
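As one concrete, hedged reading of the diagram, the following PyTorch sketch wires the pieces together with the layer sizes given later in the talk ([91, 91, 32, 91, 91], fully connected layers with Leaky ReLU); representing f0 as a single scalar appended to the input is our illustrative assumption, not a detail confirmed by the slides.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, n_ccs=91, cond_dim=1, hidden=91, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_ccs + cond_dim, hidden),
                                 nn.LeakyReLU())
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent + cond_dim, hidden),
                                 nn.LeakyReLU(),
                                 nn.Linear(hidden, n_ccs))

    def forward(self, ccs, f0):
        h = self.enc(torch.cat([ccs, f0], dim=-1))       # condition the encoder
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = self.dec(torch.cat([z, f0], dim=-1))     # condition the decoder
        return recon, mu, logvar
```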
Network Architecture
I Main hyperparameters:
1 β - controls the relative weighting between reconstruction and prior enforcement:
L ∝ MSE + β · KLD
[Figure: CVAE reconstruction MSE vs. latent space dimensionality, for β = 0.01, 0.1 and 1]
I Tradeoff between both terms; we choose β = 0.1 (the loss is sketched below)
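A minimal sketch of this objective, assuming the usual closed-form KL divergence between the encoder's Gaussian posterior and the standard-normal prior:

```python
import torch
import torch.nn.functional as F

def cvae_loss(recon, target, mu, logvar, beta=0.1):
    mse = F.mse_loss(recon, target)          # envelope reconstruction term
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld                  # beta weighs prior enforcement
```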
Network Architecture
I Main hyperparameters:
2 Dimensionality of the latent space - governs the network's reconstruction ability
[Figure: Reconstruction MSE vs. latent space dimensionality, CVAE (β = 0.1) vs. AE]
I Steep fall initially, flatter later; we choose dimensionality = 32
Network Architecture
I Network size [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³
I Training for 2000 epochs with batch size 512 (training loop sketched below)
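Put together with the earlier sketches, a hedged version of the training loop might look like the following; the random TensorDataset is a dummy stand-in for the actual (CC-frame, f0) data pipeline, which the slides do not specify.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in data: real training would use CC frames + per-frame f0.
dataset = TensorDataset(torch.randn(4096, 91), torch.rand(4096, 1))

model = CVAE()                                       # sketch defined earlier
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr = 10^-3
loader = DataLoader(dataset, batch_size=512, shuffle=True)

for epoch in range(2000):
    for ccs, f0 in loader:
        recon, mu, logvar = model(ccs, f0)
        loss = cvae_loss(recon, ccs, mu, logvar, beta=0.1)
        opt.zero_grad()
        loss.backward()
        opt.step()
```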
Experiments
I Two kinds of experiments to demonstrate the network's capabilities:
1 Reconstruction - omit a pitch's instances during training and see how well the model reconstructs notes of the omitted target pitch
2 Generation - how well the model 'synthesizes' note instances with new, unseen pitches
Reconstruction
I Two training contexts:
1 T (×) is the target, X marks training instances:
MIDI: T−3 T−2 T−1  T  T+1 T+2 T+3
Kept:  X   X   X   ×   X   X   X
2 Octave endpoints:
MIDI: 60 61 62 63 64 65 66 67 68 69 70 71
Kept:  X  ×  ×  ×  ×  ×  ×  ×  ×  ×  ×  X
I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (splits sketched below)
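A hedged sketch of these two splits and the frame-wise error; the helper names are ours, not taken from the thesis code.

```python
import numpy as np

def split_around_target(target):
    """Context 1: train on the six neighbours T±1..3, hold out T."""
    train = [m for m in range(target - 3, target + 4) if m != target]
    return train, [target]

def split_octave_endpoints():
    """Context 2: train only on the endpoints MIDI 60 and 71."""
    return [60, 71], list(range(61, 71))

def framewise_mse(recon_frames, target_frames):
    """Average squared log-envelope error over all target frames."""
    return float(np.mean((recon_frames - target_frames) ** 2))
```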
Reconstruction
[Figure: Reconstruction MSE vs. target pitch (MIDI 60-72) for AE and CVAE]
I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6-8)
I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11)
I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
[Figure: 'Conditional' AE (cAE) - the concatenated input (CCs, f0) is autoencoded to (CCs′, f0′)]
I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (input construction sketched below)
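For clarity, a minimal sketch of the cAE's input/target construction, under our assumption that f0 is a single scalar per frame:

```python
import torch

def make_cae_io(ccs, f0):
    """ccs: (batch, 91) CC frames; f0: (batch, 1) pitch.
    The cAE both receives and reconstructs this 92-dim vector,
    so the pitch is an ordinary input dimension, not a condition."""
    x = torch.cat([ccs, f0], dim=-1)
    return x, x                      # input and reconstruction target coincide
```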
Reconstruction
[Figure: Reconstruction MSE vs. target pitch for AE, CVAE and cAE]
I We train the cAE on the octave endpoints and evaluate performance as shown above
I Appending f0 does not improve performance; it is in fact worse than the AE
Generation
I We are interested in 'synthesizing' audio
I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)
[Figure: Sampling from the network - the latent Z and the pitch f0 feed the Decoder]
I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (sketched below)
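A hedged sketch of that sampling scheme, in the spirit of [Blaauw and Bonada 2016]: small Gaussian steps keep consecutive latent points, and hence consecutive envelopes, close to each other; the step size is an illustrative choice.

```python
import torch

def random_walk_latents(n_frames, latent_dim=32, step=0.05):
    z = torch.zeros(n_frames, latent_dim)
    z[0] = torch.randn(latent_dim)                    # start from the prior
    for t in range(1, n_frames):
        z[t] = z[t - 1] + step * torch.randn(latent_dim)
    return z

# z_frames = random_walk_latents(200)
# envelopes = model.dec(torch.cat([z_frames, f0_frames], dim=-1))
```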
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12-14)
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15; pitch contour sketched below)
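Since the model is conditioned on f0 frame by frame, vibrato reduces to feeding it a modulated pitch contour; a minimal sketch follows (the rate and depth are illustrative values, not taken from the thesis).

```python
import numpy as np

def vibrato_f0(midi=65, n_frames=200, hop_s=0.005, rate_hz=6.0, cents=50.0):
    """Per-frame f0 contour: a sinusoidal deviation around the target note."""
    t = np.arange(n_frames) * hop_s
    midi_contour = midi + (cents / 100.0) * np.sin(2 * np.pi * rate_hz * t)
    return 440.0 * 2.0 ** ((midi_contour - 69) / 12.0)   # MIDI -> Hz
```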
Putting it all together
I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously
Putting it all together
I Our model is still far from perfect, however, and we have to improve it further:
1 An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural
2 We have not taken dynamics into account; a more complete system will capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music
I Our contributions:
1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research (core iteration sketched below)
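For reference, a hedged sketch of the true amplitude envelope (TAE) iteration [IMAI 1979, Roebel and Rodet 2005] that such an implementation would centre on: repeatedly cepstrally smooth the log spectrum, replacing it by the pointwise maximum of the original and the current smooth curve, until the envelope lies above the spectral peaks. The cepstral order, iteration cap and tolerance here are illustrative choices, not the thesis's settings.

```python
import numpy as np

def cepstral_smooth(log_mag, order):
    ceps = np.fft.irfft(log_mag)
    ceps[order:len(ceps) - order] = 0.0      # low-quefrency liftering
    return np.fft.rfft(ceps).real

def true_amplitude_envelope(log_mag, order=91, n_iter=100, tol=1e-4):
    env = cepstral_smooth(log_mag, order)
    target = log_mag.copy()
    for _ in range(n_iter):
        target = np.maximum(target, env)     # lift troughs up to the envelope
        env = cepstral_smooth(target, order)
        if np.all(log_mag <= env + tol):     # envelope now covers all peaks
            break
    return env
```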
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction (trained on MIDI 63)
3 Spectral Model Reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint trained model)
10 Parametric AE reconstruction of input (endpoint trained model)
11 Parametric CVAE reconstruction of input (endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
23/36
Reconstruction
▶ Two training contexts:
1. T (×) is the target; ✓ marks training instances:
   MIDI   T - 3   T - 2   T - 1    T    T + 1   T + 2   T + 3
   Kept     ✓       ✓       ✓      ×      ✓       ✓       ✓
2. Octave endpoints:
   MIDI   60   61   62   63   64   65
   Kept    ✓    ×    ×    ×    ×    ×
   MIDI   66   67   68   69   70   71
   Kept    ×    ×    ×    ×    ×    ✓
▶ In each case, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
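As a sketch, this per-context evaluation reduces to averaging the squared envelope error over frames (the array shapes are assumptions):

    import numpy as np

    def reconstruction_mse(true_env, recon_env):
        # Both arrays: (num_frames, num_bins) log magnitude spectral envelopes,
        # pooled over all frames of all target-pitch test instances.
        return np.mean((true_env - recon_env) ** 2)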
24/36
Reconstruction
Figure: reconstruction MSE (×10⁻²) vs target pitch (MIDI 60-72) for the AE and the CVAE
▶ Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
▶ The CVAE produces better reconstructions, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
25/36
Reconstruction
▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
Figure: 'Conditional' AE (cAE) - the appended input [CCs, f0] is autoencoded to [CCs′, f0′]
▶ [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model
▶ The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
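A minimal sketch of this baseline, reusing the constants from the CVAE sketch above; the exact layer layout is an assumption for illustration:

    import torch
    import torch.nn as nn

    class ConditionalAE(nn.Module):
        # Plain autoencoder over the pitch-appended input [CCs, f0];
        # both the CCs and f0 are reconstructed at the output.
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(CCS_DIM + 1, 91), nn.LeakyReLU(),
                nn.Linear(91, LATENT_DIM))
            self.decoder = nn.Sequential(
                nn.Linear(LATENT_DIM, 91), nn.LeakyReLU(),
                nn.Linear(91, CCS_DIM + 1))

        def forward(self, ccs, f0):
            x = torch.cat([ccs, f0], dim=-1)
            return self.decoder(self.encoder(x))  # [CCs′, f0′]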
26/36
Reconstruction
Figure: reconstruction MSE (×10⁻²) vs pitch (MIDI 60-72) for the AE, the CVAE and the cAE
▶ We train the cAE on the octave endpoints and evaluate performance as shown above
▶ Appending f0 does not improve performance; it is in fact worse than the plain AE
27/36
Generation
▶ We are interested in 'synthesizing' audio
▶ We check how well the network can generate an instance of a desired pitch on which it has not been trained
Figure: sampling from the network (a latent point Z and the conditioning f0 are fed to the Decoder)
▶ We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space
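A sketch of one plausible reading of this sampling procedure, reusing the CVAE sketch above; the starting point (the prior mean) and the step size sigma are assumptions, not values from [Blaauw and Bonada, 2016]:

    import torch

    @torch.no_grad()
    def generate(model, f0_contour, sigma=0.05):
        # Random walk in latent space: each frame's z is a small step from the
        # previous one, so consecutive envelopes vary smoothly (coherently).
        z = torch.zeros(1, LATENT_DIM)  # start at the prior mean (assumption)
        out = []
        for f0 in f0_contour:  # desired per-frame pitch values
            z = z + sigma * torch.randn_like(z)
            cond = torch.tensor([[float(f0)]])
            out.append(model.decoder(torch.cat([z, cond], dim=-1)))
        return torch.cat(out)  # one CC envelope frame per f0 value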
28/36
Generation
▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
▶ On listening, the generated audio can be heard to still lack the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)
▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
29/36
Putting it all together
▶ We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones
▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
▶ Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously
30/36
Putting it all together
▶ Our model is still far from perfect, however, and we have to improve it further:
1. An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural
2. We have not taken dynamics into account; a more complete system would capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music
▶ Our contributions:
1. To the best of our knowledge, no prior work uses a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2. The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR research community
31/36
References I
[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
32/36
References II
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
33/36
References III
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
34/36
References IV
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
35/36
Audio examples description I
1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset
36/36
Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1636
Generative Models
I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction
bull Simple to train and performs good reconstruction(since theyminimize MSE)
bull Not truly a generative model as you cannot generate new data
I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space
bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior
I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
Generative Models
I Autoencoders [Hinton and Salakhutdinov, 2006] - Minimize the MSE (Mean Squared Error) between the input and the network's reconstruction
• Simple to train, and gives good reconstruction (since it directly minimizes the MSE)
• Not truly a generative model, as we cannot generate new data
I Variational Autoencoders [Kingma and Welling, 2013] - Inspired by Variational Inference; enforce a prior on the latent space
• Truly generative, as we can 'generate' new data by sampling from the prior
I Conditional Variational Autoencoders [Doersch, 2016, Sohn et al., 2015] - Same principle as a VAE, but learns the distribution conditioned on an additional conditioning variable (see the sketch below)
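A minimal PyTorch sketch of a pitch-conditioned VAE in this spirit; the layer sizes anticipate the architecture described later, while the exact layout, names, and shapes are illustrative assumptions rather than the slides' precise implementation.

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        """Minimal CVAE over cepstral-coefficient (CC) frames. The conditioning
        variable f0 (one value per frame, shape (batch, 1)) is appended to both
        the encoder and the decoder inputs."""
        def __init__(self, cc_dim=91, hid=91, z_dim=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(cc_dim + 1, hid), nn.LeakyReLU())
            self.mu = nn.Linear(hid, z_dim)        # posterior mean
            self.logvar = nn.Linear(hid, z_dim)    # posterior log-variance
            self.dec = nn.Sequential(nn.Linear(z_dim + 1, hid), nn.LeakyReLU(),
                                     nn.Linear(hid, cc_dim))

        def forward(self, cc, f0):
            h = self.enc(torch.cat([cc, f0], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
            return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar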
Generative Models
Why VAE over AE?
Continuous latent space from which we can sample points (and synthesize the corresponding audio)
Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control
Network Architecture
Figure: CVAE block diagram - the Encoder maps the input CCs (with f0 appended) to the latent variable Z, and the Decoder maps Z (with f0 appended) back to CCs
I The network input is CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes (see the snippet below)
I We train the network on frames from the training instances
I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames
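For concreteness, a hedged NumPy sketch of how a log-magnitude envelope follows from real cepstral coefficients (the function name and frequency-grid size are assumptions); because the cosine basis is orthogonal, a squared error between CC vectors corresponds, up to fixed weights, to a squared error between the log-magnitude envelopes.

    import numpy as np

    def ccs_to_log_envelope(ccs, n_freq=513):
        """Log-magnitude envelope from real cepstral coefficients:
        env(w) = c0 + 2 * sum_m c_m cos(m * w), on a uniform grid over [0, pi]."""
        w = np.linspace(0.0, np.pi, n_freq)     # frequency grid, (n_freq,)
        m = np.arange(1, len(ccs))              # quefrency indices 1..M-1
        basis = np.cos(np.outer(w, m))          # (n_freq, M-1) cosine basis
        return ccs[0] + 2.0 * basis @ ccs[1:]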
Network Architecture
I Main hyperparameters -
1 β - Controls the relative weighting between reconstruction and prior enforcement:
L ∝ MSE + β · KLD
Figure: MSE vs. latent space dimensionality for the CVAE, varying β ∈ {0.01, 0.1, 1}
I Tradeoff between both terms; we choose β = 0.1 (a loss sketch follows)
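As a sketch, the per-batch loss with the diagonal-Gaussian KL term written out, matching the slide's choice of β = 0.1 and assuming the CVAE sketch above (the function name is ours):

    import torch
    import torch.nn.functional as F

    def cvae_loss(recon, cc, mu, logvar, beta=0.1):
        """L = MSE + beta * KLD, with the KL divergence of the diagonal
        Gaussian posterior N(mu, exp(logvar)) from the standard normal prior."""
        mse = F.mse_loss(recon, cc)
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return mse + beta * kld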
Network Architecture
I Main hyperparameters -
2 Dimensionality of the latent space - governs the network's reconstruction ability
Figure: MSE vs. latent space dimensionality, CVAE (β = 0.1) vs. AE
I Steep fall initially, flatter later; we choose dimensionality = 32
Network Architecture
I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I ADAM [Kingma and Ba, 2014] with initial lr = 10⁻³
I Training for 2000 epochs with batch size 512 (a training sketch follows)
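A hedged training-loop sketch tying the above together; the CVAE class and cvae_loss are the earlier sketches, and the tensors here are random stand-ins for the dataset's CC/f0 frames, not real data.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in data: real training would use CC/f0 frames from the dataset.
    cc_frames = torch.randn(10000, 91)          # hypothetical CC frames
    f0_frames = torch.rand(10000, 1)            # hypothetical (normalized) pitch
    loader = DataLoader(TensorDataset(cc_frames, f0_frames),
                        batch_size=512, shuffle=True)

    model = CVAE(cc_dim=91, hid=91, z_dim=32)   # the sketch shown earlier
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(2000):
        for cc, f0 in loader:
            recon, mu, logvar = model(cc, f0)
            loss = cvae_loss(recon, cc, mu, logvar, beta=0.1)
            opt.zero_grad()
            loss.backward()
            opt.step()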
Experiments
I Two kinds of experiments to demonstrate the network's capabilities:
1 Reconstruction - Omit pitch instances during training, and see how well the model reconstructs notes of the omitted target pitch
2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches
Reconstruction
I Two training contexts -
1 T (×) is the target; ✓ marks the kept training instances:
MIDI: T−3  T−2  T−1   T   T+1  T+2  T+3
Kept:  ✓    ✓    ✓    ×    ✓    ✓    ✓
2 Octave endpoints:
MIDI: 60  61  62  63  64  65  66  67  68  69  70  71
Kept:  ✓   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   ✓
I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (see the sketch below)
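A sketch of this evaluation, assuming the CVAE sketch from earlier and tensors holding all frames of the held-out target instances (the function name is ours):

    import torch

    def average_test_mse(model, cc_frames, f0_frames):
        """Frame-wise reconstruction MSE, averaged over all frames of all
        held-out target-pitch instances (the quantity plotted next)."""
        model.eval()
        with torch.no_grad():
            recon, _, _ = model(cc_frames, f0_frames)
            return ((recon - cc_frames) ** 2).mean(dim=1).mean().item()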
Reconstruction
Figure: MSE vs. target pitch (MIDI 60-72) for the AE and the CVAE
I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
Figure: 'Conditional' AE (cAE) - input [CCs, f0], output [CCs′, f0′]
I [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch follows)
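A minimal sketch of such a cAE, under the assumption that it mirrors the CVAE's layer sizes; the exact layout is an assumption of ours, not necessarily the slides'.

    import torch.nn as nn

    # 'Conditional' AE: f0 is simply appended to the input and reconstructed
    # along with the CCs (no prior, no sampling), so any use of f0 must be
    # learnt implicitly by the autoencoder.
    cae = nn.Sequential(
        nn.Linear(92, 91), nn.LeakyReLU(),   # input  = 91 CCs + f0
        nn.Linear(91, 32), nn.LeakyReLU(),   # bottleneck of 32, as in the CVAE
        nn.Linear(32, 91), nn.LeakyReLU(),
        nn.Linear(91, 92),                   # output = 91 CCs' + f0'
    )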
Reconstruction
Figure: MSE vs. target pitch (MIDI 60-72) for the AE, CVAE and cAE
I We train the cAE on the octave endpoints and evaluate its performance as shown above
I Appending f0 does not improve performance; it is in fact worse than the plain AE
Generation
I We are interested in 'synthesizing' audio
I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)
Figure: Sampling from the network - a sampled Z, with the desired f0 appended, is fed to the Decoder
I We follow the procedure in [Blaauw and Bonada, 2016] - a random walk to sample points coherently from the latent space, sketched below
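A sketch of that sampling procedure, reusing the decoder from the CVAE sketch; the step size and frame count are illustrative assumptions.

    import torch

    def random_walk_generate(model, f0, n_frames=200, z_dim=32, step=0.05):
        """Decode a slowly varying latent trajectory at a fixed, possibly
        unseen pitch f0 (a tensor of shape (1,))."""
        z = torch.zeros(z_dim)
        out = []
        with torch.no_grad():
            for _ in range(n_frames):
                z = z + step * torch.randn(z_dim)    # coherent random walk
                out.append(model.dec(torch.cat([z, f0]).unsqueeze(0)))
        return torch.cat(out)                        # (n_frames, 91) CC frames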
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15); a pitch-contour sketch follows
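For instance, a hypothetical frame-rate vibrato contour for the conditioning f0 might look like the following; the rate, depth, and hop values are made up for illustration, not the ones used for the audio example.

    import numpy as np

    def vibrato_f0(f0_hz, dur=2.0, rate=6.0, depth_semitones=0.3, hop=0.005):
        """Sinusoidal pitch deviation (in semitones) around the note's f0,
        one value per synthesis frame."""
        t = np.arange(0.0, dur, hop)
        dev = depth_semitones * np.sin(2 * np.pi * rate * t)
        return f0_hz * 2.0 ** (dev / 12.0)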
Putting it all together
I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously
References I
[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1 Input MIDI 63 to the Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to the Parametric Model
5 Parametric reconstruction of the input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of the input
8 Parametric CVAE reconstruction of the input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of the input (endpoint-trained model)
11 Parametric CVAE reconstruction of the input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from the dataset
Audio examples description II
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
Putting it all together
▶ Our model is still far from perfect, however, and we have to improve it further:
1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural.
2. We have not taken dynamics into account; a more complete system would capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.
▶ Our contributions:
1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
2. The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR research community.
References I
[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1068–1077. JMLR.org.
[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1. Input MIDI 63 note to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset
Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1736
Generative Models
Why VAE over AE
Continuous latent space from which we can sample points(andsynthesize the corresponding audio)
Why CVAE over VAE
Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1836
Network Architecture
CCs
f0
En
cod
er
Z
Dec
od
er
CVAE
I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes
I We train the network on frames from the train instances
I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
Reconstruction

• To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below.

Figure: the 'conditional' AE (cAE) - the input [CCs, f0] is encoded and reconstructed as [CCs′, f0′].

• [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model.
• The cAE is comparable to our proposed CVAE in that the network might potentially learn something from f0; a minimal sketch of the two input conventions follows.
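To make the distinction concrete, here is a small numpy sketch of how the inputs and targets differ between the two setups; the array shapes and variable names are illustrative assumptions, not our actual code.

import numpy as np

rng = np.random.default_rng(0)
ccs = rng.standard_normal(91)   # one frame of cepstral coefficients (CCs)
f0 = 65.0                       # pitch of the note (MIDI)

# cAE: f0 is appended to the input and is itself reconstructed,
# so the 92-dimensional vector [CCs, f0] is both input and target.
x_cae = np.concatenate([ccs, [f0]])
target_cae = x_cae

# CVAE (for contrast): f0 conditions the encoder and decoder,
# but only the 91 CCs are reconstructed.
x_cvae, cond = ccs, np.array([f0])
target_cvae = ccs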
Reconstruction

Figure: frame-wise MSE (on the order of 10⁻²) vs. pitch, MIDI 60-72, for the AE, CVAE and cAE under octave-endpoint training.

• We train the cAE on the octave endpoints and evaluate its performance as above.
• Appending f0 does not improve performance; the cAE is in fact worse than the plain AE.
Generation

• We are interested in 'synthesizing' audio.
• We test how well the network can generate an instance of a desired pitch on which it has not been trained.

Figure: sampling from the network - a latent vector Z and the desired pitch f0 are fed to the decoder.

• We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space, sketched below.
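A minimal sketch of that sampling procedure, assuming a trained pitch-conditioned decoder (the `decoder` below is a hypothetical stand-in): small Gaussian steps keep consecutive latent points, and hence consecutive decoded envelopes, coherent.

import numpy as np

def random_walk_latents(n_frames, dim=32, step=0.05, seed=0):
    # Start from a draw from the standard-normal prior, then take small steps.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(dim)
    walk = []
    for _ in range(n_frames):
        z = z + step * rng.standard_normal(dim)
        walk.append(z.copy())
    return np.stack(walk)

# Decode each point together with the desired (possibly unseen) pitch:
# envelopes = [decoder(z, f0=65) for z in random_walk_latents(200)]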
Generation

• We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
• On listening, the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12-14).
• For a more realistic synthesis, we generate a violin note with vibrato (audio example 15), e.g. by modulating the f0 contour as sketched below.
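One simple way to realise the vibrato, sketched here, is to modulate the f0 contour sinusoidally (in cents) around the note before feeding it to the conditioned decoder; the rate and depth values are illustrative assumptions.

import numpy as np

def vibrato_f0(midi=65, rate_hz=6.0, depth_cents=30.0, dur_s=1.0, frame_rate=200):
    # f0 contour in Hz: a sinusoidal deviation (in cents) around the note.
    t = np.arange(int(dur_s * frame_rate)) / frame_rate
    base_hz = 440.0 * 2.0 ** ((midi - 69) / 12)
    cents = depth_cents * np.sin(2 * np.pi * rate_hz * t)
    return base_hz * 2.0 ** (cents / 1200)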
Putting it all together

• We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones.
• Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
• Pitch conditioning lets us generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously.
Putting it all together

• Our model is still far from perfect, however, and we have to improve it further:
  1. An immediate step is to model the residual of the input audio; this will help the generated audio sound more natural.
  2. We have not taken dynamics into account. A more complete system will capture the spectral envelope's (timbral) dependence on both pitch and loudness dynamics.
  3. We should conduct more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.

• Our contributions:
  1. To the best of our knowledge, no prior work uses a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
References

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description

1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
Network Architecture

• Main hyperparameters:

  1. β - controls the relative weighting between reconstruction and prior enforcement:

     L ∝ MSE + β · KLD

Figure: CVAE reconstruction MSE (on the order of 10⁻⁵) vs. latent space dimensionality, for β = 0.01, 0.1 and 1.

• There is a tradeoff between the two terms; we choose β = 0.1. A sketch of this weighted loss follows.
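In code, the weighted objective is a one-liner on top of the closed-form Gaussian KL term of [Kingma and Welling, 2013]. The following PyTorch sketch shows the form we mean; the variable names are assumptions, not our released code.

import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    # Reconstruction term: MSE on the cepstral coefficients.
    mse = F.mse_loss(x_hat, x, reduction="mean")
    # Prior-enforcement term: KL( N(mu, sigma^2) || N(0, I) ), closed form.
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld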
Network Architecture

• Main hyperparameters:

  2. Dimensionality of the latent space - governs the network's reconstruction ability.

Figure: reconstruction MSE (on the order of 10⁻⁵) vs. latent space dimensionality for the CVAE (β = 0.1) and the AE.

• Steep fall initially, flatter later; we choose a dimensionality of 32.
Network Architecture

• Network size: [91, 91, 32, 91, 91] - 91 is the dimension of the input CCs, 32 the latent space dimensionality.
• Linear fully-connected layers + Leaky ReLU activations.
• ADAM [Kingma and Ba, 2014] with an initial learning rate of 10⁻³.
• Training for 2000 epochs with a batch size of 512.
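Putting the stated sizes together, a minimal PyTorch sketch of the network could look as follows. The exact wiring of the f0 conditioning (concatenation at the encoder and decoder inputs) is our assumption here, not a verbatim copy of the implementation.

import torch
import torch.nn as nn

class CVAE(nn.Module):
    # Layer sizes [91, 91, 32, 91, 91]: 91-dim CCs in/out, 32-dim latent.
    def __init__(self, n_ccs=91, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_ccs + 1, 91), nn.LeakyReLU())
        self.mu = nn.Linear(91, latent)
        self.logvar = nn.Linear(91, latent)
        self.dec = nn.Sequential(nn.Linear(latent + 1, 91), nn.LeakyReLU(),
                                 nn.Linear(91, n_ccs))

    def forward(self, ccs, f0):
        # f0 (assumed shape (batch, 1)) conditions both encoder and decoder.
        h = self.enc(torch.cat([ccs, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # as in the slides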
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
L prop MSE + βKLD
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
1936
Network ArchitectureI Main hyperparameters -
1 β - Controls relative weighting between reconstruction andprior enforcement
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
β = 001β = 01β = 1
Figure CVAE varying β
I Tradeoff between both terms choose β = 01
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below.
Figure: 'Conditional' AE (cAE) - the input [f0, CCs] is autoencoded to [f0', CCs'].
I [Wyse 2018] followed a similar approach, appending the conditional variables to the input of his model.
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0.
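As a sketch of the input construction (our naming; with 91 CCs the appended input is 92-dimensional):

import numpy as np

def cae_input(f0, ccs):
    # Prepend the pitch to the frame's 91 cepstral coefficients;
    # the cAE then autoencodes this whole 92-dimensional vector.
    return np.concatenate(([f0], ccs))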
Reconstruction
Figure: MSE (·10⁻²) vs. target pitch (MIDI 60-72) for the AE, CVAE, and cAE.
I We train the cAE on the octave endpoints and evaluate performance as shown above.
I Appending f0 does not improve performance; it is in fact worse than the plain AE.
Generation
I We are interested in 'synthesizing' audio.
I We test how well the network can generate an instance of a desired pitch on which it has not been trained.
Figure: Sampling from the network - a latent sample z and the conditioning pitch f0 are fed to the decoder.
I We follow the procedure in [Blaauw and Bonada 2016]: a random walk to sample points coherently from the latent space.
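A minimal sketch of this sampling scheme, assuming a trained decoder decode(z, f0) (our naming; the step size is illustrative):

import numpy as np

rng = np.random.default_rng(0)
latent_dim, num_frames, step = 32, 200, 0.05

z = rng.standard_normal(latent_dim)                  # initial point from the prior
envelopes = []
for _ in range(num_frames):
    z = z + step * rng.standard_normal(latent_dim)   # small steps: a coherent trajectory
    # envelopes.append(decode(z, f0))                # decoder conditioned on the desired pitch

Because consecutive latent points stay close, the decoded envelopes evolve smoothly from frame to frame.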
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
I On listening, the generated audio still lacks the soft, noisy sound of violin bowing (audio examples 12, 13, 14).
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15).
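Because the decoder is conditioned on f0 per frame, vibrato can be imposed simply by modulating the pitch contour fed to it; a sketch with assumed frame rate, vibrato rate, and depth values:

import numpy as np

frame_rate = 100                      # analysis frames per second (assumed)
dur, rate, depth = 2.0, 5.5, 0.3      # note length (s), vibrato rate (Hz), depth (semitones)
t = np.arange(int(dur * frame_rate)) / frame_rate
midi = 65 + depth * np.sin(2 * np.pi * rate * t)   # MIDI 65 with sinusoidal vibrato
f0_contour = 440.0 * 2 ** ((midi - 69) / 12)       # MIDI note number to Hz
# Each frame's envelope: decode(z_t, f) for the random-walk z_t and contour value f.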
Putting it all together
I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones.
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously.
Putting it all together
I Our model is still far from perfect, and we plan to improve it further:
1 An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural.
2 We have not taken dynamics into account; a more complete system would capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics.
3 We will conduct more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.
I Our contributions:
1 To the best of our knowledge, we have not come across any prior work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
2 The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR research community.
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al. 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.
[Esling et al. 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.
[Imai 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al. 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al. 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.
[Romani Picas et al. 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al. 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.
References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.
[Sohn et al. 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset
Audio examples description II
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
Network Architecture
I Main hyperparameters:
1 β - controls the relative weighting between reconstruction and prior enforcement
Figure: CVAE with varying β - MSE (·10⁻⁵) vs. latent space dimensionality, for β = 0.01, 0.1, and 1.
I There is a tradeoff between the two terms; we choose β = 0.1.
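For concreteness, a minimal PyTorch sketch of the β-weighted objective (a standard VAE formulation; the authors' exact training code is not shown here):

import torch

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    # Reconstruction error plus beta-weighted KL divergence to the N(0, I) prior.
    recon = torch.mean((x_hat - x) ** 2)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl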
Network Architecture
I Main hyperparameters:
2 Dimensionality of the latent space - governs the network's reconstruction ability
Figure: CVAE (β = 0.1) vs. AE - MSE (·10⁻⁵) vs. latent space dimensionality.
I The error falls steeply at first and flattens out later; we choose dimensionality = 32.
Network Architecture
I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I Adam [Kingma and Ba 2014] with initial lr = 10⁻³
I Training for 2000 epochs with batch size 512
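A minimal sketch of an encoder/decoder with these sizes (our PyTorch rendering; the exact pitch-conditioning wiring is an assumption, not the authors' code):

import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, in_dim=91, hid=91, z_dim=32):
        super().__init__()
        # Encoder: CCs + pitch condition -> hidden -> (mu, logvar)
        self.enc = nn.Sequential(nn.Linear(in_dim + 1, hid), nn.LeakyReLU())
        self.mu = nn.Linear(hid, z_dim)
        self.logvar = nn.Linear(hid, z_dim)
        # Decoder: latent + pitch condition -> hidden -> CCs
        self.dec = nn.Sequential(nn.Linear(z_dim + 1, hid), nn.LeakyReLU(),
                                 nn.Linear(hid, in_dim))

    def forward(self, x, f0):
        h = self.enc(torch.cat([x, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr = 10^-3, as above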
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2036
Network Architecture
I Main hyperparameters -
2 Dimensionality of latent space - networks reconstruction ability
0 10 20 30 40 50 60 70
2
4
6
8middot10minus5
Latent space dimensionality
MSE
CVAEAE
Figure CVAE(β = 01) vs AE
I Steep fall initially flatter later Choose dimensionality = 32
2136
Network Architecture
I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space
dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10minus3
I Training for 2000 epochs with batch size 512
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
30/36
Putting it all together
▶ Our model is still far from perfect, however, and we have to improve it further:
1. An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural.
2. We have not taken dynamics into account; a more complete system will capture spectral-envelope (timbral) dependencies on both pitch and loudness dynamics.
3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.
▶ Our contributions:
1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.
31/36
References I
[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
32/36
References II
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
33/36
References III
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
34/36
References IV
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
35/36
Audio examples description I
1. Input MIDI 63 to the spectral model
2. Spectral model reconstruction (trained on MIDI 63)
3. Spectral model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the parametric model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
36/36
Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
20/36
Network Architecture
▶ Main hyperparameters:
2. Dimensionality of the latent space, which governs the network's reconstruction ability.
Figure: MSE (2–8 × 10⁻⁵) vs. latent space dimensionality (0–70), CVAE (β = 0.1) vs. AE.
▶ The error falls steeply at first and flattens later, so we choose dimensionality = 32.
21/36
Network Architecture
▶ Network size: [91, 91, 32, 91, 91]; 91 is the dimension of the input CCs and 32 is the latent space dimensionality.
▶ Linear fully connected layers + Leaky ReLU activations.
▶ Adam [Kingma and Ba, 2014] with initial learning rate 10⁻³.
▶ Training for 2000 epochs with batch size 512 (a code sketch follows).
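A minimal PyTorch sketch consistent with these settings is below. The conditioning scheme (f0 concatenated to both the encoder input and the latent vector) and the β = 0.1 KL weight are our reading of the earlier slides; treat the exact wiring as an assumption, not the definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """Sketch of the [91, 91, 32, 91, 91] CVAE; pitch f0 is the conditioning variable."""
    def __init__(self, cc_dim=91, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(cc_dim + 1, 91), nn.LeakyReLU())
        self.mu = nn.Linear(91, latent_dim)
        self.logvar = nn.Linear(91, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + 1, 91), nn.LeakyReLU(),
            nn.Linear(91, cc_dim),
        )

    def forward(self, ccs, f0):
        # ccs: (batch, 91), f0: (batch, 1)
        h = self.enc(torch.cat([ccs, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        recon = self.dec(torch.cat([z, f0], dim=-1))
        return recon, mu, logvar

def loss_fn(recon, ccs, mu, logvar, beta=0.1):
    # reconstruction MSE + beta-weighted KL divergence (beta = 0.1 from the plot slide)
    mse = F.mse_loss(recon, ccs)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kl

# training setup matching the slide: Adam with lr = 1e-3
model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```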
22/36
Experiments
▶ Two kinds of experiments demonstrate the network's capabilities:
1. Reconstruction: omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch.
2. Generation: how well the model 'synthesizes' note instances with new, unseen pitches.
23/36
Reconstruction
▶ Two training contexts:
1. T (×) is the target; ✓ marks training instances:
MIDI  T-3  T-2  T-1  T  T+1  T+2  T+3
Kept  ✓    ✓    ✓    ×  ✓    ✓    ✓
2. Octave endpoints:
MIDI  60  61  62  63  64  65  66  67  68  69  70  71
Kept  ✓   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   ✓
▶ In each case we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (sketch below).
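A sketch of this metric, assuming the trained model is a callable mapping a CC frame and its pitch to a reconstructed frame; the array names are illustrative.

```python
import numpy as np

def envelope_mse(model, frames, f0s):
    """Frame-wise spectral-envelope MSE, averaged over all frames of all
    target-pitch instances. frames: (n_frames, 91) CCs; f0s: (n_frames,)."""
    errors = []
    for ccs, f0 in zip(frames, f0s):
        recon = model(ccs, f0)              # reconstructed CC frame
        errors.append(np.mean((recon - ccs) ** 2))
    return float(np.mean(errors))
```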
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
Experiments
I Two kinds of experiments demonstrate the network's capabilities (a sketch of the data split follows):

1 Reconstruction - omit pitch instances during training, and see how well the model reconstructs notes of the omitted target pitch.
2 Generation - see how well the model 'synthesizes' note instances with new, unseen pitches.
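As an illustration, a minimal Python sketch of such a pitch-omission split; the note-record format and the helper name are hypothetical, not our actual data-loading code:

    def split_by_target(notes, target_midi, context="neighbourhood"):
        # notes: list of (midi_pitch, frames) records (hypothetical format).
        # 'neighbourhood': train on T-3..T+3 except the target T itself;
        # 'endpoints': train only on the octave endpoints (MIDI 60 and 71).
        if context == "neighbourhood":
            keep = {target_midi + d for d in range(-3, 4)} - {target_midi}
        else:
            keep = {60, 71}
        train = [n for n in notes if n[0] in keep]
        test = [n for n in notes if n[0] == target_midi]
        return train, test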
Reconstruction
I Two training contexts:

1 Neighbourhood: T is the target pitch (✗ = omitted), ✓ marks the training instances kept:

    MIDI   T-3  T-2  T-1   T   T+1  T+2  T+3
    Kept    ✓    ✓    ✓    ✗    ✓    ✓    ✓

2 Octave endpoints:

    MIDI   60  61  62  63  64  65  66  67  68  69  70  71
    Kept    ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

I In each case, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances, as in the sketch below.
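A minimal NumPy sketch of this evaluation, assuming the envelopes are stored as (n_frames, n_bins) arrays (the array and function names are illustrative):

    import numpy as np

    def envelope_mse(true_envs, recon_envs):
        # Frame-wise MSE between true and reconstructed spectral envelopes
        # of one note instance; both arrays are (n_frames, n_bins).
        return np.mean((true_envs - recon_envs) ** 2)

    def target_pitch_mse(pairs):
        # Average the per-instance MSE over all instances of the target
        # pitch; pairs is a list of (true_envs, recon_envs) tuples.
        return np.mean([envelope_mse(t, r) for t, r in pairs])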
Reconstruction
[Figure: frame-wise spectral envelope MSE (·10^-2, y-axis 1-5) versus target pitch (MIDI 60-72) for the AE and the CVAE.]
I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6-8).
I The CVAE produces better reconstructions, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11).
I Conditioning helps capture the pitch dependency of the spectral envelope more accurately; a sketch of such a conditioned decoder follows.
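A minimal PyTorch sketch of the kind of pitch-conditioned decoder this implies; the layer sizes, the one-hot pitch encoding, and the CC dimensionality are illustrative assumptions, not our exact architecture:

    import torch
    import torch.nn as nn

    class ConditionedDecoder(nn.Module):
        # CVAE decoder: the latent code z is concatenated with a pitch
        # label, so the reconstructed envelope depends explicitly on f0.
        def __init__(self, latent_dim=32, n_pitches=12, n_ccs=60):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim + n_pitches, 256),
                nn.ReLU(),
                nn.Linear(256, n_ccs),  # cepstral coefficients of envelope
            )

        def forward(self, z, pitch_onehot):
            return self.net(torch.cat([z, pitch_onehot], dim=-1))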
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below.
Figure: 'Conditional' AE (cAE) - the appended input (f0, CCs) is encoded and reconstructed as (f0', CCs').
I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model.
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0; a sketch of the input construction follows.
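A minimal sketch of forming the cAE input, assuming cc is one frame's cepstral-coefficient vector and f0 its normalized pitch (names are hypothetical):

    import numpy as np

    def cae_input(cc, f0):
        # Append the (normalized) pitch as one extra dimension; a plain AE
        # is then trained to reconstruct this appended vector, so pitch is
        # only another input feature rather than a decoder-side condition.
        return np.concatenate([cc, [f0]])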
Reconstruction
[Figure: frame-wise spectral envelope MSE (·10^-2, y-axis 1-5) versus target pitch (MIDI 60-72) for the AE, the CVAE, and the cAE.]
I We train the cAE on the octave endpoints and evaluate performance as shown above.
I Appending f0 does not improve performance; it is in fact worse than the plain AE.
Generation
I We are interested in 'synthesizing' audio.
I We test how well the network can generate an instance of a desired pitch on which it has not been trained.
Figure: sampling from the network - a latent sample Z and the desired pitch f0 are fed to the decoder.
I We follow the procedure in [Blaauw and Bonada 2016]: a random walk to sample points coherently from the latent space, sketched below.
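A minimal sketch of one way such a coherent random walk could be realized (the step size and walk length are illustrative assumptions, not the settings of [Blaauw and Bonada 2016]):

    import numpy as np

    def latent_random_walk(latent_dim=32, n_frames=200, step=0.05, seed=0):
        # Start from the prior N(0, I) and take small Gaussian steps, so
        # that consecutive latent points stay close and the decoded
        # envelopes evolve smoothly from frame to frame.
        rng = np.random.default_rng(seed)
        z = rng.standard_normal(latent_dim)
        walk = []
        for _ in range(n_frames):
            z = z + step * rng.standard_normal(latent_dim)
            walk.append(z.copy())
        return np.stack(walk)  # each row is decoded (with f0) into a frame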
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12-14).
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15).
Putting it all together
I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones.
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously.
Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate next step is to model the residual of the input audio; this will make the generated audio sound more natural.
2 We have not taken dynamics into account. A more complete system will capture the dependence of the spectral envelope (timbre) on both pitch and loudness dynamics.
3 We should conduct more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.

I Our contributions:

1 To the best of our knowledge, ours is the first work to use a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
2 The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR research community (a sketch of the core iteration follows).
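For reference, a minimal NumPy sketch of the iterative cepstral smoothing behind the true amplitude envelope (TAE) of [Imai 1979]; the lifter order and iteration count are illustrative assumptions, and this is a sketch rather than our released implementation:

    import numpy as np

    def true_amplitude_envelope(log_mag, order=40, n_iter=50):
        # log_mag: log-magnitude spectrum of one frame (full symmetric FFT).
        # Each iteration low-pass lifters the cepstrum to get a smooth
        # envelope, then lifts the working spectrum to the pointwise max of
        # itself and that envelope, so the result rides on the peaks.
        a = log_mag.astype(float).copy()
        n = len(a)
        env = a
        for _ in range(n_iter):
            cep = np.fft.ifft(a).real
            cep[order + 1:n - order] = 0.0  # keep low-quefrency part only
            env = np.fft.fft(cep).real
            a = np.maximum(a, env)
        return env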
References I
[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.
[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.
[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al. 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.
[Esling et al. 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.
[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.
[Imai 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.
[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al. 2016] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al. 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.
[Romani Picas et al. 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al. 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.
References IV
[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.
[Sohn et al. 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.
[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1 Input MIDI 63 to spectral model
2 Spectral model reconstruction (trained on MIDI 63)
3 Spectral model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to parametric model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset
Audio examples description II
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
2236
Experiments
I Two kinds of experiments to demonstrate networks capabilities
1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch
2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
▶ [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model.
▶ The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch of the idea follows below).
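Here is a minimal sketch of the cAE idea in PyTorch, assuming frame-wise cepstral coefficients (CCs) as input; the layer sizes and dimensionalities are illustrative choices, not the values used in the thesis.

```python
import torch
import torch.nn as nn

class ConditionalAE(nn.Module):
    """'Conditional' AE (cAE): the pitch f0 is appended to the CCs and
    the autoencoder reconstructs the appended vector [CCs, f0]."""

    def __init__(self, num_ccs=60, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_ccs + 1, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, num_ccs + 1),  # predicts [CCs', f0']
        )

    def forward(self, ccs, f0):
        x = torch.cat([ccs, f0], dim=-1)  # append pitch to the input
        return self.decoder(self.encoder(x))

# Hypothetical usage: ccs of shape (batch, 60), f0 of shape (batch, 1).
# model = ConditionalAE()
# recon = model(torch.randn(8, 60), torch.rand(8, 1))
```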
Reconstruction

[Figure: frame-wise MSE (1–5 ×10⁻², y-axis) versus MIDI pitch (60–72, x-axis) for the AE, CVAE, and cAE models]
▶ We train the cAE on the octave endpoints and evaluate its performance as shown above.
▶ Appending f0 does not improve performance; it is in fact worse than the plain AE.
Generation

▶ We are interested in 'synthesizing' audio.
▶ We test how well the network can generate an instance of a desired pitch on which it has not been trained.

[Figure: sampling from the network, with a latent vector z and the conditioning pitch f0 fed to the decoder]

▶ We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space (sketched below).
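A minimal sketch of such a random walk over the latent space, assuming a trained conditional decoder that maps [z, f0] to a spectral envelope; the step size and shapes are illustrative, and the exact procedure of [Blaauw and Bonada, 2016] may differ in detail.

```python
import torch

def random_walk_latents(num_frames, latent_dim, step_size=0.1):
    """Coherent latent trajectory: start from a standard-normal draw
    and take small Gaussian steps so that consecutive frames remain
    correlated rather than independent."""
    z = torch.randn(latent_dim)
    frames = []
    for _ in range(num_frames):
        z = z + step_size * torch.randn(latent_dim)
        frames.append(z.clone())
    return torch.stack(frames)  # (num_frames, latent_dim)

# Hypothetical usage with a trained conditional decoder `decoder`:
# zs = random_walk_latents(num_frames=200, latent_dim=32)
# f0 = torch.full((200, 1), 65.0)  # condition every frame on MIDI 65
# envelopes = decoder(torch.cat([zs, f0], dim=-1))
```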
Generation

▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.
▶ On listening, the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14).
▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15); a sketch of such an f0 contour follows below.
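The vibrato note is driven by a time-varying f0 contour. Here is an illustrative sketch of one, using a sinusoidal modulation whose rate and depth are typical vibrato values assumed for the example, not measured from the dataset.

```python
import numpy as np

def vibrato_f0(midi_pitch, num_frames, hop_seconds=0.005,
               rate_hz=6.0, depth_semitones=0.3):
    """Frame-wise f0 contour (in Hz) with sinusoidal vibrato around
    the target MIDI note."""
    t = np.arange(num_frames) * hop_seconds
    midi = midi_pitch + depth_semitones * np.sin(2 * np.pi * rate_hz * t)
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)  # MIDI -> Hz

# f0_contour = vibrato_f0(65, num_frames=400)  # MIDI 65 with vibrato
```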
Putting it all together

▶ We explored autoencoder frameworks as generative models for the audio synthesis of instrumental tones.
▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies.
▶ Pitch conditioning lets us generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously.
Putting it all together

▶ Our model is still far from perfect, however, and we have to improve it further:
1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural.
2. We have not taken dynamics into account. A more complete system would capture the dependence of the spectral envelope (timbre) on both pitch and loudness dynamics.
3. We should conduct more formal listening tests, involving the synthesis of larger pitch movements or melodic elements from Indian Classical music.

▶ Our contributions:
1. To the best of our knowledge, no prior work has used a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.
2. The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR research community.
References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I

1. Input MIDI 63 to the spectral model
2. Spectral model reconstruction (trained on MIDI 63)
3. Spectral model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the parametric model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset

Audio examples description II

14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2336
Reconstruction
I Two training contexts -
1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X
2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X
I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2436
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAE
I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8
I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11
I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false
1704
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false
1536
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false
1536
2536
Reconstruction
I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below
f0
CCscAE
fprime
0
CCsprime
Figure lsquoConditionalrsquo AE(cAE)
I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model
I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 note to Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset
36/36
Audio examples description II
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato
25/36
Reconstruction
I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below
[Diagram: the pitch f0 and the CCs are concatenated, passed through the cAE, and reconstructed as f0′ and CCs′]
Figure: 'Conditional' AE (cAE)
I [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model
I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
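As a concrete illustration, below is a minimal PyTorch sketch of this 'cAE' baseline; the layer sizes and the number of CCs are assumptions for illustration, not the configuration used in our experiments.

```python
import torch
import torch.nn as nn

N_CCS, LATENT_DIM = 91, 32   # illustrative sizes, not the talk's values

class ConditionalAE(nn.Module):
    """Plain AE over the CC frame with f0 appended; both the CCs and
    the pitch are reconstructed, so the network is merely given the
    chance to exploit f0 (unlike the CVAE, which conditions on it)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(N_CCS + 1, 256), nn.ReLU(),
                                     nn.Linear(256, LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, N_CCS + 1))

    def forward(self, ccs, f0):
        x = torch.cat([ccs, f0], dim=-1)          # append pitch to the CCs
        x_hat = self.decoder(self.encoder(x))
        return x_hat[..., :N_CCS], x_hat[..., N_CCS:]   # CCs', f0'

# One training step on a dummy batch: reconstruct the appended input.
model = ConditionalAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # [Kingma and Ba, 2014]
ccs, f0 = torch.randn(8, N_CCS), torch.rand(8, 1)
ccs_hat, f0_hat = model(ccs, f0)
loss = nn.functional.mse_loss(ccs_hat, ccs) + nn.functional.mse_loss(f0_hat, f0)
opt.zero_grad(); loss.backward(); opt.step()
```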
26/36
Reconstruction
[Plot: frame-wise reconstruction MSE (1–5 × 10^-2) vs. MIDI pitch (60–72) for the AE, CVAE, and cAE]
I We train the cAE on the octave endpoints and evaluate performance as shown above
I Appending f0 does not improve performance; it is in fact worse than the plain AE
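The evaluation behind the plot can be sketched as follows, assuming held-out frames grouped by MIDI pitch and the cAE interface from the earlier sketch:

```python
import torch

def mse_by_pitch(model, frames_by_pitch):
    """Frame-wise reconstruction MSE for each MIDI pitch (60-72).
    `frames_by_pitch` maps a MIDI number to held-out (ccs, f0) tensors;
    this data layout is an illustrative assumption."""
    scores = {}
    with torch.no_grad():
        for midi, (ccs, f0) in sorted(frames_by_pitch.items()):
            ccs_hat, _ = model(ccs, f0)
            scores[midi] = torch.mean((ccs_hat - ccs) ** 2).item()
    return scores   # plot scores vs. MIDI pitch to reproduce the curve
```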
27/36
Generation
I We are interested in 'synthesizing' audio
I We examine how well the network can generate an instance of a desired pitch (one the network has not been trained on)
[Diagram: a latent point z and the conditioning pitch f0 are fed to the decoder]
Figure: Sampling from the network
I We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space
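A sketch of that sampling procedure, under the assumption of a Gaussian random walk with a small per-frame step (the step size here is our guess, not a value from the paper):

```python
import torch

def random_walk_latents(n_frames, latent_dim=32, step=0.05, seed=0):
    """Coherent latent trajectory: successive points differ by a small
    Gaussian step, so the decoded envelopes evolve smoothly in time
    instead of jumping between independent samples."""
    g = torch.Generator().manual_seed(seed)
    z = torch.empty(n_frames, latent_dim)
    z[0] = torch.randn(latent_dim, generator=g)
    for t in range(1, n_frames):
        z[t] = z[t - 1] + step * torch.randn(latent_dim, generator=g)
    return z

# Each z[t], together with the desired (unseen) pitch as the condition,
# is passed to the trained CVAE decoder to produce one frame of CCs.
```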
28/36
Generation
I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, and 14)
I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15); see the sketch below
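Since the conditioning pitch can be varied frame by frame, vibrato amounts to decoding along a sinusoidally modulated f0 contour. A small numpy sketch (the rate, depth, and frame rate are assumed values):

```python
import numpy as np

def vibrato_f0(midi=65, rate_hz=6.0, depth_cents=30.0, dur_s=1.0, frame_rate=172):
    """Frame-wise f0 contour (Hz) with sinusoidal vibrato."""
    t = np.arange(int(dur_s * frame_rate)) / frame_rate
    f0 = 440.0 * 2.0 ** ((midi - 69) / 12.0)        # MIDI note -> Hz
    cents = depth_cents * np.sin(2 * np.pi * rate_hz * t)
    return f0 * 2.0 ** (cents / 1200.0)             # modulated contour

# Decoding one frame per contour value (with the latent random walk
# above) yields the vibrato note used in audio example 15.
```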
29/36
Putting it all together
I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones
I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, enabling us to vary the pitch contour continuously
30/36
Putting it all together
I Our model is still far from perfect, however, and we have to improve it further:
1 An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural
2 We have not taken dynamics into account; a more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3 We should conduct more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music
I Our contributions:
1 To the best of our knowledge, no prior work uses a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2 The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR community; a sketch of the TAE iteration is given below
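For completeness, here is a minimal numpy sketch of the TAE ('true envelope') iteration of [Imai, 1979] and [Roebel and Rodet, 2005]: the cepstrally smoothed envelope is repeatedly lifted so that it sits above the harmonic peaks. The pitch-dependent cepstral cutoff and the fixed iteration count are common choices, assumed here rather than taken from our implementation.

```python
import numpy as np

def true_amplitude_envelope(mag_spectrum, f0, sr, n_iters=50):
    """TAE sketch: iteratively cepstrally smooth max(spectrum, envelope)
    until the envelope lies above the harmonic peaks.

    mag_spectrum: magnitude of an rfft half-spectrum (length n_fft//2 + 1)
    f0, sr: fundamental and sample rate in Hz
    """
    log_a = np.log(np.maximum(mag_spectrum, 1e-10))
    order = max(1, int(sr / (2 * f0)))    # pitch-dependent cepstral cutoff
    env = np.full_like(log_a, log_a.min())
    for _ in range(n_iters):
        target = np.maximum(log_a, env)   # lift the envelope above the peaks
        ceps = np.fft.irfft(target)       # real cepstrum of the target
        ceps[order:-order] = 0.0          # keep only the low quefrencies
        env = np.fft.rfft(ceps).real      # smoothed log envelope
    return np.exp(env)
```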
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2636
Reconstruction
60 62 64 66 68 70 72
1
2
3
4
5middot10minus2
Pitch
MSE
AECVAEcAE
I We train the cAE on the octave endpoints and evaluateperformance as shown above
I Appending f0 does not improve performance It is infact worsethan AE
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2736
Generation
I We are interested in lsquosynthesizingrsquo audio
I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)
Z
Dec
od
er
f0
Figure Sampling from the Network
I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2836
Generation
I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65
I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing
1 12 2 13 3 14
I For a more realistic synthesis we generate a violin note withvibrato 4 15
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false
2256
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false
2184
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false
2184
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
2936
Putting it all together
I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones
I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies
I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
References I
[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.
[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.
[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.
[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.
References II
[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[IMAI, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
References III
[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.
[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.
[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.
[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.
[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.
References IV
[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.
[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.
[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.
Audio examples description I
1. Input MIDI 63 note to the Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the Parametric Model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
Audio examples description II
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
3036
Putting it all togetherI Our model is still far from perfect however and we have to
improve it further
1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural
2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music
I Our contributions
1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE
2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
3136
References I
[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774
[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE
[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908
[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org
[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
3236
References II
[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243
[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507
[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228
[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980
[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114
[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
3336
References III
[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096
[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b
[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society
[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC
[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
3436
References IV
[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141
[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models
In Advances in neural information processing systems pages 3483ndash3491
[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
3536
Audio examples description I
1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction(trained on MIDI63)
3 Spectral Model Reconstruction(not trained on MIDI63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note(endpoint trained model)
10 Parametric AE reconstruction of input(endpoint trained model)
11 Parametric CVAE reconstruction of input(endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato
3636
Audio examples description II
14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato