SpeechSeparationUsingConvolutionalNeuralNetworkand ...downloads.hindawi.com/journals/ddns/2020/2196893.pdfcharacteristics of the convolutional neural network and attention mechanism,

Research ArticleSpeech Separation Using Convolutional Neural Network andAttention Mechanism

Chun-Miao Yuan Xue-Mei Sun and Hu Zhao

School of Computer Science and Technology TianGong University Tianjin 300387 China

Correspondence should be addressed to Xue-Mei Sun sunxuemeitiangongeducn

Received 14 May 2020 Accepted 8 July 2020 Published 25 July 2020

Academic Editor Jianquan Lu

Copyright copy 2020 Chun-Miao Yuan et al +is is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

Speech information is the most important means of human communication and it is crucial to separate the target voice from themixed sound signals +is paper proposes a speech separation model based on convolutional neural networks and attentionmechanism +e magnitude spectrum of the mixed speech signals as the input has its high dimensionality By analyzing thecharacteristics of the convolutional neural network and attention mechanism it can be found that the convolutional neuralnetwork can effectively extract low-dimensional features and mine the spatiotemporal structure information in the speech signalsand the attention mechanism can reduce the loss of sequence information +e accuracy of speech separation can be improvedeffectively by combining two mechanisms Compared to the typical speech separation model DRNN-2 + discrim this methodachieves 027 dB GNSDR gain and 051 dB GSIR gain which illustrates that the speech separation model proposed in this paperhas achieved an ideal separation effect

1 Introduction

Voice information plays an increasingly important role inour lives and voice communication becomes more andmore frequent such as using chatting software to send voicemessages using voice to control mobile phone applicationsmaking mobile phone calls for voice communication rec-ognizing the singers from songs [1] and identifying singerrsquosinformation lyrics and song style [2 3] +e goal of speechseparation is to separate the mixed speech into two originalspeech signals In signal processing speech separation is abasic task with a wide range of applications includingmobile communication speaker recognition and songseparation +ere are many potential values for separatingmixed speech Nowadays speech separation plays a moreandmore important role in speech processing andmore andmore devices need to carry out speech separation task

Although humans can easily perform speech separationit is very challenging to build an automatic system thatmatches the human auditory system +erefore speechseparation has always been an important research direction

of speech processing +e early speech separation method isvery limited in the ability of mining nonlinear structureinformation and the performance of monaural speechseparation has been unsatisfactory With the development ofdeep neural networks in recent years [4ndash6] good results havebeen achieved in various fields Compared with traditionalspeech methods there are many advantages for deep neuralnetwork-based speech separation models +e main con-tribution of this paper is to apply the convolution neuralnetwork to the speech separation tasks use the multilayernonlinear processing structure of the convolution neuralnetwork to mine the structure information in the speechsignals automatically extract the abstract features integratethe attention mechanism to reduce the loss of the sequenceinformation and finally achieve the monaural speechseparation

2 Relative Research

Speech separation problem has been widely studiedworldwide +e traditional methods of monaural speech

HindawiDiscrete Dynamics in Nature and SocietyVolume 2020 Article ID 2196893 10 pageshttpsdoiorg10115520202196893

separation are speech enhancement which analyze the dataof the mixed speech and the noise and then estimate thetarget speech through the interference estimation of themixed speech From the viewpoint of signal processingmany speech enhancement methods propose a powerspectrum of estimated noise or an ideal Wiener wave re-corder such as spectral subtraction [7] andWiener filter [8]+ere are many other ways to achieve speech enhancement(eg through Bayesian model [9] or time-frequencymasking methods [10]) Some researchers tried to usecomputer auditory scene analysis (CASA) to achieve speechseparation which is the use of computer technology to allowcomputers to imitate the processing of human auditorysignals for modeling so as to have the ability to perceivesound process sound and interpret sound from complexmixed sound sources such as humans +e basic compu-tational goal of auditory scene analysis is to estimate an idealbinary mask to achieve speech separation based on auditorymasking of the human ears [11ndash13] Speech enhancementtechnology generally assumes that the interference speech isstable so its separation performance will be severely de-graded when the actual interference noise is nonstationaryComputational auditory scene analysis with better gener-alization performance has no assumptions about noiseMeanwhile the computational auditory scene analysisheavily relies on speech pitch detection which is very dif-ficult under the interference of background sounds

Speech separation is designed to isolate useful signalsfrom disturbed speech signals a process that can beexpressed naturally as a supervised learning problem [14] Atypical supervised speech separation system learns a map-ping function from noisy features to separation targets (egideal masking or magnitude spectrum of speech interested)through supervised learning algorithms such as deep neuralnetworks Many supervised speech separation methods havebeen proposed in recent years [15ndash18] +e learning modelsfor supervised speech separation are mainly divided into twokinds (1) Traditional methods such asmodel-basedmethodsand speech enhancement methods (2) Newer methodsusing DNNs (Deep Neural Networks) Due to the speechgeneration mechanism the input features and output targetsof the speech separation show obvious spatiotemporalstructure +ese characteristics are very suitable for mod-eling with deep models Many deep models are widely usedin speech separation Sun et al [19] proposed two-stageapproach with two DNN-based methods to address theproblem of the performance of current speech separationmethods New training targets in complement of existingmagnitude training targets were trained through neuralnetwork methods to compensate for phase of target in orderto achieve better separation performance by the authors of[20] Zhou et al [21] designed a separation system on Re-current Neural Network (RNN) with long short-termmemory (LSTM) which effectively learns the temporaldynamics of spatial features Supervised speech separationdoes not require spatial orientation information of the soundsources and there are no restrictions on the statisticalcharacteristics of noise It shows obvious advantages and afairly bright research prospective under the conditions of

monaural nonstationary noise and low signal-to-noise ratio[22 23]

Deep Recurrent Neural Network (DRNN) among themis a representative of deep models and has been widely usedin speech separation DRNN has strong learning ability inspeech separation RNN series of units such as LSTM [24]GRU (Gated Recurrent Unit GRU) [25] all of whose hiddenstates are calculated according to the Markov model +eprevious hidden state will save some previous informationwhile the magnitude spectrum of the mixed speech is arelatively long sequence and sequence information will belost (after all the capacity of the cell state is limited inpractice) which will affect the separation of mixed speechand reduce the accuracy of estimated speech

Convolutional neural network (CNN) has been widelyused in deep learning since it was firstly proposed by YannLecun et al [26] in 1998 CNN has its natural advantages intwo-dimensional signal processing and its powerful mod-eling capabilities have been verified in tasks such as imagerecognition CNN at present has been applied to speechseparation [27 28] and has achieved the best separationperformance under the same conditions which has exceededthe DNN-based speech separation systems Luo et al [29]proposed Conv-TasNet which was a fully convolutionaltime-domain audio separation network to solve the prob-lems of time-frequency masking Wang et al [30] appliedCNN and modified its loss function to solve the imbalancebetween classification accuracy hit rate and error rate

Both the traditional and DNN-based separation modelshave achieved good results but they all have correspondingshortcomings Convolutional neural network can make useof the spatial connection of input data so that each elementcan learn local features without learning global features andthen at a higher-level these local features will be combinedtogether to get global features Weight sharing can reducethe parameters between different neurons reduce the cal-culation time and improve the speed of the model By usinga variety of convolution filters multiple feature maps can beobtained which can detect the same type of features indifferent positions and ensure the invariance of displace-ment and deformation to a certain extent +erefore thispaper proposes a method based on convolutional neuralnetwork to solve the problem of the loss of long sequenceinformation of mixed speech Combined with the attentionmechanism our model can better focus on the timing se-quence step which contributes the most and solve theproblem of insufficient memory of the temporal model to acertain extent so as to improve the speech separation effect

3 System Modeling

+e purpose of the speech separation is to separate the targetvoice from the background sound +e overall structure ofthe speech separation system is shown in Figure 1 +e firstprocess of speech separation is to obtain the magnitudespectrum and phase information of mixed speech throughSTFT (Short-time Fourier Transform) [31] +e estimatedtargets using speech magnitude spectrum have been shownto suppress noise significantly and improve the speech

2 Discrete Dynamics in Nature and Society

intelligibility and its perceived quality +en the magnitudespectrum information is used as the input of the speechseparation model +e magnitude spectrum is trained byconvolutional neural network and the region of interest ofspeech is extracted by the attention module Finally throughthe overlap-add method [32] the target speech is obtainedby the combination of the magnitude spectrum informationestimated by the separation model and the previous phaseinformation +e speech separation ability of the model isevaluated based on the comparison between the estimatedspeech and the pure speech

4 Proposed Methods

Deep models with multilevel nonlinear structures good atmining structural information in data which can automaticallyextract abstract feature representations have been widely usedin image video and speech processing +is paper presents amonaural speech separation model based on the convolutionalneural network and attention mechanism we name it CASSM+ere are two main modules in this model convolutionalneural network module and attention mechanism both ofwhich are jointly optimized to complete the task of speechseparation In the convolutional neural network module it canextract features from the magnitude spectrum of the mixedspeech reduce the dimension of the magnitude spectrum andmodel the current and contextual information effectivelyAlthough convolutional neural networks can extract themagnitude spectrum features of mixed speech signals effec-tively they cannot model sequence information in the sameway +e attention mechanism RNN is good at processingsequence information however there are shortcomings inextracting information features versus convolutional neuralnetworks After the training of neural network it can recognizethe difference of different signal sources at different frame levelseffectively the attention mechanism-based neural network canrecognize the importance of each part of the magnitudespectrum with the help of context information [33] thusconsiderably improving the effect of speech separation

41 Network Structure +e network structure of the modelis shown in Figure 2 +e magnitude spectrum of the mixedspeech is used as the input of the neural network the

estimated magnitude spectrum is obtained by training theneural network and the original phase is combined to get theestimated speech by ISTFT (inverse short-time Fouriertransform)

+e amplitude spectrum of mixed speech has high di-mensionality when it acts as the input of neural networkCNN is good at mining the spatiotemporal structure in-formation of the input signal and can extract the local featureinformation of the magnitude spectrum which can improvethe performance of speech separation

+e magnitude spectrum of the mixed speech is used asthe network input and then forward propagates along twopaths one of which is used as the input of the convolutionallayer while the other forms the combination sequence of theinput layer with the low-dimensional sequence that has beenprocessed by the convolution +e input informationtransmits through two convolutional filters with differentsizes and important information about the mixed speechmagnitude spectrum is able to be extracted

We use relu as an activation function+e output of the twoconvolutions is superimposed and processed by the maximumpooling layer which provides strong robustness increases theinvariance of the current information speeds up the calculationspeed and prevents overfitting +e sequence continuesthrough several convolutional layers that can be effectivelymodeled for current as well as contextual information to obtaina low-dimensional embedding sequence +e low-dimensionaland high-dimensional embedded sequences are superimposedinto a combined one which passes through two-tier high-speednetworks to extract higher-level features

Still there are many shortcomings in processing seri-alized information for speech separation although con-volutional neural networks can extract features from themagnitude spectrum of mixed speech reduce the dimensionof the magnitude spectrum and model the current andcontextual information effectively Comparatively the at-tention mechanism RNN is much skilled in processing se-rialized information helpful in strengthening thedependence between the magnitude spectrum and canjointly fulfill the task of speech separation with the con-volutional neural network module

+e attention mechanism is used for the input of themixed speech magnitude spectrum so that the model paysdifferent attention to the information features from different

Mixturesignal

Evaluation

STFT

ISTFT

CNNlayer

Attentionlayer

Magnitudespectra

Phasespectra

Estimatedmagnitude

spectra

Figure 1 +e modeling of speech separation

Discrete Dynamics in Nature and Society 3

moments +e traditional recurrent neural network usesonly the information of state [tminus1] when calculating the state[t] at the t moment Since the magnitude spectrum is a longsequence there will be a loss of information although theprevious state [tminus1] contains some previous information+e attention mechanism uses the previous information tocalculate the state information at the current moment +eattention mechanism used in this paper is based on a layer ofRNN (LSTM unit) with an attention length of 32

Long Short-TermMemory (LSTM) is a specific recurrentneural network (RNN) architecture that was designed tomodel temporal sequences and their long-range

dependencies more accurately than conventional RNNsLSTM RNNs are more effective than DNNs and conven-tional RNNs for acoustic modeling considering moderatelysized models trained on a single machine [34] LSTM RNNmakes more effective use of model parameters than theothers considered converges quickly and outperforms adeep feedforward neural network having an order ofmagnitude more parameters

A bidirectional recurrent neural network and twofeedforward layers are used to form a separation moduleafter the attention mechanism RNN

42 Short-Term Fourier Transform +e spectrum of speechsignals has its significant role in speech through whichsome nontrivial speech features can be obtained Stationarysignals refer to the signals that the distribution law does notchange over time while nonstationary signals are signalsthat change over time +e mixed speech signals belong tothe nonstationary signals and the speech signals have thecharacteristics of nonstationary one

+e Fourier transform (FT) is ubiquitous in science andengineering For example it finds application in the solutionof equations for the flow of heat for the diffraction ofelectromagnetic radiation and for the analysis of electricalcircuits +e concept of the FT lies at the core of modernelectrical engineering and is a unifying concept that connectsseemingly different fields+e STFT is a fine resolution to theproblem of determining the frequency spectrum of signalswith time-varying frequency spectra Fourier transform isthe process of conversion between time domain and fre-quency +e Fourier transform is in its essence a reversibletransform that can transform the original signals and thetransformed signals into each other +e characteristics ofthe Fourier transform are not suitable for analyzing non-stationary signals which cannot directly represent speechsignals +e speech signals change slowly with the change oftime which even can be regarded as unchanged in a shortperiod of time Analysis of the frequency domain infor-mation in a short time is the short-time Fourier transform

Short-time Fourier transform (STFT) is a commonlyused tool for processing speech signals +ere is only onedifference between the Fourier transform and the short-timeFourier transform +e short-time Fourier transform is todivide the speech signals into sufficiently small fragments sothat the speech signals can be seen as a stable one +e short-time Fourier transform formula is defined as

STFT(tω) 1113946infin

minusinfins(x)c(x minus t)e

minus jωxdx (1)

In which c(t) is the length M window function s(x) de-notes an input signal at time x to be transformed and STFT (tω) is a two-dimensional complex function which representsthe magnitude and phase that change over time and frequency

In supervised speech separation feature extraction is anindispensable process and the selection of features will affect thespeech separation model training From the point of theextracted basic units the features of speech separation aremainlydivided into time-frequency unit-level features and frame-level

Attention mechanismRNN

Bi-directionalRNN

Estimated magnitude spectra

Magnitude spectra

Max pooling

Convolutionallayer

Convolutionallayer

Normalization

High speed network

Convolutionallayer

Convolutionallayer

Input layer

Figure 2 Network architecture of proposed CASSM


ones Frame-level features are extracted from a frame of signals+is level of features has a larger granularity and can capture thespatiotemporal structure of speech especially the frequencyband correlation of speech It is more general and holistic andhas clear characteristics of speech perception

+e feature extraction in this paper is based on the framelevel +e short-time Fourier transform is used to frame andwindow the speech signals In each frame the magnitudespectrum and phase in this paper are obtained by 1024-pointSTFT +e speech separation model can better display theinformation of speech by using frame-level features and canmine the space-time structure of speech in all directionsduring training +e magnitude spectrum obtained by usingthe short-time Fourier transform in this paper can makemodel training easier and accelerate the model convergence

43 Attention Mechanisms In recent years the attentionmechanism has been widely used by many scholars in deeplearning models in different fields such as speech recognitionand image recognition Due to the limitations of informationprocessing whenmuch information is processed people usuallychoose some of the important parts for processing while ig-noring other parts of the information +is idea is the origin ofthe attention mechanism When people are reading they willpaymore attention to and deal with some important vocabularyand ignore some less important parts which will speed up thereading and improve reading efficiency Attention mechanismnow has become a widely used tool thus attention mechanismis one of the crucial important algorithms in deep learningalgorithms that need the most attention and understandingHowever attention mechanism is currently not widely used inspeech separation therefore this paper attempts to introduceattention mechanism into the process of speech separationfocusing on the region of interest in speech feature informationso as to improve the accuracy of speech separation

Strictly speaking attention mechanism is an idea ratherthan an implementation of some models which can beimplemented in a variety of ways Due to the bottleneck ofinformation processing attention mechanism requires todecide which part may be the focus allocate the limitedinformation processing resources to more important partsand specially focus on the key information

+ere are many variants of the attention mechanism andthe attention mechanism used in this paper is AttentionCell-Wrapper in TensorFlow [35] which is a general attentionmechanism +is attention structure can be used through aone-way recurrent neural network While processing the inputof each step it will consider the output of the previous N stepsand add the previous historical information to the prediction ofthe current input through a mapping weighting method

Uti V

Ttan h Wαht + Wβoi1113872 1113873

ati softmax U

ti1113872 1113873

htprime 1113944

tminus1

i1a

tioi

(2)

In which the vector V and matrix Wα Wβ are thelearnable parameters of the model in the formula ht is thematrix of the current hidden-layer state of the LSTM and theLSTM unit connection and oi is the output of the ith LSTMunit +e attention weights at

i of each moment are nor-malized calculated by the softmax function Finally con-necting ht with ht

prime becomes the predicted new hidden stateand is also fed back to the next step of the model

It becomes possible for the simple timing model to usethe attention mechanism by using this general attentionstructure designed by TensorFlow which can make theentire model more focus on the timing step that contributesthe most and solve the timing model with memory prob-lems to some extent

44 Time-Frequency Masking Calculation Layer Consideringthe constraints of TF (time-frequency) masking between theforced input mixed speech signals and the estimated speechsignals as well as the benefits of smooth separation a time-frequency masking calculation layer is used in the speechseparation model to jointly optimize the TF model with theentire deep learning model +e time-frequency maskingcalculation layer is defined as

1113957y1 1113954y1

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

1113957y2 1113954y2

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

(3)

where 1113957y1 and 1113957y2 are the magnitude spectrums of the es-timated speech signals 1113954y1 and 1113954y2 are the output magnitudespectrums of the two feedforward networks of the speechseparation module and z is the magnitude spectrum of theinput mixed speech signals

+e masking effect is a phenomenon that occurs whenthe human ears perceive speech A louder sound will mask alower one If the difference between the two sound fre-quencies is small the effect of the masking effect becomesmore obvious +e masking effect is of great significance inspeech processing

45 Loss Function +e loss function is used to describe thedifference between the predicted value and the target value+e loss function is generally positive and the size of the lossfunction reflects the quality of the model When the modelperforms speech separation the value of the loss function isrelatively small if the model speech separation works well+e size of the loss function directly reflects the speechseparation effect of the model +e loss function of thetraining network model in this paper is defined as [36]

JDIS y1 minus 1113957y21

+ y2 minus 1113957y

22

minus c y1 minus 1113957y21

minus c y2 minus 1113957y

22

cgt 0

(4)

where y1 and y2 are the ground truth of magnitude spec-trum 1113957y1 and 1113957y2 are the estimated magnitude spectrum ofspeech and 0le cle 1 is a regularization parameter


5 Simulation Experiments and Analysis

In this paper experiments were performed on aWindows 10Professional 64-bit computer with hardware configurationof i5 8GB RAM and GT1060X 6G graphics card +eprogram was written in PyCharm software using Python

+e MIR-1K dataset [37] was used to evaluate the model+e dataset has 1000 mixed music clips encoded at a 16KHzsampling rate with a duration from 4 to 13 seconds +e clipswere extracted from 110 Chinese karaoke songs performed byboth male and female amateurs +e singing voice andbackground music were mixed into a music fragment with asignal-to-noise ratio of 0 dB In this dataset 175 clips were usedfor the training sets and 825 were used for the testing sets

Speech separation usually uses three parameters of BSS-EVAL [38] to verify the performance of the model SDR(Signal to Distortion Ratio) SIR (Signal to InterferenceRatio) and SAR (Signal to Artifact Ratio) SDR is the ratiobetween the power of mixed speech signals and the dif-ference value of the mixed speech signals and the targetspeech signals SIR is the total power ratio between the targetspeech signals and the interference signals SAR stands forartifacts introduced by processing speech signals SDRcalculates how much total distortion exists in the separatedsound A higher SDR value indicates a smaller overalldistortion of the speech separation system SIR directlycompares the degree of speech separation between nontargetsound and target sound A higher SAR value indicates asmaller introduced error of the speech separation system Ahigher value of the three parameters indicates a better effectof the speech separation model

In this experiment in order to evaluate the effect of thespeech separation model the global NSDR (GNSDR) globalSIR (GSIR) and global SAR (GSAR) were used to evaluatethe overall effect all of which were the mean value of thetested fragments based on their lengths calculated +enormalized SDR (NSDR) [39] is defined as

NSDR(1113954o o m) SDR(1113954o o) minus SDR(m o) (5)

In which o is the original pure signal 1113954o is the estimatedspeech signal and m is the mixed speech signals NSDR isused to estimate the SDR improvement of mixed speechsignal m and speech signal 1113954o

We conducted multiple comparative experiments tocomprehensively evaluate the validity and reliability of theproposed model Firstly we tested the influence of LSTMand GRU gating units on the effect of speech separationSecondly we verified the improvement of the separationeffect of the traditional speech separation models by theattention mechanism +en we compared the separationeffect between DRNN-2 + discrim proposed by Huang et aland CASSM proposed in this paper Finally we explored theeffect of attention length on the effect of speech separation

51 2e Influence of LSTM and GRU on Speech SeparationModel +e recurrent neural network has the function ofshort-term memory but if the training sequence is longenough it is difficult for the network to transmit the more

important information from the previous to the later pro-cess +erefore if the speech problem is processed the re-current neural network will miss some importantinformation resulting in a reduction in processing powerand a negative impact on speech processing results Anotherproblem is that in the later backpropagation phase therecurrent neural network will encounter the problem ofgradient disappearance which is the weight value ofupdating each layer in the network Gradient disappearancemeans that the gradient value will continue to decrease as thetraining proceeds and when the gradient drops to a verysmall amount the neural network will not continue to learnand the training process will stop which may affect thetraining results

In order to solve these problems two methods LSTMunit (Long Short-Term Memory unit) [24] and GRU (GatedRecurrent Unit) [25] have been proposed through the ef-forts of many researchers Both types of units have internalmechanisms gates which can regulate the flow of infor-mation to achieve the function of memory +ese two gatingunits have different characteristics In our experiments theeffect from the two types of gating units LSTM and GRU onthe results of the speech separation model was evaluated

Firstly the effect on the separation model was reflectedby using the different gating units of the three-layer uni-directional recurrent neural network layer RSSM (RNN-based Speech Separation Model) +e speech separationresults of each model were described by three parametersGNSDR GSIR and GSAR +e experimental results of thetwo gating units are shown in Figure 3

+e effect of different gating units was analyzed from theexperimental results in the Figure 3 +e speech separationmodel using LSTM gating units obtained a gain of 007 dB onthe GNSDR a gain of 039 dB on the GSIR and a gain ofminus009 dB on the GSAR relative to the GRU

+e ASSM (Attention-based Speech Separation Model)was mainly used to analyze the effect of attention mechanismon speech separation and to verify the effectiveness of attentionmechanism in speech separation Next the ASSM was used toevaluate the effect from the two gated units LSTM and GRUon the results of the speech separation model in the attentionmechanism+eASSMwasmainly used to analyze the effect ofattention mechanism on speech separation and verify the ef-fectiveness of attentionmechanism in speech separation ASSMwas composed of fully connected layer attention mechanismRNN and bidirectional RNN

+e experimental results in Figure 4 showed three in-dicators of speech separation In the experiment the ASSMusing the LSTM unit was 005 dB higher in GNSDR and037 dB in GSIR than that of using GRU unit From theexperimental results of the two groups it can be concludedthat LSTM is superior to GRU on GNSDR and GSIR+erefore LSTM was employed in our proposed model interms of the attention mechanism

52 CASSM Performance Verification Experiment +is ex-periment mainly aimed to verify the effect of the speechseparation model proposed in this paper Since the results of


the speech separation model based on recurrent neuralnetwork were not very satisfactory inspired by attentionmechanism and convolutional neural network this paperproposed a speech separation model based on attentionmechanism and convolutional neural network CASSM andthe effectiveness of the model would be verified through thefollowing experiments

Huang et al [36] applied the recurrent neural network tospeech separation and proposed a speech separation variantmodel based on deep recurrent neural network +e speechseparation effect varies from different models employed Inprevious research masking of monochannel signal separa-tion and the joint optimization of deep recurrent neuralnetwork have achieved better speech separation perfor-mance In our experiment a comparison was made amongthese models CASSM the DRNN-2 + discrim model with

the best separation effect achieved by Huang et al [36]RNMF (Robust Low-Rank Nonnegative Matrix Factoriza-tion) and MLRR (Multiple Low-Rank Representation)

+e attention mechanism used was LSTM unit with anattention length of 32+e experimental results are shown inFigure 5 Compared with MLRR and RNMF the value ofCASSM was 387 dB and 275 dB higher in GNSDR 796 dBand 593 dB higher in GSIR and minus106 dB and -039 dB lowerin GSAR Experiments demonstrated that convolutionalneural network plus attention mechanism had greater ad-vantages over traditional speech separation models Con-volutional neural network can more fully extract soundspectrum features Attention mechanism can strengthen thedependence between magnitude spectrum features andimprove the speech separation performance Compared withDRNN-2 + discrim the improvement of CASSM was027 dB higher in GNSDR and 051 dB higher in GSIRExperiments illustrated that there was still a gap betweenDRNN-2 + discrim and CASSM in terms of processingamplitude spectra In DRNN-2 + discrim the originalmagnitude spectrum of the mixed speech was directly usedas the input features while the CNNmodule of CASSM usedthe combined sequence formed by the high-dimensionalmagnitude spectrum of the mixed speech as the inputfeatures Meanwhile a better speech separation effect in theexperiment was achieved since the attention mechanismmodule of CASSM had reduced the loss of sequenceinformation

+e ldquodiscrimrdquo denotes the models with discriminativetraining

53 Ablation Experiment In this section we conductedablation experiments to verify the effectiveness of ourproposed model Firstly we removed the convolutionalneural network module in our model and observed thespeech separation results of only attention module then wecontinued to remove the attention module and observed thespeech separation result of the model with only recurrentneural network

In our experiment the attention mechanism used wasLSTM unit with an attention length of 32 and RNN was aunidirectional RNN with 1000 hidden layers +e effect ofeach model was analyzed from the experimental resultsAfter removing the CNN module the GNSDR GSIR andGSAR of the model decreased by 037 dB 066 dB and018 dB respectively while the attention module was re-moved the GNSDR GSIR and GSAR of the model declinedby 057 dB 072 dB and 042 DB respectively On the wholethe speech separation performance was substantiallydecreased

+e results of ablation experiments show that afterremoving CNN and attention mechanism respectively thespeech separation ability of the model has decreased tovarying degrees (Figure 6)

54 Effect of Attention Length on Speech Separation +issimulation experiment explored the effect of attention lengthon speech separation through a speech separation model

73 735

1268 1305

953 942

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR

Figure 4 +e comparison for the effect of GRU and LSTM inASSM

708 715

12481287

931 922

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR

Figure 3 +e comparison for the effect of GRU and LSTM inRSSM


based on convolutional neural network and attentionmechanism In this experiment the attention mechanismused in this paper was based on a one-way RNN (hiddenlayers are 1000 layers LSTM units) the length of attentionwas 8 16 and 32 respectively

+e experimental results in Figure 7 showed that thespeech separation model with attention length of 16 was009 dB higher in GNSDR 02 dB in GSIR and 003 dBhigher in GSAR than the speech separation model withattention length of 8 Compared with the speech sepa-ration model with attention length of 16 the speechseparation model with attention length of 32 was 006 dBhigher in GNSDR 037 dB higher in GSIR and 01 dBlower in GSAR

+e experimental results indicated that the attentionlength of the attention mechanism had taken a relativelylarge effect on speech separation +e effect of speech

separation is getting better and better with the increase ofattention length It can be concluded from the simulationexperiment that the increase of attention length couldimprove the effect of speech separation

+e simulation results also illustrated that the trainingtime would increase with the increase of attention lengthand the demand for video memory would relatively in-crease therefore a reasonable choice of attention length ishighly recommended +ere are still some limitations andshortcomings in the experiment Due to the limitation ofequipment the attention length of the speech separationmodel could only be adjusted to 32 If it continues toincrease the device may report an insufficient memory+erefore research and discussion on this indicator infuture experiments can be carried out

6 Conclusion

+is paper proposed and implemented a speech separationmodel based on convolutional neural network and attentionmechanism Convolutional neural network can effectivelyextract low-dimensional features and mine the spatiotem-poral structure information in speech signals Attentionmechanism can reduce the loss of sequence information+eaccuracy of speech separation can be effectively improved bycombining two mechanisms +e simulation experimentsillustrated that the model had greater advantages over thespeech separation model based on recurrent neural networkand the effect of speech separation can be improved throughjoint optimization of convolutional neural network andattention mechanism

Data Availability

All data have been included in this article

Conflicts of Interest

+e authors declare that they have no conflicts of interest

772 735 715

1359 1293 1287

964 946 922

0

2

4

6

8

10

12

14

16

CASSM Without CNN WithoutCNN + Attention

dB

GNSDR

GSIR

GSAR

Figure 6 +e comparison for the effect of ablation experiments

757 766 772

1302 1322 1359

971 974 964

0

2

4

6

8

10

12

14

16

Length = 8 Length = 16 Length = 32

dB

Attention length

GNSDR

GSIR

GSAR

Figure 7 +e comparison for the effect of different attentionlengths

385497

745 772563

766

1308 1359

107 1003 968 964

02468

10121416

dB

MLR

R

RNM

F

DRN

N-2

+ d

iscrim

CASS

M

GNSDR

GSIR

GSAR

Figure 5 +e comparison for the effect of MLRR RNF DRNN-2 + discrim and CASSM


Acknowledgments

+is work was supported by the Program for the Science andTechnology Plans of Tianjin China under Grant no19JCTPJC49200

References

[1] T L Nwe and H Li ldquoExploring vibrato-motivated acousticfeatures for singer identificationrdquo IEEE Transactions onAudio Speech and Language Processing vol 15 no 2pp 519ndash530 2007

[2] W H Tsai and C H Ma ldquoTriangulation-based singeridentification for duet music data indexingrdquo in Proceedings ofthe IEEE International Congress on Big Data Anchorage AKUSA July 2014

[3] M I Mandel G E Poliner andD PW Ellis ldquoSupport vectormachine active learning for music retrievalrdquo MultimediaSystems vol 12 no 1 pp 3ndash13 2006

[4] M Hermans and B Schrauwen ldquoTraining and analyzing deeprecurrent neural networksrdquo in Proceedings of the InternationalConference on Neural Information Processing Systems CurranAssociates Inc Daegu South Korea November 2013

[5] A Krizhevsky I Sutskever and G Hinton ImageNet Clas-sification with Deep Convolutional Neural Networks CurranAssociates Inc Red Hook NY USA 2012

[6] D Bahdanau K Cho and Y Bengio ldquoNeural machinetranslation by jointly learning to align and translaterdquo httpsarxivorgabs14090473

[7] S Boll ldquoSuppression of acoustic noise in speech using spectralsubtractionrdquo IEEE Transactions on Acoustics Speech andSignal Processing vol 27 no 2 pp 113ndash120 1979

[8] J Chen J Benesty Y Huang et al ldquoNew insights into the noisereduction Wiener filterrdquo IEEE Transactions on Audio Speech ampLanguage Processing vol 14 no 4 pp 1218ndash1234 2006

[9] Y Laufer and S Gannot ldquoA bayesian hierarchical model forspeech enhancement with time-varying audio channelrdquo IEEEACM Transactions on Audio Speech and Language Process-ing vol 27 no 1 pp 225ndash239 2019

[10] S Samui I Chakrabarti and S K Ghosh ldquoTime-frequencymasking based supervised speech enhancement frameworkusing fuzzy deep belief networkrdquo Applied Soft Computingvol 74 pp 583ndash602 2019

[11] R Cusack J Decks G Aikman and R P Carlyon ldquoEffects oflocation frequency region and time course of selective at-tention on auditory scene analysisrdquo Journal of ExperimentalPsychology Human Perception and Performance vol 30 no 4pp 643ndash656 2004

[12] Y Chen ldquoSingle channel blind source separation based onNMF and its application to speech enhancementrdquo in Pro-ceedings of the 2017 IEEE 9th International Conference onCommunication Software and Networks (ICCSN) pp 1066ndash1069 Guangzhou China May 2017

[13] Z Li Y Song L Dai and I McLoughlin ldquoListening andgrouping an online autoregressive approach for monauralspeech separationrdquo IEEEACM Transactions on Audio Speechand Language Processing vol 27 no 4 pp 692ndash703 2019

[14] Y X Wang A Narayanan and D L Wang ldquoOn trainingtargets for supervised speech separationrdquo IEEEACMTransactions on Audio Speech and Language Processingvol 22 no 12 pp 1849ndash1858 2014

[15] L I U Wen-Ju S NIE S Liang et al ldquoDeep learning basedspeech separation technology and its developmentsrdquo ActaAutomatica Sinica vol 42 no 6 pp 819ndash833 2016

[16] D Wang and J Chen ldquoSupervised speech separation based ondeep learning an overviewrdquo IEEEACM Transactions onAudio Speech and Language Processing vol 26 no 10pp 1702ndash1726 2018

[17] J Zhou H Zhao J Chen and X Pan ldquoResearch on speechseparation technology based on deep learningrdquo ClusterComputing vol 22 no S4 pp 8887ndash8897 2019

[18] L Hui M Cai C Guo L He W Zhang and J LiuldquoConvolutional maxout neural networks for speech separa-tionrdquo in Proceedings of the IEEE International Symposium onSignal Processing and Information Technology (ISSPIT) IEEEAbu Dhabi UAE December 2015

[19] Y Sun W Wang J Chambers and S M Naqvi ldquoTwo-stagemonaural source separation in reverberant room environ-ments using deep neural networksrdquo IEEEACM Transactionson Audio Speech and Language Processing vol 27 no 1pp 125ndash139 2019

[20] C P Wang and T Zhu ldquoNeural network based phasecompensation methods on monaural speech separationrdquo inProceedings of the IEEE International Conference on Multi-media and Expo (ICME) pp 1384ndash1389 Shanghai ChinaJuly 2019

[21] L Zhou S Lu Q Zhong Y Chen Y Tang and Y ZhouldquoBinaural speech separation algorithm based on long andshort time memory networksrdquo Computers Materials ampContinua vol 63 no 3 pp 1373ndash1386 2020

[22] P Haffner L Bottou and Y Bengio ldquoGradient-basedlearning applied to document recognitionrdquo Proceedings of theIEEE vol 86 no 11 pp 2278ndash2324 1998

[23] F Weninger J R Hershey J Le Roux and B SchullerldquoDiscriminatively trained recurrent neural networks forsingle-channel speech separationrdquo in Proceedings of the 2014IEEE Global Conference on Signal and Information Processingpp 577ndash581 IEEE Atlanta GA USA December 2014

[24] J Schmidhuber and J Schmidhuber ldquoLong short-termmemoryrdquoNeural Computation vol 9 no 8 pp 1735ndash1780 1997

[25] K Cho B Van Merrienboer C Gulcehre et al ldquoLearningphrase representations using RNN encoder-decoder for sta-tistical machine translationrdquo in Proceedings of the 2014Conference on Empirical Methods in Natural Language Pro-cessing (EMNLP) pp 1724ndash1734 Doha Qatar October 2014

[26] Y Wang and D Wang ldquoA deep neural network for time-domain signal reconstructionrdquo in Proceedings of the 2015IEEE International Conference on Acoustics Speech andSignal Processing (ICASSP) pp 4390ndash4394 IEEE SouthBrisbane Australia April 2015

[27] Z Shi H Lin L Liu et al ldquoFurcax end-to-end monauralspeech separation based on deep gated (de) convolutionalneural networks with adversarial example trainingrdquo inICASSP 2019 - 2019 IEEE International Conference onAcoustics Speech and Signal Processing (ICASSP) IEEEBrighton UK May 2019

[28] A J R Simpson G Roma and M D Plumbley ldquoDeepkaraoke extracting vocals from musical mixtures using aconvolutional deep neural networkrdquo in Proceedings of the 12thInternational Conference on Latent Variable Analysis andSignal Separation (LVAICA 2015) Springer-Verlag NewYork NY USA August 2015

[29] Y Luo and N Mesgarani ldquoConv-TasNet surpassing idealtime-frequency magnitude masking for speech separationrdquoIEEEACM Transactions on Audio Speech and LanguageProcessing vol 27 no 8 pp 1256ndash1266 2019


[30] J X Wang S B Li H M Jiang and X X Bian ldquoSpeechseparation based on CHF-CNNrdquo Computer Simulationvol 36 no 5 pp 279ndash283 2019

[31] D Gabor ldquo+eory of communication Part 1 the analysis ofinformationrdquo Journal of the Institution of Electrical Engi-neersmdashPart III Radio and Communication Engineeringvol 93 no 26 pp 429ndash441 1946

[32] J Allen ldquoShort term spectral analysis synthesis and modi-fication by discrete Fourier transformrdquo IEEE Transactions onAcoustics Speech and Signal Processing vol 25 no 3pp 235ndash238 1977

[33] Z Yang D Yang C Dyer et al ldquoHierarchical attentionnetworks for document classificationrdquo in Proceedings of the2016 Conference of the North American Chapter of the As-sociation for Computational Linguistics Human LanguageTechnologies pp 1480ndash1489 San Diego CA USA June 2016

[34] M Delfarah and D Wang ldquoDeep learning for talker-de-pendent reverberant speaker separation an Empirical StudyrdquoIEEEACM Transactions on Audio Speech and LanguageProcessing vol 27 no 11 pp 1839ndash1848 2019

[35] J Cheng L Dong andM Lapata ldquoLong short-termmemory-networks for machine readingrdquo in Proceedings of the 2016Conference on Empirical Methods in Natural LanguageProcessing Austin TX USA November 2016

[36] P S Huang M Kim M Hasegawa-Johnson andP Smaragdis ldquoJoint optimization of masks and deep recur-rent neural networks for monaural source separationrdquo IEEEACM Transactions on Audio Speech and Language Process-ing vol 23 no 12 pp 2136ndash2147 2015

[37] C L Hsu and J S Jang ldquoOn the improvement of singing voiceseparation for monaural recordings using the MIR-1Kdatasetrdquo IEEE Transactions on Audio Speech and LanguageProcessing vol 18 no 2 pp 310ndash319 2010

[38] C Vincent R Gribonval and C Fevotte ldquoPerformancemeasurement in blind audio source separationrdquo IEEETransactions on Audio Speech and Language Processingvol 14 no 4 pp 1462ndash1469 2006

[39] R Gribonval P Philippe and F Bimbot ldquoAdaptation ofbayesian models for single-channel source separation and itsapplication to voicemusic separation in popular songsrdquo IEEETransactions on Audio Speech and Language Processingvol 15 no 5 pp 1564ndash1578 2007


separation are speech enhancement which analyze the dataof the mixed speech and the noise and then estimate thetarget speech through the interference estimation of themixed speech From the viewpoint of signal processingmany speech enhancement methods propose a powerspectrum of estimated noise or an ideal Wiener wave re-corder such as spectral subtraction [7] andWiener filter [8]+ere are many other ways to achieve speech enhancement(eg through Bayesian model [9] or time-frequencymasking methods [10]) Some researchers tried to usecomputer auditory scene analysis (CASA) to achieve speechseparation which is the use of computer technology to allowcomputers to imitate the processing of human auditorysignals for modeling so as to have the ability to perceivesound process sound and interpret sound from complexmixed sound sources such as humans +e basic compu-tational goal of auditory scene analysis is to estimate an idealbinary mask to achieve speech separation based on auditorymasking of the human ears [11ndash13] Speech enhancementtechnology generally assumes that the interference speech isstable so its separation performance will be severely de-graded when the actual interference noise is nonstationaryComputational auditory scene analysis with better gener-alization performance has no assumptions about noiseMeanwhile the computational auditory scene analysisheavily relies on speech pitch detection which is very dif-ficult under the interference of background sounds

Speech separation is designed to isolate useful signalsfrom disturbed speech signals a process that can beexpressed naturally as a supervised learning problem [14] Atypical supervised speech separation system learns a map-ping function from noisy features to separation targets (egideal masking or magnitude spectrum of speech interested)through supervised learning algorithms such as deep neuralnetworks Many supervised speech separation methods havebeen proposed in recent years [15ndash18] +e learning modelsfor supervised speech separation are mainly divided into twokinds (1) Traditional methods such asmodel-basedmethodsand speech enhancement methods (2) Newer methodsusing DNNs (Deep Neural Networks) Due to the speechgeneration mechanism the input features and output targetsof the speech separation show obvious spatiotemporalstructure +ese characteristics are very suitable for mod-eling with deep models Many deep models are widely usedin speech separation Sun et al [19] proposed two-stageapproach with two DNN-based methods to address theproblem of the performance of current speech separationmethods New training targets in complement of existingmagnitude training targets were trained through neuralnetwork methods to compensate for phase of target in orderto achieve better separation performance by the authors of[20] Zhou et al [21] designed a separation system on Re-current Neural Network (RNN) with long short-termmemory (LSTM) which effectively learns the temporaldynamics of spatial features Supervised speech separationdoes not require spatial orientation information of the soundsources and there are no restrictions on the statisticalcharacteristics of noise It shows obvious advantages and afairly bright research prospective under the conditions of

monaural nonstationary noise and low signal-to-noise ratio[22 23]

Deep Recurrent Neural Network (DRNN) among themis a representative of deep models and has been widely usedin speech separation DRNN has strong learning ability inspeech separation RNN series of units such as LSTM [24]GRU (Gated Recurrent Unit GRU) [25] all of whose hiddenstates are calculated according to the Markov model +eprevious hidden state will save some previous informationwhile the magnitude spectrum of the mixed speech is arelatively long sequence and sequence information will belost (after all the capacity of the cell state is limited inpractice) which will affect the separation of mixed speechand reduce the accuracy of estimated speech

Convolutional neural network (CNN) has been widelyused in deep learning since it was firstly proposed by YannLecun et al [26] in 1998 CNN has its natural advantages intwo-dimensional signal processing and its powerful mod-eling capabilities have been verified in tasks such as imagerecognition CNN at present has been applied to speechseparation [27 28] and has achieved the best separationperformance under the same conditions which has exceededthe DNN-based speech separation systems Luo et al [29]proposed Conv-TasNet which was a fully convolutionaltime-domain audio separation network to solve the prob-lems of time-frequency masking Wang et al [30] appliedCNN and modified its loss function to solve the imbalancebetween classification accuracy hit rate and error rate

Both the traditional and DNN-based separation modelshave achieved good results but they all have correspondingshortcomings Convolutional neural network can make useof the spatial connection of input data so that each elementcan learn local features without learning global features andthen at a higher-level these local features will be combinedtogether to get global features Weight sharing can reducethe parameters between different neurons reduce the cal-culation time and improve the speed of the model By usinga variety of convolution filters multiple feature maps can beobtained which can detect the same type of features indifferent positions and ensure the invariance of displace-ment and deformation to a certain extent +erefore thispaper proposes a method based on convolutional neuralnetwork to solve the problem of the loss of long sequenceinformation of mixed speech Combined with the attentionmechanism our model can better focus on the timing se-quence step which contributes the most and solve theproblem of insufficient memory of the temporal model to acertain extent so as to improve the speech separation effect

3 System Modeling

+e purpose of the speech separation is to separate the targetvoice from the background sound +e overall structure ofthe speech separation system is shown in Figure 1 +e firstprocess of speech separation is to obtain the magnitudespectrum and phase information of mixed speech throughSTFT (Short-time Fourier Transform) [31] +e estimatedtargets using speech magnitude spectrum have been shownto suppress noise significantly and improve the speech



4 Proposed Methods









Mixturesignal

Evaluation

STFT

ISTFT

CNNlayer

Attentionlayer

Magnitudespectra

Phasespectra

Estimatedmagnitude

spectra












minus jωxdx (1)




Bi-directionalRNN


Magnitude spectra

Max pooling

Convolutionallayer

Convolutionallayer

Normalization

High speed network

Convolutionallayer

Convolutionallayer

Input layer








Uti V

Ttan h Wαht + Wβoi1113872 1113873

ati softmax U

ti1113872 1113873

htprime 1113944

tminus1

i1a

tioi

(2)






1113957y1 1113954y1

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

1113957y2 1113954y2

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

(3)





+ y2 minus 1113957y

22



22

cgt 0

(4)





























73 735

1268 1305

953 942

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR


708 715

12481287

931 922

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR








6 Conclusion


Data Availability




772 735 715

1359 1293 1287

964 946 922

0

2

4

6

8

10

12

14

16


dB

GNSDR

GSIR

GSAR


757 766 772

1302 1322 1359

971 974 964

0

2

4

6

8

10

12

14

16


dB

Attention length

GNSDR

GSIR

GSAR


385497

745 772563

766

1308 1359

107 1003 968 964

02468

10121416

dB

MLR

R

RNM

F

DRN

N-2

+ d

iscrim

CASS

M

GNSDR

GSIR

GSAR



Acknowledgments


References











































4 Proposed Methods









Mixturesignal

Evaluation

STFT

ISTFT

CNNlayer

Attentionlayer

Magnitudespectra

Phasespectra

Estimatedmagnitude

spectra












minus jωxdx (1)




Bi-directionalRNN


Magnitude spectra

Max pooling

Convolutionallayer

Convolutionallayer

Normalization

High speed network

Convolutionallayer

Convolutionallayer

Input layer








Uti V

Ttan h Wαht + Wβoi1113872 1113873

ati softmax U

ti1113872 1113873

htprime 1113944

tminus1

i1a

tioi

(2)






1113957y1 1113954y1

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

1113957y2 1113954y2

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

(3)





+ y2 minus 1113957y

22



22

cgt 0

(4)





























73 735

1268 1305

953 942

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR


708 715

12481287

931 922

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR








6 Conclusion


Data Availability




772 735 715

1359 1293 1287

964 946 922

0

2

4

6

8

10

12

14

16


dB

GNSDR

GSIR

GSAR


757 766 772

1302 1322 1359

971 974 964

0

2

4

6

8

10

12

14

16


dB

Attention length

GNSDR

GSIR

GSAR


385497

745 772563

766

1308 1359

107 1003 968 964

02468

10121416

dB

MLR

R

RNM

F

DRN

N-2

+ d

iscrim

CASS

M

GNSDR

GSIR

GSAR



Acknowledgments


References



















































minus jωxdx (1)




Bi-directionalRNN


Magnitude spectra

Max pooling

Convolutionallayer

Convolutionallayer

Normalization

High speed network

Convolutionallayer

Convolutionallayer

Input layer








Uti V

Ttan h Wαht + Wβoi1113872 1113873

ati softmax U

ti1113872 1113873

htprime 1113944

tminus1

i1a

tioi

(2)






1113957y1 1113954y1

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

1113957y2 1113954y2

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

(3)





+ y2 minus 1113957y

22



22

cgt 0

(4)





























73 735

1268 1305

953 942

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR


708 715

12481287

931 922

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR








6 Conclusion


Data Availability




772 735 715

1359 1293 1287

964 946 922

0

2

4

6

8

10

12

14

16


dB

GNSDR

GSIR

GSAR


757 766 772

1302 1322 1359

971 974 964

0

2

4

6

8

10

12

14

16


dB

Attention length

GNSDR

GSIR

GSAR


385497

745 772563

766

1308 1359

107 1003 968 964

02468

10121416

dB

MLR

R

RNM

F

DRN

N-2

+ d

iscrim

CASS

M

GNSDR

GSIR

GSAR



Acknowledgments


References















































Uti V

Ttan h Wαht + Wβoi1113872 1113873

ati softmax U

ti1113872 1113873

htprime 1113944

tminus1

i1a

tioi

(2)






1113957y1 1113954y1

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

1113957y2 1113954y2

11138681113868111386811138681113868111386811138681113868

1113954y11113868111386811138681113868

1113868111386811138681113868 + 1113954y21113868111386811138681113868

1113868111386811138681113868⊙ z

(3)





+ y2 minus 1113957y

22



22

cgt 0

(4)





























73 735

1268 1305

953 942

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR


708 715

12481287

931 922

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR








6 Conclusion


Data Availability




772 735 715

1359 1293 1287

964 946 922

0

2

4

6

8

10

12

14

16


dB

GNSDR

GSIR

GSAR


757 766 772

1302 1322 1359

971 974 964

0

2

4

6

8

10

12

14

16


dB

Attention length

GNSDR

GSIR

GSAR


385497

745 772563

766

1308 1359

107 1003 968 964

02468

10121416

dB

MLR

R

RNM

F

DRN

N-2

+ d

iscrim

CASS

M

GNSDR

GSIR

GSAR



Acknowledgments


References




































































73 735

1268 1305

953 942

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR


708 715

12481287

931 922

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR








6 Conclusion


Data Availability




772 735 715

1359 1293 1287

964 946 922

0

2

4

6

8

10

12

14

16


dB

GNSDR

GSIR

GSAR


757 766 772

1302 1322 1359

971 974 964

0

2

4

6

8

10

12

14

16


dB

Attention length

GNSDR

GSIR

GSAR


385497

745 772563

766

1308 1359

107 1003 968 964

02468

10121416

dB

MLR

R

RNM

F

DRN

N-2

+ d

iscrim

CASS

M

GNSDR

GSIR

GSAR



Acknowledgments


References



















































73 735

1268 1305

953 942

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR


708 715

12481287

931 922

0

2

4

6

8

10

12

14

GRU LSTM

dB

GNSDR

GSIR

GSAR








6 Conclusion


Data Availability




772 735 715

1359 1293 1287

964 946 922

0

2

4

6

8

10

12

14

16


dB

GNSDR

GSIR

GSAR


757 766 772

1302 1322 1359

971 974 964

0

2

4

6

8

10

12

14

16


dB

Attention length

GNSDR

GSIR

GSAR


385497

745 772563

766

1308 1359

107 1003 968 964

02468

10121416

dB

MLR

R

RNM

F

DRN

N-2

+ d

iscrim

CASS

M

GNSDR

GSIR

GSAR



Acknowledgments


References















































6 Conclusion


Data Availability




772 735 715

1359 1293 1287

964 946 922

0

2

4

6

8

10

12

14

16


dB

GNSDR

GSIR

GSAR


757 766 772

1302 1322 1359

971 974 964

0

2

4

6

8

10

12

14

16


dB

Attention length

GNSDR

GSIR

GSAR


385497

745 772563

766

1308 1359

107 1003 968 964

02468

10121416

dB

MLR

R

RNM

F

DRN

N-2

+ d

iscrim

CASS

M

GNSDR

GSIR

GSAR



Acknowledgments


References










































Acknowledgments


References





















































Documents

SpeechSeparationUsingConvolutionalNeuralNetworkand ...downloads.hindawi.com/journals/ddns/2020/2196893.pdfcharacteristics of the convolutional neural network and attention mechanism,