
Task Load Modelling for LTE Baseband Signal Processing

with Artificial Neural Network Approach

LU WANG

Master's Degree Project
Stockholm, Sweden 2014

XR-EE-SB 2014:013

Abstract

This thesis presents research on developing an automatic or guided-automatic tool to predict the hardware (HW) resource occupation, namely the task load, with respect to the software (SW) application algorithm parameters in an LTE base station. For the signal processing in an LTE base station it is important to know how many HW resources will be used when a SW algorithm is applied on a specific platform. This information is valuable for understanding the system and the platform better, which can facilitate a reasonable use of the available resources.

The process of developing the tool is considered to be the process of building a mathematical model between the HW task load and the SW parameters, where this process is defined as function approximation. According to the universal approximation theorem, the problem can be solved by an intelligent method called artificial neural networks (ANNs). The theorem indicates that any function can be approximated with a two-layered neural network as long as the activation function and the number of hidden neurons are proper. The thesis documents a work flow for building the model with the ANN method, as well as some research on data subset selection with mathematical methods, such as Partial Correlation and Sequential Searching, as a data pre-processing step for the ANN approach. In order to make the data selection method suitable for ANNs, a modification has been made to the Sequential Searching method, which gives a better result.

The results show that it is possible to develop such a guided-automatic tool for prediction purposes in LTE baseband signal processing under specific precision constraints. Compared to other approaches, this model tool with an intelligent approach has a higher precision level and better adaptivity, meaning that it can be used in any part of the platform even though the transmission channels are different.

Key words: Automatic tool, Signal Processing, Function Approximation, Prediction, ANNs, Data Pre-processing, Task Load Prediction.

Sammanfattning

This thesis develops an automatic or guided-automatic tool for predicting the demand for hardware resources, also called task load, with respect to the software algorithm parameters in an LTE base station. In the signal processing of an LTE base station it is important to gain knowledge of how much of the hardware's resources will be used when a piece of software is run on a specific platform. This information is valuable for understanding the system and the platform better, which can enable a reasonable use of the available resources.

The process of developing the tool is considered to be the process of building a mathematical model between the hardware load and the software parameters, where the process is defined as function approximation. According to the universal approximation theorem, the problem can be solved by an intelligent method called artificial neural networks (ANNs). The theorem shows that an arbitrary function can be approximated with a two-layered neural network as long as the activation function and the number of hidden neurons are chosen properly. The thesis documents a work flow for building the model with the ANN method, and also studies mathematical methods for selecting subsets of the data, such as Partial Correlation and Sequential Searching, as a data pre-processing step for the ANN. In order to make the data selection suitable for ANNs, a modification has been made to the Sequential Searching method, which gives a better result.

The results show that it is possible to develop such a guided-automatic tool for prediction purposes in LTE baseband signal processing under specific precision constraints. Compared with other methods, this model tool with an intelligent approach has a higher precision level and better adaptivity, which means that it can be used in any part of the platform even if the transmission channels are different.

Keywords: Automatic tool, Signal Processing, Function Approximation, Prediction, Artificial Neural Networks, Data Pre-processing.

Acknowledgment

A thesis project like this cannot be accomplished by one person alone; many people provided either technical support or inspiring guidance.

First of all, I would like to express my deepest gratitude to our supervisor Henrik Olson, for his superior guidance and coaching during weekly meetings, and for his patient reviewing of and advice on our report during his vacation time. His encouragement gave us a lot of motivation to finish the task; the project goal could not have been achieved successfully without his guidance.

I would also like to give my grateful appreciation to our supervisor John Nilsson, who gave us a lot of technical support on extracting the data that we needed in the thesis project, and who inspired my neural network approach, giving me the confidence to learn and research a completely new technique by myself. My appreciation also goes to Johan Parin and Jianrong Zhang, who gave us much help with understanding and extracting the data despite their busy schedules.

I am also grateful to our manager Jonas Allander, who offered me such a great opportunity to do my master thesis at Ericsson, and thanks for the recommendation for the job position, where I can make further contributions to Ericsson. My thanks are also extended to Mathias Ekwing, who believed in me and offered this thesis opportunity.

In addition, I would like to thank Professor Magnus Jansson, who agreed to be our examiner and gave us much encouragement during the whole thesis work.

Last but not least, I sincerely thank my colleague Chang Liu, with whom I cooperated during the whole process. We discussed and solved many problems together, and he was also the one who gave me much inspiration and patient help.

I thank all my family and friends; it was your support that encouraged me to accomplish the thesis task. My grateful thanks.

Contents

1 Introduction . . . 1
1.1 Background . . . 2
1.2 Problem Definition . . . 2
1.2.1 Data Analysis and Related Work . . . 4
1.3 Outline . . . 7

2 Artificial Neural Networks . . . 8
2.1 Biological Neural Networks . . . 8
2.2 Artificial Neurons . . . 9
2.3 Activation Function . . . 10
2.4 Neural Network Topology . . . 11

3 Methodology . . . 14
3.1 Neural Network Algorithm . . . 14
3.1.1 Weights Initialization Algorithm . . . 14
3.1.2 Back-propagation Algorithm . . . 15
3.1.3 Levenberg-Marquardt Algorithm . . . 17
3.1.4 Training Procedure . . . 18
3.2 Data Preparation . . . 20
3.2.1 Data Selection . . . 20
3.2.2 Data Normalization . . . 28

4 Simulation and Results . . . 29
4.1 Parameter Settings . . . 29
4.2 Model Build Flow . . . 29
4.2.1 Model Fit Criteria . . . 30
4.3 Result and Analysis . . . 33
4.3.1 Type I Task Representation: Function1 . . . 33
4.3.2 Type II Task Representation: Function12 . . . 41
4.3.3 Type III Task Representation: Function5 . . . 45
4.3.4 Type IV Special Cases - Function14 . . . 47
4.4 Model Tool Evaluation . . . 48

5 Conclusions . . . 50

6 Future Work . . . 51
6.1 Optimization with Hidden Neurons . . . 51
6.1.1 Pruning . . . 51
6.1.2 Simulated Annealing . . . 52

7 Appendix A . . . 54

List of Figures

1.1 Communication Link . . . 2
1.2 Signal Processing Chain . . . 3
1.3 Scatter Plot of Task1 . . . 4
1.4 Scatter Plot of Task2 . . . 5
1.5 Scatter Plot of Task12 . . . 5

2.1 Biological Neuron . . . 9
2.2 Artificial Neuron . . . 10
2.3 Activation Functions . . . 12
2.4 Feed-Forward Network . . . 12

3.1 Sum Squared Error with respect to Weights . . . 16
3.2 Network Performance on Different Hidden Size . . . 19
3.3 Graph Illustration of Partial Correlation . . . 23
3.4 Network Performance with SequentialFS . . . 26
3.5 Accumulate Number of Retain Variables with SequentialFS . . . 26
3.6 Network Performance with Modified SequentialFS . . . 26
3.7 Accumulate Number of Retain Variables with Modified SequentialFS . . . 26

4.1 Flow Chart of Model Build . . . 30
4.2 Ideal Regression between Predicted Output Data and Target Data . . . 33
4.3 Noisy Regression between Predicted Output Data and Target Data . . . 33
4.4 Function1: Regression plot between target data and estimate output data . . . 35
4.5 Function1: Performance plot of Function1 . . . 35
4.6 Function1: Scatter plot of predictor with respect to target data . . . 36
4.7 Function1: Scatter plot of 100 sample points . . . 36
4.8 Function1: Trained neural network of Function1 . . . 36
4.9 Function2: Scatter plot of predictor 5 with respect to task load . . . 39
4.10 Function2: Scatter plot of predictor 7 with respect to task load . . . 39
4.11 Function2: Subset selection with Modified SequentialFS method . . . 39
4.12 Function12: Regression plot between target data and estimate output data . . . 42
4.13 Function12: Scatter plot of predictor 2 with respect to task load . . . 43
4.14 Function12: Scatter plot of predictor 6 with respect to task load . . . 43
4.15 Function12: Scatter plot of predictor 7 with respect to task load . . . 43
4.16 Function12: Scatter plot of predictor 8 with respect to task load . . . 43
4.17 Function5: Regression plot between target data and estimate output data . . . 46
4.18 Function5: Scatter plot of predictor 5 with respect to task load . . . 47
4.19 Function5: Scatter plot of sample point . . . 47
4.20 Function8: Scatter plot of predictor 5 with respect to task load . . . 47
4.21 Function14: Regression plot between target data and estimate output data . . . 48
4.22 Function14: Scatter plot of predictor 5 with respect to task load . . . 48
4.23 The whole model work flow . . . 49

List of Tables

3.1 Network Performance Result with SequentialFS method . . . 27
3.2 Network Performance Result with Modified SequentialFS method . . . 27

4.1 Range of Coefficient of Correlation . . . 32
4.2 Function1: Subset Selection with PCA . . . 33
4.3 Function1: Coefficient of correlation between predictors and response . . . 34
4.4 Function1: Network performance with different hidden size based on Partial Correlation subset selection method . . . 34
4.5 Function1: Trained neural network weights results . . . 37
4.6 Function1: Network performance based on Principal Component Analysis subset selection method . . . 38
4.7 Function approximation results of Type I Function . . . 38
4.8 Function15: Function approximation results with Partial Correlation . . . 40
4.9 Function12: Network performance with different hidden size based on Partial Correlation subset selection method . . . 40
4.10 Function12: Coefficient of correlation between predictors and response . . . 41
4.11 Function12: Network performance with different hidden size based on Partial Correlation subset selection method . . . 41
4.12 Function12: Trained neural network weights results . . . 42
4.13 Function approximation results of Type II Function . . . 44
4.14 Function approximation results . . . 44
4.15 Function5: Coefficient of correlation between predictors and response . . . 45
4.16 Function5: Network performance with different hidden size based on Partial Correlation subset selection method . . . 45

7.1 Appendix A . . . 55

Chapter 1

Introduction

LTE, short for Long Term Evolution and commonly denoted 4G, is a standard specified by 3GPP (Third-Generation Partnership Project) for wireless communication with high-speed data [4]. With new digital signal processing (DSP) techniques, including the orthogonal frequency-division multiple access (OFDMA) modulation scheme and the use of multiple antennas with Multiple-Input Multiple-Output (MIMO) technology, the capacity and speed of wireless communication have increased dramatically.

In wireless communication systems, the base station plays an important role as one terminal of a communication link, as shown in Figure 1.1. The dashed lines indicate the signal transmission path. The signal is transmitted from the transmitter terminal, through the base stations, to the receiver end. The real-world analog signals are converted to digital signals and processed through the communication link, where a large amount of mathematical operations is required for the DSP algorithms performed on the data samples. In the LTE base station, these algorithms, together with the 3GPP radio network specification, are realized in control software, signal processing software and radio software, which execute on different hardware platforms using different programming models. Along the signal transmission chain, signals are processed by multiple procedures, such as modulation, demodulation, coding and decoding. These procedures generate a number of associated parameters; taking the modulation scheme as an example, the scheme can be QPSK, 16QAM or 64QAM, so the number of signal bits used for modulation can be 2, 4 or 6. These are the so-called SW parameters in this report. With a specific SW application running, the application always occupies a certain amount of HW resources, such as computational cores, execution time and HW accelerators, given the same environment.

In the area of baseband processing, the performed signal processing is a mix of control and DSP, and the real-time critical part of the application, physical layer scheduling, is realized here. The signal processing chains in general contain interleaved HW-accelerated tasks in order to save power, area and latency.


Figure 1.1: Communication Link.

Exceeding the customers' or operators' expectations in terms of increased features and capacity of the baseband SW and HW is a challenging and extremely rewarding task. Ericsson works with cutting-edge SW solutions on a highly potent HW which has been tailor-made for LTE.

1.1 Background

Since any SW application in the baseband area occupies HW resources while it is running, it is critical to have prior knowledge of how much processing resources are required for different parts of the application. This knowledge can be used for several purposes:

- Monitoring hardware resource consumption.

- Estimating the current resource headroom in order to take advantage of available resources.

- Better understanding of the current system.

- Design exploration with high-level simulation.

1.2 Problem Definition

This thesis work aims to make an automatic or guided-automatic tool to process a specific type of data gathered in the testing of digital units, under the premise that no prior knowledge of the data is available.

Figure 1.2 shows a signal processing chain. As the figure denotes, the signal is transmitted from the bottom of the chain and follows the arrows through different functions. When the signal is processed through those specific functions, execution time is consumed, which is named the task load in this thesis work. Each rounded rectangle with a solid line is treated as one SW processing task, with one or more specific functions to be processed, while the rounded rectangle with a dashed line is related to a HW processing task that is ignored in this thesis work.

Figure 1.2: Signal Processing Chain.

In the real case, for each task the mathematical model between the task load and the SW parameters can be expressed as:

$$Y_L = f(X_1, X_2, \ldots, X_n) + N_{noise} \qquad (1.1)$$

$Y_L$ denotes the task load of the Lth task, $(X_1, X_2, \ldots, X_n)$ represent the input SW parameters, n is the number of SW parameters, and $N_{noise}$ is the unknown background noise of the test system. The function f is the real, unknown function between the SW parameters and the HW task load. So the model to be built for estimating the task load with respect to the input SW parameters should have the following expression:

$$Y_{L_{est}} = f_{est}(X_1, X_2, \ldots, X_m), \quad m \leq n. \qquad (1.2)$$

$Y_{L_{est}}$ is the estimated task load based on the real task load values, $(X_1, X_2, \ldots, X_m)$ represent the input SW parameters, and m denotes the number of SW parameters actually used in the model. It is expected that m should be as small as possible, so that the model is simple and costs less. $f_{est}$ is the estimated function that is to be built from the given input and output data, with no further prior background knowledge given. The smaller m is, the simpler $f_{est}$ is.

The goal of the thesis is to give a mathematical model for each specific task that uncovers the relationship between the input parameters and the task load; the model should also have the ability to predict the task load given certain input parameters. With this model tool, one expects that, whenever an input data set is given, predictions of the task load related to specific input parameters can be produced within certain precision criteria.

Theoretically, the procedure of abstracting a model from known input-output pairwise data is denoted function approximation. In reality, function approximation plays an important role in exploring the underlying relationship between variables, and it has been applied to various problems such as prediction, classification and recognition. Many methods have been developed to address the problem, and one of the most widely used is artificial neural networks.

1.2.1 Data Analysis and Related Work

In the LTE baseband, the SW has been instrumented to trace out a number of parameters that are believed to affect the HW execution times; some of them are highly correlated with the HW execution time, while others contribute less. Several scatter plots between the SW parameters and the HW execution time are shown as examples to give a better understanding of the data.

[Figure 1.3 panels: task load plotted against each SW parameter X1 to X11.]

Figure 1.3: Scatter Plot of Task1.

In Figure 1.3, the scatter plots give a first impression of the relationship between the SW parameters and the HW execution time, where the x-axes X1, X2, ... denote the different SW parameters and the y-axes denote the HW execution time. From the sub-figures it is easy to see that predictor X5 has a linear relationship with the task load, while the correlations between the task load and X8 and X9, respectively, are hard to judge, and for the other parameters, which take only two or three different sample values, the relationship is even harder to tell.

Another case can be seen in Figure 1.4, where the relationship between X5 and the task load is not simply linear; it can be seen that another variable also controls the variation of the task load.


Figure 1.4: Scatter Plot of Task2.

[Figure 1.5 panels: task load plotted against each SW parameter X1 to X10.]

Figure 1.5: Scatter Plot of Task12.

Moreover, in Figure 1.5 the relationship gets even fuzzier and is no longer obvious. In this case, the function of the task is hard to approximate with a precise model. In reality, the relationship between two or more variables usually cannot be expressed simply as linear. Most relationships are like the one in Figure 1.5 or even more complex, so it is difficult to deduce the exact relationship between the variables. Taking a general view of all three task scatter plots, some common data features can be found:

- High-dimensional input data (SW parameters).

- Noisy data.

- Unknown relationship between the SW parameters and the HW task load.

The background noise level can be estimated with the mean square error (MSE). From the analysis of the data it follows that, in the real case, every data set has a model-free error that constitutes a theoretical minimum MSE below which no model can go. As has been mentioned, the real task load includes background noise, so an important step of the model building is to estimate the background noise level, denoted $MSE_N$, which is calculated as explained in the following steps: first, divide the original sample data into classes such that within each class the predictor variables are identical for all samples while the response variables differ; then sum the squared errors of the response variables over all classes, excluding classes with only one sample; finally, divide the sum by the total number of samples taken into consideration. The result is the noise estimate:

$$MSE_N = \frac{\sum_{j=1}^{C} \sum_{i=1}^{c_j} (y_i - \bar{y}_{c_j})^2}{\sum_{j=1}^{C} c_j} \qquad (1.3)$$

where C is the total number of classes, $c_j$ is the number of samples within the jth class, and $\bar{y}_{c_j}$ is the mean value of the response variable y in the jth class. The value of $MSE_N$ can be used to measure the quality of the model results.
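To make the noise estimation concrete, a minimal sketch of equation 1.3 in Python/NumPy follows; the array names and the grouping on identical predictor rows are illustrative assumptions, not code from the thesis tooling.

```python
import numpy as np

def estimate_noise_mse(X, y):
    """Estimate the background noise level MSE_N (equation 1.3).

    X : (n_samples, n_predictors) SW parameter matrix.
    y : (n_samples,) measured task load (response).
    Samples with identical predictor rows form one class; classes with a
    single sample carry no information about the noise and are skipped.
    """
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)

    sq_err_sum = 0.0   # numerator: squared deviations within each class
    n_used = 0         # denominator: number of samples in the used classes

    # Group samples by identical predictor vectors.
    _, class_ids = np.unique(X, axis=0, return_inverse=True)
    for c in np.unique(class_ids):
        y_class = y[class_ids == c]
        if len(y_class) < 2:
            continue  # exclude classes with only one sample
        sq_err_sum += np.sum((y_class - y_class.mean()) ** 2)
        n_used += len(y_class)

    return sq_err_sum / n_used if n_used > 0 else np.nan
```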

In order to handle data with all the features above, the learning ability of ANNs can provide a solution. The learning ability means that, given an ANN structure, the network is able to learn the functional dependence between input and output, which is denoted supervised learning. With this learned network it is possible to predict new output values for input data from the same population as the sample data. From this point of view, learning in an ANN is equivalent to a function approximation problem. And according to the universal approximation theorem, simple neural networks can approximate a wide variety of interesting functions when given appropriate parameters [1].

In the physical world, especially in the engineering field, the modeling and numerical analysis of complex and fuzzy systems play an important role, and artificial neural networks have been used for such purposes. The major advantage of ANNs is their strong, human-brain-like self-learning and problem solving ability, including robustness to noise and adaptation to high-dimensional problems [13]. Different kinds of neural network topologies have been developed for different uses; some of them are adequate for function approximation, while others are suitable for other kinds of problems such as classification and pattern recognition. The most popular topology for function approximation is the Multi-Layer Perceptron (MLP), which belongs to the feed-forward network class. The approximation of multidimensional nonlinear functions by feed-forward neural networks with an algebraic training method is presented in [7]. The algebraic training method is fast and simple to use for training procedures with input data, output data and derivative information. The method in [16] also uses first- and second-order derivatives and presents better training results. However, for the data in this thesis the derivatives are hard to use because of the binary parameters. Several experiments are given in [14] to illustrate the approximation ability of neural networks for high-dimensional functions, which means that the neural network approach is applicable to high-dimensional problems. For this thesis problem, the function approximation ability of ANNs should be applied; it has been proved that a function can be approximated to any precision level by a nonlinear mapping network as long as the network structure design is proper.

1.3 Outline

The thesis work is presented as follows. First, an introduction to artificial neural networks is provided to give readers a first impression of what ANNs are. Then the methodology part follows, where all the detailed algorithms and techniques are presented. In Chapter 4, the simulations of the thesis work are made, and the results and analysis are presented. Future work that could be implemented later to improve the results is also discussed. Last but not least, conclusions are given.


Chapter 2

Artificial Neural Networks

An Artificial Neural Network (ANN) is a mathematical model of the human brain system. The method can be considered another approach to the problem of computation, one that can be much faster than conventional digital computing. Many ANN applications solving problems such as pattern recognition, function approximation, prediction and feature selection have been researched in the scientific field, proving the learning abilities of ANNs.

From a historical overview, knowledge of the human brain remained limited until the mechanism of communicating interconnections was brought up by neuroanatomists and neurophysiologists. The first contribution to neural networks was the M-P model, designed by McCulloch and Pitts in 1943, which gave a basic description of the properties of a biological neuron. The first peak of ANN research was from 1950 to 1968. During this period, the perceptron structure was successfully constructed. This approach was researched by Marvin Minsky, Frank Rosenblatt, Bernard Widrow, et al., and it had almost become the key technique of machine intelligence until it was dismissed because of its inability to solve the XOR problem. It was not until the beginning of the 1980s that the second peak of ANN research came, with the contributions of the energy approach by J. Hopfield and the back-propagation (BP) learning algorithm by Parker and Werbos independently. Until now, the BP algorithm is still the core part of the neural network learning process.

As artificial neural networks can be designed to model a particular task in the way the human brain performs it, a brief introduction to biological neural networks is presented in this chapter before the detailed description of ANNs.

2.1 Biological Neural Networks

A biological neural network consists of a large number of neurons, which are special biological cells that process information. As shown in Figure 2.1, a neuron is composed of a cell body, two types of branches and a synapse. Inside the cell body there is a nucleus containing information about hereditary traits. The two types of tree-like branches are the axon and the dendrites. The axon acts as a transmitter while a dendrite acts as a receiver. At the end of the axon there is a synapse used as the connection between two neighbouring neurons. A biological neuron works like this: when a signal is generated by its cell body and transmitted along the axon, it finally arrives at the synapse. A synapse always has two statuses, either excited or restrained, and it also has an excitation threshold to decide its status. When the signal arrives at the synapse, the strength of this signal gives an excitation to the synapse. If the strength exceeds the excitation threshold, the synapse is excited and transmits the signal to its connected neuron through the dendrite. However, if the strength is below the excitation threshold, the synapse is restrained and the signal is not transmitted further. The neuron has a so-called learning ability, meaning that it can learn from the activities it participates in. This learning ability is the key motivation for developing artificial neural networks.

Figure 2.1: Biological Neuron.

2.2 Artificial Neurons

Similar to a biological neuron, an artificial neuron is the basic information processing unit of artificial neural networks, and it is a mathematical model of the biological neuron. Figure 2.2 shows the model structure of an artificial neuron, with input signals summed together through weighted connections and passed through an activation function to form the output signal. As shown in the figure, three basic elements of a neural network model can be identified:

1. Connection: A connection is characterized by its connection weight, which represents the strength of the link. Suppose one has n signals X = [x1, x2, ..., xn] and the associated weights are W = [w1, w2, ..., wn]; then the net input signal of a neuron is calculated as

$$\mathrm{net}(x) = \sum_{i=1}^{n} x_i w_i = XW \qquad (2.1)$$

2. Bias: In an artificial neural network model there is always a constant 1 as an input of the network, known as the bias. A plausible interpretation is given by D. Kriesel, who explains the bias as a technical trick for simplifying the training procedure. As said above, every activation function has a threshold, denoted θ, that determines its status. However, it is complex to realize the training of θ directly, so the bias is defined as an additional input whose connection weight replaces the threshold, thus balancing the threshold.

3. Activation Function: The activation function is an imitation of the biological neuron status. Given a set of input signals and connection weights, the activation function limits the output of the net input. The activation function is a key part of the artificial neuron and comes in different types; details of activation functions are given in the next section.

Figure 2.2: Artificial Neuron.

2.3 Activation Function

According to the properties of the biological neuron, each neuron has an excitation threshold, and this threshold behavior is binary, either ON or OFF. It means that when the stimulation from the input signals is larger than the threshold there will be an output response, and otherwise no response is given out from that neuron. Inspired by this property of the biological neuron, the activation function is defined as:

$$s_t = \varphi(\mathrm{net}(x), s_{t-1}, \theta)$$

The function transforms the network input net(x) and the previous activation state $s_{t-1}$ into the new activation state $s_t$ according to the threshold θ [5]. However, in reality the activation function of an artificial neuron is expected to be more general and continuous than just binary, so several types of activation functions have been developed; the four main kinds of activation functions are listed in Figure 2.3.


1. Linear Function: The linear function is the most fundamental activation function; it acts as a linear amplification of the network input signals:

$$\varphi(x) = kx + c \qquad (2.2)$$

2. Piecewise Linear Function: Although the linear function is quite simple to use, it limits the performance of the network and decreases the nonlinearity. In order to overcome this disadvantage, the piecewise linear function is introduced:

$$\varphi(x) = \begin{cases} 1 & x \geq \theta \\ kx & -\theta < x < \theta \\ 0 & x \leq -\theta \end{cases} \qquad (2.3)$$

where θ is a threshold.

3. Step Function: The step function behaves most like the human brain, with only two states, either 1 or 0:

$$\varphi(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases} \qquad (2.4)$$

4. Sigmoid Function: The sigmoid function is an S-shaped function, and it is the most commonly used function in artificial neural networks because of its differentiability, which is of vital importance for applying the back-propagation algorithm. Two kinds of sigmoid function are usually used, named the logistic function and the hyperbolic tangent function. The difference between these two functions is that the range of the former is [0, 1], while for the latter it is [-1, +1].

$$\varphi(x) = \frac{1}{1 + e^{-x}} \qquad (2.5)$$

$$\varphi(x) = \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} \qquad (2.6)$$

where equation 2.5 is the logistic function and equation 2.6 is the hyperbolic tangent function.
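For illustration only, the four activation function families above can be written down directly in Python/NumPy; the slope k and threshold θ are free parameters of the sketch, not values fixed by the thesis.

```python
import numpy as np

def linear(x, k=1.0, c=0.0):
    # Equation 2.2: linear amplification of the net input.
    return k * x + c

def piecewise_linear(x, k=1.0, theta=1.0):
    # Equation 2.3: saturates at 1 above +theta and at 0 below -theta.
    return np.where(x >= theta, 1.0,
                    np.where(x <= -theta, 0.0, k * x))

def step(x):
    # Equation 2.4: binary ON/OFF response, like the biological threshold.
    return np.where(x >= 0.0, 1.0, 0.0)

def logistic(x):
    # Equation 2.5: S-shaped, differentiable, output range [0, 1].
    return 1.0 / (1.0 + np.exp(-x))

def tanh_sigmoid(x):
    # Equation 2.6: hyperbolic tangent sigmoid, output range [-1, +1].
    return np.tanh(x)
```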

2.4 Neural Network Topology

Basically speaking, the existing neural network topologies can be divided into three basic categories: feed-forward networks, feed-back networks and self-organizing networks. The most widely used one is the feed-forward neural network topology. A two-layered feed-forward network has been proved to have the ability to approximate arbitrary functions.


[Figure 2.3 panels: (a) Piecewise Linear Function, (b) Step Function, (c) Logistic Sigmoid Function, (d) Hyperbolic Tangent Sigmoid Function.]

Figure 2.3: Activation Functions.

So in this thesis only the feed-forward network is considered, while details of the other two architectures can be found in [11].

A feed-forward neural network consists of multiple artificial neurons, with the output of a former neuron connected as the input of a latter neuron, and with no other connections in the network. Figure 2.4 gives an architecture of a multi-layered neural network, where there is one input layer, one hidden layer and one output layer. This is the most widely used neural network structure for function approximation problems, and the universal approximation theorem also states that a neural network with one hidden layer of multiple neurons can approximate an arbitrary function.

Figure 2.4: Feed-Forward Network.

For each artificial neuron, given the input values $x_i$ with connection weight values $w_i$, the network input is

$$\mathrm{net}(x) = \sum_{i=1}^{n} x_i w_i \qquad (2.7)$$

and the network output of one neuron is

$$O_i = \varphi(\mathrm{net}(x) + w_{bi}) + w_{obi} \qquad (2.8)$$

where $w_{bi}$ is the bias weight value of the hidden layer, and $w_{obi}$ denotes the bias weight value of the output layer.

So for a two-layer neural network with H hidden neurons, the network output is

$$O = \sum_{j=1}^{H} \varphi(\mathrm{net}_j(x) + w_{bj}) + w_{obj} \qquad (2.9)$$

where $\mathrm{net}_j(x)$ is the jth network input value.
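A minimal sketch of the forward pass described by equations 2.7 to 2.9, assuming a single output, a hyperbolic tangent activation and randomly chosen weights; the function and variable names are illustrative, not the thesis implementation.

```python
import numpy as np

def forward(x, W_hidden, b_hidden, b_out):
    """Two-layer feed-forward output following equations 2.7-2.9.

    x        : (n_inputs,) input vector.
    W_hidden : (H, n_inputs) hidden-layer weights, one row per hidden neuron.
    b_hidden : (H,) hidden-layer bias weights w_b.
    b_out    : scalar output-layer bias weight w_ob.
    """
    net = W_hidden @ x                 # equation 2.7: net_j(x) = sum_i x_i w_ji
    hidden = np.tanh(net + b_hidden)   # activation of each hidden neuron
    return np.sum(hidden) + b_out      # equation 2.9 as written in the text

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=3)
print(forward(x, rng.normal(size=(5, 3)), rng.normal(size=5), 0.1))
```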


Chapter 3

Methodology

In this work, the model build flow is mainly divided into two parts: data preparation and neural network training. In this chapter, the methodologies used for these two parts are presented in detail.

3.1 Neural Network Algorithm

The neural network is trained with mainly four procedures: Initialization, Training, Validation and Generalization. In this section the necessary algorithms used for the neural networks are described in detail, considering both efficiency and precision in the applications.

3.1.1 Weights Initialization Algorithm

The first step of training the network is to initialize the neural network connection weights. The original method is to randomize the connection weights to small values in order to reach the training goal. For each training run with randomly initialized weight values, the training time and the number of epochs needed to reach the goal are different, and the difference can be quite large. For example, given the data samples, two networks are trained with two different sets of initialized weights, W1 and W2, towards the same training goal, and the time consumptions are T1 and T2 respectively, where T1 > T2. Since a faster training procedure is obviously preferred, it can be concluded that the values of W2 are closer to the final answer. This raises the question of whether the weight values can be randomized according to specific functions so that the training procedure becomes faster. The most famous initialization method is the Nguyen-Widrow algorithm, developed by Derrick Nguyen and Bernard Widrow to improve the learning speed of two- or more-layer neural networks by initializing the weight values of the hidden layer into their own intervals [15]. From equation 2.7 it can be seen that each term of the sum is a linear function of x over a small interval, and this interval is determined by the values of the connection weights. So it is reasonable to first put the weights into their own intervals and then train the network, which takes less time. The algorithm is based on function approximation theory with the BP method and a sigmoid activation function.

Take a two-layer neural network with only one input and one output as an example. Suppose there are H hidden units and the function to be approximated is defined over the region between -1 and 1; then each hidden unit takes the interval 2/H on average. The sigmoid function is approximately linear over the interval:

$$-1 < w_i x + w_{bi} < 1 \qquad (3.1)$$

which yields the interval of x,

$$\frac{-1 - w_{bi}}{w_i} < x < \frac{1 - w_{bi}}{w_i} \qquad (3.2)$$

The incremental back-propagation training then loops over epochs, where E is the sum squared error, goal is the training goal, N is the epoch counter and M is the maximum number of epochs:

While E > goal && N ≤ M:

(a) N = N + 1; E = 0

(b) For i = 1 to n (n is the number of samples):

i. Calculate the network output with equation 2.8;

ii. Calculate the SSE with equation 3.5;

iii. Update the weights with equations 3.7 and 3.8.
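The loop above can be sketched as plain gradient-descent code. Since equations 3.5 and 3.7 are not reproduced in this excerpt, the sketch assumes a standard sum-squared-error cost and the usual delta-rule updates for a one-hidden-layer tanh network; it is an assumption for illustration, not the thesis implementation.

```python
import numpy as np

def train_bp(X, y, H=5, lr=0.01, goal=1e-3, max_epochs=1000, seed=0):
    """Incremental back-propagation for a one-hidden-layer tanh network.

    Follows the loop structure above: while E > goal and N <= M, reset E,
    then for every sample compute the output, accumulate the SSE and
    update the weights (standard delta rule assumed here).
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(H, X.shape[1]))  # hidden weights
    b1 = rng.normal(scale=0.5, size=H)                # hidden biases
    W2 = rng.normal(scale=0.5, size=H)                # output weights
    b2 = 0.0                                          # output bias

    E, N = np.inf, 0
    while E > goal and N <= max_epochs:
        N += 1
        E = 0.0
        for x, t in zip(X, y):
            h = np.tanh(W1 @ x + b1)          # hidden activations
            o = W2 @ h + b2                   # network output
            e = t - o
            E += e ** 2                       # accumulate sum squared error
            # Delta-rule weight updates (chain rule through tanh).
            grad_h = e * W2 * (1.0 - h ** 2)
            W2 += lr * e * h
            b2 += lr * e
            W1 += lr * np.outer(grad_h, x)
            b1 += lr * grad_h
    return W1, b1, W2, b2
```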

3.1.3 Levenberg-Marquardt Algorithm

Although the BP algorithm is the most widely used algorithm and the most basic and easiest method, it encounters some problems in implementation. The most serious one is that the convergence speed is too slow, and this effect becomes significant when the network has been trained for some steps; it also has problems with local minima, stability, etc. In order to improve the performance of the network training procedure, some modifications have been made to the BP method, and one of the most famous and widely used is the Levenberg-Marquardt (LM) algorithm. It was independently developed by Kenneth Levenberg and Donald Marquardt and provides a numerical solution to the problem of minimizing a nonlinear function, with fast and stable convergence. As a learning algorithm for the BP method, it relies on continuous differentiability. It is hard to know which learning algorithm will be best for a given problem because of multiple influencing factors, such as network size, training goal and computational complexity. From the guidance in the MATLAB User Guide on which training function should be used, it can be seen that for a function approximation problem with lower computational complexity, i.e. a smaller number of weights and biases, the LM algorithm performs very well.

The LM algorithm is a combination of the steepest descent method and the Gauss-Newton (GN) algorithm; it overcomes the slow convergence of the steepest descent method and relaxes the GN algorithm's requirement that the quadratic approximation of the error function be reasonable [22]. The steepest descent method has been introduced before, while the GN algorithm is based on Newton's method with the pre-assumption that all the gradient components of equation 3.6 are functions of the weights and all the weights are linearly independent. Thus the gradient vector can be expressed in a second-order form, where the Hessian matrix H and the Jacobian matrix J are used for simplification, and the update rule of the GN algorithm is

$$w_{i+1} = w_i - (J_i^T J_i)^{-1} J_i^T e_i \qquad (3.8)$$

where e is the error vector. The modification in the LM algorithm is to make the Hessian matrix $J^T J$ invertible; this is done by approximating the Hessian matrix as $H \approx J^T J + \mu I$, where $\mu$ is the combination coefficient. If $\mu$ is very large, the LM algorithm is equal to the steepest descent algorithm, while if $\mu$ is very small, the algorithm approximates the GN algorithm.
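For illustration, one LM weight update of the form above could look like the following sketch; computing the Jacobian of the per-sample errors with respect to the weights is problem-specific and is passed in here as a ready-made matrix, which is a simplifying assumption.

```python
import numpy as np

def lm_step(w, J, e, mu):
    """One Levenberg-Marquardt update: w_new = w - (J^T J + mu*I)^(-1) J^T e.

    w  : (p,) current weight vector.
    J  : (n, p) Jacobian of the n per-sample errors w.r.t. the p weights.
    e  : (n,) error vector.
    mu : combination coefficient; large mu behaves like steepest descent,
         small mu behaves like Gauss-Newton.
    """
    H_approx = J.T @ J + mu * np.eye(J.shape[1])   # damped Hessian approximation
    return w - np.linalg.solve(H_approx, J.T @ e)
```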


3.1.4 Training Procedure

The training procedure of the two-layered neural network consists of four steps: Initialization, Training, Validation and Generalization.

In the first step, the initialization of the weights according to the Nguyen-Widrow algorithm is implemented, and the weights are initialized into their own intervals.

To prepare a neural network for training, there are also other parameters to be initialized:

- Training Goal. The goal is the network performance that is expected to be achieved, expressed as the Mean Square Error (MSE) calculated between the network output data and the target data. The ideal case is to set the goal to zero, meaning that there is no error between the model-predicted data and the target data.

- Training Epochs. The maximum number of training epochs is used to avoid overtraining and belongs to the early stopping technique.

- Maximum Number of Validation Iterations. The number of validation iterations is also used for monitoring the training procedure; details will be discussed later.

- Hidden Size. The number of hidden neurons.

After the first step, the network is prepared for training with the selected learning algorithm. The training step is also called the learning ability of the network: given the input and target pairwise data, the network is trained with the pre-defined network settings listed above.

It has been mentioned above that, in this thesis work, any function approximation problem can be solved with a two-layered neural network as long as the hidden size is proper. So a single hidden layer is used, but the method of choosing a proper hidden size is much more complex than choosing the number of layers. There is no universal theorem on what size of a neural network is optimal; however, there is a balance between the model precision and the model cost, and there is a limit to the model precision because of the background noise level. It has been proved that a larger hidden size does not necessarily make the network performance more precise; sometimes a large network size will cause an over-fitting problem. So as a rule of thumb, the smaller the network size the better, under some pre-defined model precision limit. Experience shows that the hidden size should be 3 to 5 times larger than the number of input parameters. From the data analysis it is known that the number of input variables is more than ten, while the number of input variables that have a strong relationship with the task load is usually less than 5. Thus in this thesis work, the hidden size of the neural network is swept from 5 to 50 in steps of 5. As shown in Figure 3.2, when the number of hidden neurons gets larger the network performance tends to stop changing, so with a precision limitation threshold, denoted by the red dashed line in the figure, the smallest network size below the threshold, shown with a red circle, is picked as the optimal size.


Figure 3.2: Network performance on different hidden size.

The performance of the network is usually expressed as the Mean Squared Error (MSE),

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - o_i)^2 \qquad (3.9)$$

and one expects that the smaller the MSE, the better the network performance.
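The hidden-size search described above can be sketched as a simple sweep; `train_and_score` stands in for training a network of a given hidden size and returning its validation MSE (equation 3.9), and both that helper and the threshold value are assumptions for illustration.

```python
import numpy as np

def mse(y, o):
    # Equation 3.9: mean squared error between targets y and outputs o.
    return np.mean((np.asarray(y) - np.asarray(o)) ** 2)

def pick_hidden_size(train_and_score, sizes=range(5, 55, 5), threshold=1e-3):
    """Pick the smallest hidden size whose validation MSE is below threshold.

    train_and_score(H) is a user-supplied callable that trains a network
    with H hidden neurons and returns its validation MSE.
    """
    scores = {H: train_and_score(H) for H in sizes}
    acceptable = [H for H, s in scores.items() if s <= threshold]
    # Smallest acceptable size, or the overall best if none meets the threshold.
    return min(acceptable) if acceptable else min(scores, key=scores.get)
```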

Training of the network can usually give a pretty good function approximation result. However, this good result of the training procedure cannot guarantee the generalization of the trained neural network. The phenomenon is called over-fitting, which means that the trained network gives a good performance on one set of sample data, while its performance on a different set of sample data from the same population is much worse. In this case, the network has just memorized the training samples but fails to generalize to new situations. In order to avoid this phenomenon, several methods are implemented to improve the generalization of ANNs.

1. Data Division

Data division means the sample data are used not only for training but also for validation and testing. The key point of data division is to validate the network with a set of data totally different from the training data. In this report, the random data division method is selected, with three sets of data, training data, validation data and test data, divided according to some ratio. In the general case, the division ratio is taken as [train ratio : val ratio : test ratio] = [0.7 : 0.15 : 0.15]. During the training procedure, the training data is used for neural network training, while the validation data is used for monitoring the network performance. The training procedure continues until there has been no improvement in the validation performance for the maximum number of validation iterations. The test data is used for comparing different network structures on the same sample data. With the monitoring of the validation procedure, the network can achieve a good generalization ability; as presented in Figure 3.2, the blue line denotes the training result, while the red and green lines denote the validation and testing data separately. It is obvious that the three data sets have similar results, meaning that the network generalizes well.

2. Early Stopping

In artificial neural networks, early stopping is a method used to avoid over-fitting. This method is combined with the validation method during the network training procedure. If the validation error keeps increasing for several iterations, training of the network is stopped and a failure result is given. The increase of the validation error means that there is no improvement of the network performance for more iterations.
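A short sketch of the 0.7/0.15/0.15 random division and the validation-based early stopping check follows; the patience parameter `max_val_fail` is an illustrative stand-in for the maximum number of validation iterations mentioned above.

```python
import numpy as np

def random_division(n_samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Randomly split sample indices into training, validation and test sets."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def should_stop(val_mse_history, max_val_fail=6):
    """Early stopping: stop when the validation MSE has not improved
    for max_val_fail consecutive epochs."""
    best = np.argmin(val_mse_history)
    return len(val_mse_history) - 1 - best >= max_val_fail
```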

3.2 Data Preparation

Data preparation is an important technique in neural network modeling for complex data analysis, and it has a huge impact on the success of complex data analysis problems. In data analysis problems, experience shows that 50% to 70% of the effort in a project should go into data preparation. According to a data preparation design work flow [24], one of the most important steps is data subset selection. For decades, best subset selection has been considered a critical step in data analysis with neural networks, especially for large and complex data sets. A proper data subset can not only eliminate the prediction error caused by irrelevant input parameters, but also reduce the cost of building a model.

3.2.1 Data Selection

An important issue in many model estimation problems is to select the optimal subset of the data, known as the best subset. Subset selection is also called variable selection or feature selection in different areas. Given a data set X = {xi | i = 1, ..., N}, the aim of variable selection is to find a subset Xsub = {xi | i = 1, ..., M} with M < N that satisfies the model estimation mean square error criterion E(Xsub). Variable selection is actually a key factor of model estimation, with four main objectives:

- Unnecessary predictors such as noise or irrelevant variables can be picked out and removed.

- The prediction model is simplified, using only the necessary data.

- The predictive accuracy of the model is improved.

- The model training procedure becomes faster and more cost-efficient.


In the neural network modeling process, many subset selection methods have been developed. The most reliable method for searching for the best data subset is the All Subset Models method, which is also the most computationally expensive method. It consists of all the possible combinations of the n variables, where n is the number of predictors, and the total number of combinations is given by $2^n - 1$. As a consequence, when the number of variables becomes larger, the number of possible models increases exponentially, and the method becomes unsuitable. So many other intelligent methods have been developed.

One of the widely used methods is called Connection Weights, which calculates the sum of the weight values for each variable; the variable which contributes most will have the largest sum of weights [17]. Another popular method called PaD uses the partial derivatives of the network output variable with respect to the predictor variables to determine the influence of the input variables on the output [8]. Variable selection with a pruning method is given in [2]. Some other methods which also give the relative importance of input variables with respect to the output are presented in [18]; these methods can give direct information on the contributions of the predictors to the network output. However, most of these methods were developed in the ecological science field, and in this thesis work experiment results show that these methods, tailored for neural networks, are highly dependent on the initialization of the weight values. So some other methods used in the mathematical and machine learning fields are researched, such as correlation, principal components and sequential searching methods.

Principal Component Analysis

Among all the multivariate techniques, principal component analysis is the oldest and most widely used one, originally introduced by Pearson [6]. The core idea of PCA is to use a lower-dimensional data set instead of the original high-dimensional data set, while keeping the main data features. Each of the new principal components is a linear combination of the original variables, with the principal components derived in descending order of importance. This means that the first component captures the largest part of the variation within the original data set, and the second component is orthogonal to the first component and captures less than the first one, while still capturing the largest amount among the remaining components, and so forth.

Details of the derivation of PCA are given as follows:

The first principal component, defined as $p_1$, is a linear combination of the original variable set, $p_1 = a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n = a_1^T x$, with the largest sample variance among all such linear combinations. A restriction must be placed on the coefficient vector $a_1$, since otherwise the variance of $p_1$ could increase without limit. According to related work, the sum of squares of the coefficient vector is restricted to one:

$$a_1^T a_1 = 1$$


Then a second principal component $p_2 = a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n$ is formulated with the largest variance subject to the following two conditions:

$$a_2^T a_2 = 1$$

$$a_2^T a_1 = 0$$

The second condition denotes the orthogonality between the first and second principal components, and the same procedure is applied to the remaining components.

As mentioned above, for $p_1$ the coefficient vector $a_1$ should give the largest variance of $p_1$, $\mathrm{Var}(p_1) = \mathrm{Var}(a_1^T x) = a_1^T S a_1$, under the constraint $a_1^T a_1 = 1$, where S is the covariance matrix of x. This gives the result that $a_1$ is the eigenvector of S corresponding to the largest eigenvalue, and so on for the remaining components.

The usual objective of principal component analysis is to see whether the first few components can account for most of the variation in the original data set; if so, a reduction of dimension can be achieved. This property can also be applied to variable selection, called variable discarding [10]. The basic idea is that the variable which dominates in the least important principal component accounts for the least importance, and if the eigenvalue λ of this principal component is smaller than some threshold, then this variable can be discarded.

The detailed implementation is that, after PCA has been performed on all p original variables, each turn starts with the component corresponding to the smallest eigenvalue and discards the variable with the largest coefficient in this least important component. The variable to be discarded should not be associated with previously considered variables. The procedure stops when the threshold criterion on $\lambda_0$ has been reached, where $\lambda_0$ is the smallest eigenvalue of each iteration.
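A rough sketch of the variable-discarding idea: eigendecompose the covariance matrix, look at the least important component, and drop the variable with the largest loading there while that component's eigenvalue stays below a threshold. The threshold value and the exact stopping rule are assumptions; the thesis only fixes the general procedure.

```python
import numpy as np

def pca_discard(X, names, lam_threshold=0.05):
    """Iteratively discard variables that dominate the least important
    principal component, as long as that component's eigenvalue is small.

    X     : (n_samples, p) data matrix.
    names : list of p variable names.
    Returns the names of the retained variables.
    """
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))

    while len(keep) > 1:
        S = np.cov(X[:, keep], rowvar=False)      # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
        lam0, a_min = eigvals[0], eigvecs[:, 0]   # least important component
        if lam0 >= lam_threshold:                 # nothing negligible left: stop
            break
        # Discard the variable with the largest (absolute) coefficient
        # in the least important component.
        keep.pop(int(np.argmax(np.abs(a_min))))
    return [names[i] for i in keep]
```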

The advantage of PCA variable discarding is that it can extract the main features of the sample data, while the disadvantage of PCA is that it does not take the response variable into consideration, so the relationship between the selected predictors and the response variable may be weak. According to Ali S. et al., it is possible that with PCA the first (p-1) principal components contribute nothing toward the reduction of the residual sum of squares, while the last component contributes everything [9]. However, the last component is the one that will always be ignored.

Partial Correlation

When discussing the relationship between a dependent variable and an independent variable, one usually thinks of the coefficient of correlation between them. Its absolute value ranges between 0 and 1; if the coefficient is close to 1, there is a strong relationship between the two variables, otherwise the relationship is weak. However, in this thesis work there is more than one independent variable. So when one wants to obtain the relationship between one of the independent variables and the dependent variable, the possible effects caused by the other independent variables need to be eliminated. Thus partial correlation is introduced here. The main purpose of partial correlation is to determine the coefficient of correlation between the target variable and each of the input variables while eliminating the effects of the other remaining input variables.

Figure 3.3: Graph Illustration of Partial Correlation.

As shown in Figure 3.3, suppose there are two input variables X1, X2 and one target variable Y; the variance of Y is represented by the circles in the figure.

- a is the variance of Y that cannot be explained by either X1 or X2

- b is the variance of Y explained only by X1

- c is the variance of Y explained by both X1 and X2

- d is the variance of Y explained only by X2

- Variance of Y explained by X1, X2: b + c

- Variance of Y not explained by X1, X2: a

- Variance of Y explained only by X1: b

So the coefficient of partial correlation between Y and X1, excluding the effect of X2, is $\frac{b}{a+b}$. Thus the formula for the squared coefficient of partial correlation is

$$r^2_{Y.X_1|X_2} = \frac{(r_{Y.X_1} - r_{Y.X_2}\, r_{X_1.X_2})^2}{(1 - r^2_{Y.X_2})(1 - r^2_{X_1.X_2})} \qquad (3.10)$$
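The formula can be checked numerically with a few lines; the helper below computes the first-order partial correlation of y and x1 controlling for x2 directly from the three pairwise correlations, which matches equation 3.10 up to the square. The example data at the end is purely synthetic.

```python
import numpy as np

def partial_corr(y, x1, x2):
    """First-order partial correlation r_{y,x1|x2} (square it to get eq. 3.10)."""
    r_y1 = np.corrcoef(y, x1)[0, 1]
    r_y2 = np.corrcoef(y, x2)[0, 1]
    r_12 = np.corrcoef(x1, x2)[0, 1]
    return (r_y1 - r_y2 * r_12) / np.sqrt((1 - r_y2**2) * (1 - r_12**2))

# Example: x2 drives both y and x1, so the partial correlation is near zero
# even though the plain correlation between y and x1 is high.
rng = np.random.default_rng(1)
x2 = rng.normal(size=2000)
x1 = x2 + 0.1 * rng.normal(size=2000)
y = 2 * x2 + 0.1 * rng.normal(size=2000)
print(np.corrcoef(y, x1)[0, 1], partial_corr(y, x1, x2))
```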

Although the coefficient of partial correlation can give a precise strength of the relationship between two variables, it still rests on simple correlation, which means that there is a pre-assumption that the relationship between the two variables is linear. However, in reality most relationships between variables in the physical world are not simply linear; there may be many other relationships that cannot be described by linearity, so another method tailored for neural networks is introduced in case partial correlation cannot give a proper result.

Modified Sequential Feature Selection

In order to fit the selected variables to the neural network model, a method that combines backward Sequential Feature Selection (SequentialFS) with a neural network is developed in this thesis work.

As in the definition of SequentialFS, a criterion function is defined to measure whether there is an improvement in each sequential feature selection iteration. In this work, a similar criterion function is used during the network training procedure, with the precondition that the network settings are all kept the same. Given a training data set X = {x1, x2, ..., xn} and Y = y with a fixed neural network, including the initialization of weights and biases, the number of hidden neurons, and the number of training epochs, a sequential searching method is applied to the data set until the criterion function is satisfied. From the experiment results in this report, the criterion function based on SequentialFS is defined as the ratio between the previous network performance and the new network performance. For example, first the fixed network is trained with the whole data set and a network performance is obtained, named the previous performance MSEpre. Then, with sequential searching, in the first iteration the variable which gives the least contribution to the output is left out. Here the contribution means that if the specific variable is turned off, the network performance MSE will not change much. The fixed network is trained again with the remaining data set and a second network performance is obtained, named the ongoing performance MSEog. The criterion is defined as

$$\mathrm{crit} = \frac{MSE_{og}}{MSE_{pre}} \qquad (3.11)$$

If the criterion is larger than 10, crit > 10, meaning that leaving this ongoing variable $x_i$ out makes the network performance significantly worse, then the sequential searching stops at this step and the remaining variables are retained.

The algorithm goes as follows:

1. Train NET with the full data set X and Y, save the performance as MSE1, and save all the network settings:
   - seed for random weight value generation;
   - network hidden size;
   - number of training iterations.

2. Sequentially leave one variable $x_i$ out, train the network with the same settings as in step 1, and obtain a set of network performances

$$MSE_{temp} = (MSE_{11}, MSE_{12}, \ldots, MSE_{1n}) \qquad (3.12)$$

24

3. Remove the variable which gives minimum change of network performanceaccording to equation 3.13, and train the network with remaining (n 1)variables, obtain a network performance MSE2.

arg minxMSE = arg min

xMSE MSE1 (3.13)

4. If MSE1MSE2 < 10, repeat step 2,3,4; else, stop searching and keep theremaining variables as selected subset.

Because the performance results are influenced by the initialization of the weight values (different seeds for the random weights sometimes give different network performance), the SequentialFS algorithm is run 50 times in order to obtain a reliable result, and the data subset is chosen according to a threshold based on the accumulated result.

thre = N_i / N_exp    (3.14)

N_i is the accumulated number of times variable i is retained, and N_exp is the total number of experiments. Based on a large number of experimental results, a threshold value of 0.9 is chosen.
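A minimal sketch of the selection loop described above is given below, with the stopping criterion of equation 3.11 and the retention threshold of equation 3.14. The helper train_and_evaluate is a hypothetical stand-in for training the fixed network (fixed seed, hidden size and epochs) on a data subset and returning its MSE; it is not part of the thesis code.

import numpy as np

def backward_sequentialfs(X, y, train_and_evaluate, ratio_limit=10.0, seed=0):
    # Backward SequentialFS (steps 1-4 above): repeatedly drop the variable whose
    # removal changes the MSE the least, and stop when removing one more variable
    # would push the MSE ratio above ratio_limit (crit > 10, eq. 3.11).
    remaining = list(range(X.shape[1]))
    mse_prev = train_and_evaluate(X[:, remaining], y, seed)        # step 1
    while len(remaining) > 1:
        # Step 2: MSE obtained when each remaining variable is left out in turn (eq. 3.12).
        trial = [train_and_evaluate(X[:, [v for v in remaining if v != i]], y, seed)
                 for i in remaining]
        # Step 3: candidate whose removal changes the performance the least (eq. 3.13).
        best = int(np.argmin(np.asarray(trial) - mse_prev))
        if trial[best] / mse_prev >= ratio_limit:                  # step 4 / eq. 3.11
            break                                                  # removal would hurt too much
        mse_prev = trial[best]
        remaining.pop(best)
    return remaining

def select_by_threshold(X, y, train_and_evaluate, n_runs=50, thre=0.9):
    # Repeat the selection with different weight seeds and keep the variables
    # retained in at least a fraction `thre` of the runs (eq. 3.14).
    counts = np.zeros(X.shape[1])
    for seed in range(n_runs):
        for v in backward_sequentialfs(X, y, train_and_evaluate, seed=seed):
            counts[v] += 1
    return np.where(counts / n_runs >= thre)[0]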

See Figure 3.4: the performance of each iteration over the whole data set, based on the leave-one-out method, is shown there, and it can be seen that the network performance becomes significantly worse when the eleventh variable is removed. The accumulated result of 50 experiments is shown in Figure 3.5, indicating that among all 50 experiments variables 6 and 7 are retained every time, while some other variables, such as variables 1 and 4, are never chosen. According to the threshold equation 3.14, the variables with a ratio larger than 0.9, namely variables 6 and 7, are chosen as the selected data subset; they are crossed by the red line in Figure 3.5.

However, when one looks in detail at the slightly fluctuating performance in Figure 3.4, it can be found that there is always a global minimum, shown as the red dot in Figure 3.6. This means that at this point the variables give the minimum MSE for the network. A simple modification is therefore made to the SequentialFS method: instead of picking the point that gives a significantly worse network performance, the modified method chooses the set of variables that gives the minimum MSE. This method is called Modified SequentialFS (MSFS) and the algorithm goes as follows:

1. Train the network with the full data set X and Y, save the performance as MSE_1, and save all the network settings: the seed used to generate the random weight values, the network hidden size, and the number of training iterations.

2. Sequentially leave one variable x_i out, train the network with the same settings as in step 1, and obtain a set of network performances

MSE_temp = (MSE_11, MSE_12, ..., MSE_1n)    (3.15)

3. Remove the variable whose omission gives the minimum change of network performance according to equation 3.13, train the network with the remaining (n - 1) variables, and obtain a network performance MSE_2.

4. Repeat step 3 until there is only one variable left, and obtain an array of performances MSE = (MSE_1, MSE_2, ..., MSE_n).

5. Choose the subset that gives the minimum network performance in MSE, and save it as the optimal data set.
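The modified procedure can be sketched in the same style: the elimination is run all the way down, the MSE after every removal is recorded, and the subset at the global minimum is returned. As before, train_and_evaluate is a hypothetical helper that trains the fixed network on a subset and returns its MSE.

import numpy as np

def modified_sequentialfs(X, y, train_and_evaluate, seed=0):
    # MSFS: run backward elimination to the end, record the MSE after each
    # removal, and return the subset that gave the global minimum MSE.
    remaining = list(range(X.shape[1]))
    path = [(list(remaining), train_and_evaluate(X[:, remaining], y, seed))]
    while len(remaining) > 1:
        mse_prev = path[-1][1]
        trial = [train_and_evaluate(X[:, [v for v in remaining if v != i]], y, seed)
                 for i in remaining]
        best = int(np.argmin(np.asarray(trial) - mse_prev))   # least change (eq. 3.13)
        remaining.pop(best)
        path.append((list(remaining), trial[best]))
    # Step 5: keep the subset with the minimum MSE along the elimination path.
    return min(path, key=lambda p: p[1])[0]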

The same subset selection threshold is applied, and in this case five variables are retained, including variables 6 and 7, as shown in Figure 3.7.

Figure 3.4: Network Performance with SequentialFS (MSE per removal iteration).

Figure 3.5: Accumulate Number of Retain Variables with SequentialFS (retention count per input variable over 50 runs).

Figure 3.6: Network Performance with Modified SequentialFS (MSE per removal iteration).

Figure 3.7: Accumulate Number of Retain Variables with Modified SequentialFS (retention count per input variable over 50 runs).

In order to compare these two variable selection methods, twenty networks of different sizes are trained with each method, and the results are shown in Table 3.1 and Table 3.2. From the results it is clear that Modified SequentialFS performs better.


Sequential Selection   Net1     Net2     Net3     Net4     Net5     Net6     Net7     Net8     Net9     Net10
Hidden Size            5        10       15       20       25       30       35       40       45       50
Performance            16.4406  15.5052  14.7766  15.3447  14.7365  14.7417  14.7098  14.7093  14.7157  14.7191

Sequential Selection   Net11    Net12    Net13    Net14    Net15    Net16    Net17    Net18    Net19    Net20
Hidden Size            55       60       65       70       75       80       85       90       95       100
Performance            14.7835  14.7010  14.7514  14.7359  14.6946  14.7600  14.7167  14.7238  14.6968  14.7231

Table 3.1: Network Performance Result with SequentialFS method. Performance is the network MSE.

Modified Sequential    Net1      Net2     Net3     Net4     Net5     Net6     Net7     Net8     Net9     Net10
Hidden Size            5         10       15       20       25       30       35       40       45       50
Performance            568471.7  15.4328  13.8075  13.5949  13.3484  13.4236  13.5412  13.7089  13.3477  13.5193

Modified Sequential    Net11     Net12    Net13    Net14    Net15    Net16    Net17    Net18    Net19    Net20
Hidden Size            55        60       65       70       75       80       85       90       95       100
Performance            13.2847   13.4230  13.4400  13.4244  13.6849  13.5040  13.4798  13.2771  13.2947  13.5466

Table 3.2: Network Performance Result with Modified SequentialFS method.


3.2.2 Data Normalization

Prior to the neural network training, it is better to transform the data set so that the predictor and response variables exhibit particular distributional characteristics, as mentioned in [24]. The response variable must be converted to the range [0, 1] so that it conforms to the demands of the transfer function (sigmoid function) used in building the neural network. This is accomplished by using the formula:

T_n = (Y_n - min(Y)) / (max(Y) - min(Y))    (3.16)

where T_n is the converted response value for observation n, and Y_n is the original response value for observation n. min(Y) and max(Y) represent the minimum and maximum values, respectively, of the response variable Y. Note that the response variable does not have to be converted when modeling a binary response variable, because its values already fall within this range.
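A minimal sketch of the conversion in equation 3.16, together with the inverse mapping that is needed to turn network outputs back into task-load values (the inverse step is implied rather than stated here):

import numpy as np

def minmax_scale(y):
    # Map the response to [0, 1] (eq. 3.16); also return min/max so the
    # transform can be undone on the network outputs.
    y = np.asarray(y, dtype=float)
    y_min, y_max = y.min(), y.max()
    return (y - y_min) / (y_max - y_min), y_min, y_max

def minmax_unscale(t, y_min, y_max):
    # Inverse transform: recover the original response scale.
    return t * (y_max - y_min) + y_min

t, lo, hi = minmax_scale([120.0, 260.0, 530.0, 980.0])
print(t, minmax_unscale(t, lo, hi))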


Chapter 4

Simulation and Results

In this chapter, the simulation is done with MATLAB and the results are collected and shown. First, the comparison between different methods is presented in order to choose the optimal method. Then, one representative is given for each kind of task and the results are shown with figures and tables.

4.1 Parameter Settings

Goal = 0

Epoch = 1000

Maximum validation iteration: 10

Data division ratio: train : val : test = 0.70:0.15:0.15

Minimum hidden size: 5

Maximum hidden size: 50

Hidden size step: 5
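Collected into a single configuration object, the settings above might look like the sketch below; the field names are illustrative and not taken from the thesis code.

TRAINING_CONFIG = {
    "goal": 0.0,                    # target MSE; 0 means another stop criterion ends training
    "max_epochs": 1000,
    "max_validation_failures": 10,  # early stopping after 10 validation increases
    "split": {"train": 0.70, "val": 0.15, "test": 0.15},
    "hidden_size_min": 5,
    "hidden_size_max": 50,
    "hidden_size_step": 5,
}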

4.2 Model Build Flow

For all the tasks to be approximated, a fixed model-building work flow is followed; the flow chart is shown in Figure 4.1. Following the flow chart, for each specific task the paired input and target data are first put into the Model Tool. With a proper subset selection method, the optimal subset of the data is selected for training the neural network. If the model satisfies all of the model fit criteria, it is accepted, and the trained network with its specific hidden size and fixed weight values is saved as the estimation model. If the model does not satisfy all of the model fit criteria, whether to accept it or not depends on the specific criterion that fails; different tasks show different results.

Figure 4.1: Flow Chart of Model Build.
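A compact sketch of this build loop is given below; select_subset, train_network and meets_criteria are hypothetical helpers standing in for the subset selection, network training and model fit checks described in this chapter.

def build_model(X, y, select_subset, train_network, meets_criteria,
                hidden_sizes=range(5, 55, 5)):
    # Model build flow of Figure 4.1: pick a data subset, then search over
    # hidden sizes until the model fit criteria are satisfied.
    subset = select_subset(X, y)                  # e.g. partial correlation or MSFS
    for hidden in hidden_sizes:
        net = train_network(X[:, subset], y, hidden)
        if meets_criteria(net, X[:, subset], y):  # precision, R^2, regression checks
            return net, subset, hidden            # accept and save this model
    return None, subset, None                     # rejected: no size met the criteria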

4.2.1 Model Fit Criteria

Precision Level

The background noise was introduced in Chapter 1 and expressed as equation 1.3. In order to measure the model performance, a comparison between the model estimation error and the noise estimation should be made. The model estimation error is denoted MSE_M,

MSE_M = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)^2    (4.1)

where ŷ_i is the estimate of the task load given by our model. From the model's point of view, the noise estimation can be treated as the limit of the model estimation MSE_M, meaning that the closer MSE_M is to MSE_N, the better the model is, but MSE_M can never be smaller than MSE_N.

Also, from the tables in the previous chapter it can be seen that there is almost no improvement in network performance as the hidden size increases. So for a neural network approximation problem, a larger hidden size does not necessarily give a better network performance; similar results can be found in [12]. Moreover, in a model estimation problem the model is expected to be as simple as possible under the given criteria. In a neural network, the fewer the neurons are, the simpler the model is. In order to make the search for the number of hidden neurons efficient, the seeking rule is defined as follows:

1. Set the threshold criterion.

2. Initialize the network with a hidden size of 5 and the other predefined initialization parameters.

3. Train the network.

4. If the network training stops without achieving the threshold, increase the hidden size by 5 and repeat steps 2 and 3; if the network achieves the threshold, stop training and save the network.

Based on the experience from the experiments, a precision criterion has been defined. The precision level is defined as the difference between the network performance MSE_M and the noise estimation MSE_N, divided by MSE_N, expressed as

ε = (MSE_M - MSE_N) / MSE_N    (4.2)

From a large number of experimental results, two precision levels have been set, with the higher level equal to 0.1 and the lower level equal to 2. This means that, for one specific task, if ε < 0.1 for some hidden size, that network satisfies the higher precision criterion. However, if none of the networks satisfies the higher level, a re-examination is made to check whether there is a network that satisfies the second precision level. The two-level criterion was made first to guarantee accurate model results, and second to guarantee that most of the task models can be accepted under the criterion.
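A minimal sketch of the hidden-neuron seeking rule combined with the two precision levels is shown below; mse_of_network is a hypothetical helper that trains a network with the given hidden size and returns its MSE_M.

def seek_hidden_size(mse_of_network, mse_noise,
                     levels=(0.1, 2.0), sizes=range(5, 55, 5)):
    # Train networks with increasing hidden size and stop as soon as the
    # precision level eps (eq. 4.2) drops below the higher level; if no size
    # manages that, re-examine the already-trained networks against the lower level.
    trained = {}
    for h in sizes:
        trained[h] = mse_of_network(h)            # train and record MSE_M
        eps = (trained[h] - mse_noise) / mse_noise
        if eps < levels[0]:
            return h, eps                         # higher precision level met
    for h, mse_m in trained.items():              # re-examination pass
        eps = (mse_m - mse_noise) / mse_noise
        if eps < levels[1]:
            return h, eps                         # lower precision level met
    return None                                   # model rejected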

Correlation and Regression

The goal of correlation analysis is to see whether the predictor variables and the response variable co-vary, and also to measure the strength of the linear relationship between them, while regression analysis of a scatter plot gives a visual impression of the relationship between two variables. The most common correlation measure is the Pearson Product Moment Correlation Coefficient (PPMCC), which is calculated as:

R = Σ_{i=1}^{n} (y_i - ȳ)(o_i - ō) / sqrt( Σ_{i=1}^{n} (y_i - ȳ)^2 · Σ_{i=1}^{n} (o_i - ō)^2 )    (4.3)

y_i is the target data and o_i is the network output data; ȳ and ō denote their respective means.

As a rule of thumb, the ranges of |R| and the corresponding strength of the relationship between two variables are shown in Table 4.1.

31

Absolute R value    Strength of relationship
[0.5, 1]            Strong
[0.3, 0.5]          Moderate
[0.1, 0.3]          Weak
[0, 0.1]            Very weak or none

Table 4.1: Range of Coefficient of Correlation.

For the model evaluation, the correlation analysis uses the squared coefficient of correlation (R squared), also called the coefficient of determination. It acts as the measurement value that indicates how well the data fit a model. The coefficient R^2 denotes the proportion of the linear variation in the target data explained by the variation in the predicted data. The model fit criterion is given as:

R^2 > 0.80    (4.4)

On the other hand, the regression relationship between the predicted output data and the target data is expressed as an explicit equation:

Output = a · Target + b    (4.5)

in which a denotes the regression slope and b the intercept. The regression line is compared against the 1:1 line.

Ideally, the predicted output data is expected to equal the target data, and the regression relationship between the two variables is often presented in a scatter plot such as Figure 4.2, where the black circles denote the data points. The x-coordinate of each black circle is the target value and the y-coordinate is the output value from the model estimation, and the dashed line is the regression line between output and target data. In this case, the regression result equals the ideal result; thus a is expected to be close to 1, while b is expected to be close to 0. In real cases, however, the data are usually disturbed by noise, and sometimes the effect of the noise is severe, as in Figure 4.3, where the black circles and the grey dashed line are the same as in Figure 4.2 and the blue circles represent the additional noisy data. From these points it is easy to see that the target values are larger than the output values. In this case the regression line is the blue dashed line, which is inaccurate compared with the ideal case, and numerically it can be deduced that a is far from 1 or b is far from 0.

The regression criteria for model fit are

a ≈ 1    (4.6)

b ≈ 0    (4.7)
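The correlation and regression criteria can be checked directly from the target and output vectors. The sketch below uses NumPy; the numeric tolerances for a ≈ 1 and b ≈ 0 are illustrative assumptions, since the text does not give explicit bounds for them.

import numpy as np

def model_fit_criteria(target, output, r2_min=0.80, slope_tol=0.05, intercept_tol=0.05):
    # Check R^2 > 0.80 (eq. 4.4) and the regression Output = a*Target + b
    # (eq. 4.5) against a close to 1 and b close to 0 (eqs. 4.6-4.7).
    target = np.asarray(target, dtype=float)
    output = np.asarray(output, dtype=float)
    r = np.corrcoef(target, output)[0, 1]              # PPMCC, eq. 4.3
    a, b = np.polyfit(target, output, 1)                # least-squares regression line
    ok = (r ** 2 > r2_min
          and abs(a - 1.0) < slope_tol
          and abs(b) < intercept_tol * np.ptp(target))  # intercept scaled by data range
    return ok, r ** 2, a, b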


Figure 4.2: Ideal Regression between Predicted Output Data and Target Data.

Figure 4.3: Noisy Regression between Predicted Output Data and Target Data.

4.3 Result and Analysis

4.3.1 Type I Task Representation: Function1

Subset Selection

First of all, the PCA method is used for subset selection, and the selected predictors are listed in Table 4.2. With these three selected predictors, the neural network is trained with hidden sizes from 5 to 50, and the performance results and precision values are presented in Table 4.6. The background noise estimation of Function1 is 259.4977, and from the results it can be seen that the subset selected with PCA cannot achieve either of the precision goals ε < 0.1 and ε < 2. Over a large number of experiments, the results show that subset selection with PCA satisfies the precision criteria only occasionally, so the PCA method is not considered further in this model building flow.

Selected Subset: X1, X2, X7

Table 4.2: Function1 : Subset Selection with PCA.

Since the PCA method cannot satisfy the model requirement, the partial correlation method is considered next; the coefficients of correlation between the predictors and the response are listed in Table 4.3. It is obvious that the best subset is [X5]. After training the network, the result is presented in Table 4.4, where it can be seen that the optimal network size is 5.


Predictor                    X1      X2       X3   X4       X5      X6      X7       X8       X9       X10     X11
Coefficient of Correlation   0.0352  -0.0628  NaN  -0.0084  0.9999  0.0119  -0.0093  -0.1766  -0.0032  0.0634  NaN

Table 4.3: Function1 : Coefficient of correlation between predictors and response. NaN means that when all the effects of the other predictors are eliminated, the present predictor is constant with respect to the response.

Hidden Size   5
Training      279.9037
Validation    281.3098
Test          286.9439
Mean          281.1707
ε             0.0835

Table 4.4: Function1 : Network performance with different hidden sizes based on the Partial Correlation subset selection method. When the network size is 5, the precision level is less than 0.1, meaning that the precision criterion is achieved, so the training is stopped.

Model Results

From the subset selection part, with the partial correlation method and a hidden size of 5, the network performance satisfies the first model fit criterion: the precision criterion. However, the value of the network performance alone cannot give an intuitive impression of how well the model performs, so the more detailed criteria, correlation and regression, should also be examined.

The regression plot is presented in Figure 4.4. From the analysis of the three sub-figures, several conclusions can be drawn:

The scatter plot between output and target data is almost a straight line, so the approximation of the output data to the target data is very good.

The coefficient of correlation between output and target data is R = 0.99999, so R^2 = 0.99998, which is larger than 0.80. Thus the relationship between output and target values is very strong, and the correlation criterion is satisfied.

The regression coefficient is approximately 1 and the offset is approximately zero, so the regression criterion is satisfied.

All three data sets give almost the same performance result, meaning that the trained network generalizes well; this conclusion is confirmed by the performance plot in Figure 4.5.

The scatter plots are shown in Figure 4.6 and Figure 4.7; the former shows the relationship between the selected variable X5 and both the target data and the predicted data, while the latter shows the plot of sample data.


Figure 4.4: Function1 : Regression plot between target data and estimated output data. The first sub-figure with the blue fit line is the result for the training data, the green one is for the validation data and the red one is for the test data (Train: R=0.99999, Validation: R=0.99999, Testing: R=0.99999; in all three, Output ≈ 1·Target with an offset below 0.1). Within each sub-figure, the black circles denote the data points of output versus target data, the colored solid line is the fit between the two variables, and the dashed line is the ideal case, where Y = T. Ideally, the fit line is expected to coincide with the dashed line.

Figure 4.5: Function1 : Performance plot of Function1. The figure gives a direct impression of the network performance, with all three data sets (training, validation and test) showing the same performance result, meaning the network generalizes well. The dashed line denoted as best validation is the smallest validation error (MSE) among all 1000 epochs; the best validation performance is 281.3098 at epoch 1000.


From both figures it can be seen that the predicted data are consistent with the target data; thus this trained network can be accepted as the Function1 model.

Figure 4.6: Function1 : Scatter plot of the predictor (X5) with respect to the target data (action cycles).

Figure 4.7: Function1 : Scatter plot of 100 sample points (action cycles per sample).

The trained neural network structure is shown in Figure 4.8, with one input, one output, five hidden neurons and the associated bias neurons. b_h and b_o are the hidden bias neurons and the output bias neuron, while IW and LW are the input-layer weight values and the layer weight values, respectively. The figure also indicates that the activation functions of the hidden layer and the output layer are the sigmoid and the linear function, respectively. The weight values of the trained neural network are listed in Table 4.5.

Figure 4.8: Function1 : Trained neural network of Function1.
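With the saved weights, evaluating the model reduces to a single forward pass. The sketch below assumes the 1-5-1 structure of Figure 4.8 with a log-sigmoid hidden layer and a linear output; the zero-valued arrays are placeholders for the trained values listed in Table 4.5.

import numpy as np

def predict_task_load(x5, IW, bh, LW, bo):
    # Forward pass of the trained network: one input, five sigmoid hidden
    # neurons, one linear output neuron (Figure 4.8).
    hidden = 1.0 / (1.0 + np.exp(-(IW * x5 + bh)))   # logsig hidden layer
    return float(LW @ hidden + bo)                   # linear output layer

# Shapes only; the actual trained values are given in Table 4.5.
IW = np.zeros(5)   # input-to-hidden weights
bh = np.zeros(5)   # hidden biases
LW = np.zeros(5)   # hidden-to-output weights
bo = 0.0           # output bias
print(predict_task_load(50.0, IW, bh, LW, bo))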

Same type functions

The results of the other functions of the same type as Function1 are presented in Table 4.7.


Connection Weights            Bias Weights
IW         LW         b_h        b_o
5.0226     5.2174     7.8976     3.2383
1.4341     0.0500     0.0297
0.0740     16.8655    0.5526
3.3971     0.0171     1.7509
87.2458    0.0120     82.2127

Table 4.5: Function1 : Trained neural network weights results.


Hidden Size   5       10      15      20      25      30      35      40      45      50
Training      1.1046  1.1051  1.1067  1.0994  1.0892  1.1029  1.1037  1.1060  1.1048  1.1020
Validation    1.0984  1.0835  1.1002  1.0942  1.1431  1.1023  1.0782  1.0821  1.1216  1.0935
Test          1.0903  1.1035  1.0792  1.1189  1.1173  1.0945  1.1147  1.0997  1.0663  1.1081
Mean          1.1015  1.1016  1.1016  1.1015  1.1015  1.1015  1.1015  1.1015  1.1016  1.1016
ε             0.0042  0.0042  0.0042  0.0042  0.0042  0.0042  0.0042  0.0042  0.0042  0.0042

Table 4.6: Function1 : Network performance with different hidden sizes based on the Principal Component Analysis subset selection method. The performance values and the threshold ratio are multiplied by 10^7.

Function     SS Method  Subset         Hidden Size  MSE_N     MSE_M     ε        R^2     a       b
Function2    PC         [X5, X7]       15           327.7430  350.4435  0.0693   0.9999  0.9999  0.2697
Function3    PC         [X5, X7]       20           27390     28730     0.0489   0.9841  0.9845  41.5972
Function4    PC         [X5, X7]       20           3393.6    3427.1    0.0099   0.9996  0.9994  2.5740
Function6    PC         [X5, X6, X7]   10           1183.6    1232.5    0.0413   1.0000  1.0001  0.3643
Function9    PC         X5             5            1924.4    1895.4    -0.0151  0.9988  0.9988  5.9365
Function13   PC         X5             5            66.7829   67.4440   0.0099   1.0000  1.0000  0.0026
Function15   MSFS       [X4, X5]       5            618.8     651.8522  0.0533   0.9999  1.0000  0.1769
Function17   PC         X5             10           2212.1    2219.6    0.0034   0.9999  0.9999  0.4306

Table 4.7: Function approximation results of Type I Functions. SS: Sequential Selection. PC: Partial Correlation. MSFS: Modified SequentialFS. ε is the precision level; R^2, a and b are the model fit criteria.


The subset selection for some of the functions retains one variable, while for others it retains two. Take Function2 as an example: the selected variables are [X5, X7]. The scatter plots are shown in Figure 4.9 and Figure 4.10. In the first figure there are two similar lines, which means that for each specific value of X5 there are two values of the task load, so the task load with respect to predictor X5 is evidently also affected by another variable. Further data analysis shows that the other controlling variable is X7.

Another special case is Function15. When the subset data are chosen with the partial correlation method and the network is trained, the network performance cannot meet the requirement, see Table 4.8. One suspicion is that the subset selection is incorrect, so the Modified SequentialFS method is used for Function15. In order to save execution time, ten experiments were done, and the accumulated result of the variable selection is shown in Figure 4.11.

Figure 4.9: Function2 : Scatter plot of predictor 5 with respect to the task load (action cycles).

Figure 4.10: Function2 : Scatter plot of predictor 7 with respect to the task load (action cycles).

Figure 4.11: Function15 : Subset selection with the Modified SequentialFS method (accumulated number of retained variables over ten experiments).


Hidden Size   5       10      15      20      25      30      35      40      45      50
Training      20888   20008   19902   19444   19626   19409   19191   18902   19502   19685
Validation    21794   18224   18566   20110   20763   18615   19427   19830   18659   18998
Test          20356   18886   16826   18907   17600   18898   19168   20105   18456   17266
Mean          20944   19572   19240   19463   19492   19213   19223   19221   19219   19219
ε             32.844  30.627  30.09   30.45   30.498  30.047  30.062  30.06   30.056  30.056

Table 4.8: Function15 : Function approximation results with Partial Correlation.

Hidden Size   5       10       15      20      25      30      35      40      45      50
Training      2269.9  1849.9   2044.2  1959.6  1970.9  1969.8  2087.3  1849.3  1988.3  1753.8
Validation    3248.5  1429.4   2247.6  1984.6  2318.5  4411.4  1758.1  2446.8  5605.9  1897.7
Test          1614.6  2072.1   2621    3384.7  2544.8  2339.7  2604.1  16691   68571   2003.2
Mean          2318.4  1820.2   2161.4  2177.3  2109.3  2392    2115.4  4167.7  12530   1812.9
ε             1.3706  0.86115  1.2101  1.2264  1.1568  1.4459  1.1631  3.2616  11.812  0.85369

Table 4.9: Function12 : Network performance with different hidden sizes based on the Partial Correlation subset selection method.


4.3.2 Type II Task Representation: Function12

Subset Selection

For Function12, with the partial correlation method, the coefficients of correlation are presented in Table 4.10. According to the range for a strong relationship, the selected subset is [X2, X6, X7, X8], and the network performance results are shown in Table