An Evolutionary Method for Training Autoencoders for Deep Learning Networks
Master's Thesis Defense

Sean Lander, Master's Candidate
Advisor: Yi Shang
University of Missouri, Department of Computer Science
University of Missouri, Informatics Institute
Agenda
- Overview
- Background and Related Work
- Methods
- Performance and Testing
- Results
- Conclusion and Future Work
Overview: Deep Learning classification/reconstruction
- Since 2006, Deep Learning Networks (DLNs) have changed the landscape of classification problems
- Strong ability to create and utilize abstract features
- Lends itself easily to GPU and distributed systems
- Does not require labeled data (very important)
- Can be used for feature reduction and classification
Overview: Problem and proposed solution
- Problems with DLNs:
  - Costly to train with large data sets or high feature spaces
  - Local minima are systemic in Artificial Neural Networks
  - Hyper-parameters must be hand selected
- Proposed solutions:
  - Evolutionary approach with a local search phase
    - Increased chance of reaching the global minimum
    - Optimizes structure based on abstracted features
  - Data partitions based on population size (large data only)
    - Reduced training time
    - Reduced chance of overfitting
Background: Perceptrons
- Started with the Perceptron in the 1950s
- Only capable of linear separability
- Failed on XOR
Background: Artificial Neural Networks (ANNs)
- ANNs fell out of favor until the Multilayer Perceptron (MLP) was introduced
  - Pro: non-linear classification
  - Con: time consuming
- Advance in training: backpropagation
  - Increased training speeds
  - Limited to shallow networks
  - Error propagation diminishes as the number of layers increases
Background: Backpropagation using gradient descent
- Proposed in 1988, based on classification error
- Given m training samples:
  {(x^(1), y^(1)), ..., (x^(m), y^(m))}
- For each sample (x^(i), y^(i)), calculate its error:
  J(W,b; x^(i), y^(i)) = (1/2) ||h_{W,b}(x^(i)) - y^(i)||^2
- For all m training samples, the total error can be calculated as:
  J(W,b) = (1/m) * sum_{i=1}^{m} J(W,b; x^(i), y^(i))
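The two error formulas above can be sketched directly in numpy; this is an illustrative translation of the cost function, not code from the thesis:

```python
import numpy as np

def total_cost(X, Y, h):
    """Total backpropagation cost over m training samples.

    X, Y: (m, n) arrays of inputs and targets; h: the hypothesis h_{W,b}.
    Per-sample error is (1/2)||h(x_i) - y_i||^2; the total is their mean.
    """
    m = X.shape[0]
    per_sample = 0.5 * np.sum((h(X) - Y) ** 2, axis=1)  # J(W,b; x_i, y_i)
    return per_sample.sum() / m                          # J(W,b)

# For an autoencoder, the target is the input itself (Y = X)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
identity = lambda A: A
print(total_cost(X, X, identity))  # 0.0 -- perfect reconstruction
```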
Background: Deep Learning Networks (DLNs)
- Allow for deep networks with multiple layers
- Layers are pre-trained using unlabeled data
- Layers are "stacked" and fine-tuned
- Minimizes error degradation for deep neural networks (many layers)
- Remaining issues:
  - Still costly to train
  - Manual selection of hyper-parameters
  - Local, not global, minimum
Background: Autoencoders for reconstruction
- Autoencoders can be used for feature reduction and clustering
- "Classification error" is the ability to reconstruct the sample input
- Abstracted features (the output from the hidden layer) can be used to replace raw input for other techniques
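A minimal single-hidden-layer autoencoder forward pass illustrates the idea; the sigmoid activation and the layer sizes here are illustrative assumptions, not details taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W1, b1, W2, b2):
    """Encode input x to hidden features h, then decode back to x'."""
    h = sigmoid(W1 @ x + b1)      # abstracted features: the hidden layer output
    x_rec = sigmoid(W2 @ h + b2)  # reconstruction of the input
    return h, x_rec

# 13 input features (as in UCI Wine), 4 hidden nodes; weights drawn from N(0, 0.5)
n_in, n_hid = 13, 4
W1 = rng.normal(0, 0.5, (n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.5, (n_in, n_hid)); b2 = np.zeros(n_in)
x = rng.random(n_in)
h, x_rec = autoencoder_forward(x, W1, b1, W2, b2)
recon_error = 0.5 * np.sum((x_rec - x) ** 2)  # the "classification error" above
```

The hidden vector h is what downstream techniques would consume in place of the raw 13-dimensional input.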
Related Work: Evolutionary and genetic ANNs
- First use of Genetic Algorithms (GAs) for ANNs in 1989
  - Two-layer ANN on a small data set
  - Tested multiple types of chromosomal encodings and mutation types
- The late 1990s and early 2000s introduced other techniques
  - Multi-level mutations and mutation priority
  - Addition of local search in each generation
  - Inclusion of hyper-parameters as part of the mutation
  - The issue of competing conventions starts to appear: two ANNs produce the same results by sharing the same nodes but in a permuted order
Related Work: Hyper-parameter selection for DLNs
- The majority of the work explored newer technologies and methods such as GPU and distributed (MapReduce) training
- Improved versions of backpropagation, such as Conjugate Gradient or Limited-Memory BFGS, were tested under different conditions
- Most conclusions pointed toward manual parameter selection via trial and error
Method 1: Evolutionary Autoencoder (EvoAE)
- IDEA: an autoencoder's power is in its feature abstraction, the hidden node output
- Training many AEs will produce more potential abstracted features
- The best AEs will contain the best features
- Joining these features should create a better AE
Method 1: Evolutionary Autoencoder (EvoAE)

[Diagram: the EvoAE cycle. Each population member (A1-A4) is an autoencoder mapping input x through hidden layer h to reconstruction x'. The cycle runs Initialization, Local Search, Crossover, and Mutation to produce the next generation of autoencoders (B1-B3, C2).]
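The Crossover and Mutation steps in the diagram can be sketched as operations on whole hidden nodes, since each hidden node (a row of W1 and the matching column of W2) carries one abstracted feature. This is a hypothetical single-point operator; the thesis's exact crossover and mutation details may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

def crossover(parent_a, parent_b):
    """Exchange hidden nodes between two parent autoencoders.

    Each parent is a dict with 'W1' of shape (h, n) and 'W2' of shape (n, h).
    Swapping rows of W1 together with the matching columns of W2 swaps
    complete abstracted features, sidestepping node-level mismatch.
    """
    h = parent_a["W1"].shape[0]
    cut = rng.integers(1, h)  # single split point over the hidden nodes
    return {
        "W1": np.vstack([parent_a["W1"][:cut], parent_b["W1"][cut:]]),
        "W2": np.hstack([parent_a["W2"][:, :cut], parent_b["W2"][:, cut:]]),
    }

def mutate(ae, rate=0.1, scale=0.5):
    """Perturb each weight with probability `rate` (mutation rate 0.1 above)."""
    for k in ("W1", "W2"):
        mask = rng.random(ae[k].shape) < rate
        ae[k] = ae[k] + mask * rng.normal(0, scale, ae[k].shape)
    return ae
```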
Method 1A: Distributed learning and mini-batches
- Training time of the generic EvoAE increases linearly with the size of the population
- ANN training time increases drastically with data size
- To combat this, mini-batches can be used, where each AE is trained against one batch and then updated
- Batch size << total data size
Method 1A: Distributed learning and mini-batches
- EvoAE lends itself to a distributed system
- Data storage becomes an issue due to data duplication

[Diagram: each batch (Batch 1 ... Batch N) cycles through Train (forward propagation, backpropagation), Rank (calculate error, sort), and GA (crossover, mutate).]
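The batch assignment above can be sketched in a few lines; the helper name and the one-batch-per-member split are illustrative assumptions:

```python
import numpy as np

def make_batches(X, pop_size, seed=0):
    """Assign each population member its own shuffled mini-batch, so the
    batch size is roughly len(X) / pop_size (batch size << total data)."""
    idx = np.random.default_rng(seed).permutation(len(X))
    return [X[part] for part in np.array_split(idx, pop_size)]

X = np.arange(90).reshape(30, 3)
batches = make_batches(X, pop_size=5)
# 5 batches of 6 samples each, together covering all 30 samples
```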
Method 2: EvoAE with Evo-batches
- IDEA: when data is large, small batches can be representative
- Prevents overfitting, as the nodes being trained are almost always introduced to new data
- Scales well with large amounts of data, even when parallel training is not possible
- Works well on limited-memory systems: increasing the population size reduces the data per batch
- Quick training of large populations, equivalent to training a single autoencoder using traditional methods
Method 2: EvoAE with Evo-batches

[Diagram: the original data is split into partitions Data A-D; population members run Local Search, Crossover, and Mutate, with the partition assignments reshuffled between generations so each member is exposed to new data.]
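One plausible reading of the Evo-batch assignment is a simple rotation, so every member eventually trains on every partition; the thesis's exact reassignment rule may differ:

```python
def evo_batch_schedule(pop_size, n_generations):
    """Per generation, the partition index assigned to each population member.

    Rotating by one each generation guarantees each member sees new data
    every generation and all partitions over pop_size generations.
    """
    return [[(member + gen) % pop_size for member in range(pop_size)]
            for gen in range(n_generations)]

schedule = evo_batch_schedule(pop_size=4, n_generations=4)
# generation 0: [0, 1, 2, 3]; generation 1: [1, 2, 3, 0]; ...
```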
Performance and Testing: Hardware and testing parameters
- Lenovo Y500 laptop
- Intel i7 3rd generation, 2.4 GHz
- 12 GB RAM
- All weights randomly initialized to N(0, 0.5)

Per-dataset parameters:
Parameter       | Wine | Iris | Heart Disease | MNIST
Hidden Size     | 32   | 32   | 12            | 200
Hidden Std Dev  | NULL | NULL | NULL          | 80
Hidden +/-      | 16   | 16   | 6             | NULL
Mutation Rate   | 0.1  | 0.1  | 0.1           | 0.1

Parameter defaults:
Parameter       | Value
Learning Rate   | 0.1
Momentum        | 2
Weight Decay    | 0.003
Population Size | 30
Generations     | 50
Epochs/Gen      | 20
Train/Validate  | 80/20
Performance and Testing: Baseline
- The baseline is a single AE with 30 random initializations
- Two learning rates were used to create two baseline measurements:
  - Base learning rate
  - Learning rate * 0.1
Performance and Testing: Data partitioning
- Three data partitioning methods were used:
  - Full data
  - Mini-batch
  - Evo-batch
Performance and Testing: Post-training configurations
- Post-training was run in the following ways:
  - Full data (All)
  - Batch data (Batch)
  - None
- All result sets below use the Evo-batch configuration
Results: Parameters review

Parameter       | Wine | MNIST
Hidden Size     | 32   | 200
Hidden Std Dev  | NULL | 80
Hidden +/-      | 16   | NULL
Mutation Rate   | 0.1  | 0.1

Parameter defaults:
Parameter       | Value
Learning Rate   | 0.1
Momentum        | 2
Weight Decay    | 0.003
Population Size | 30
Generations     | 50
Epochs/Gen      | 20
Train/Validate  | 80/20
Results: Datasets
- UCI Wine dataset
  - 178 samples
  - 13 features
  - 3 classes
- Reduced MNIST dataset
  - 6000/1000 and 24k/6k training/testing samples
  - 784 features
  - 10 classes (0-9)
Results: Small datasets - UCI Wine

[Results chart not reproduced in this transcript.]
Results: Small datasets - UCI Wine
- Best error-to-speed: Baseline 1
- Best overall error: Full data, All
- Full data is fast on small-scale data
- Evo-batch and mini-batch are not good on small-scale data
Results: Small datasets - MNIST 6k/1k

[Results chart not reproduced in this transcript.]
Results: Small datasets - MNIST 6k/1k
- Best error-to-time: Mini-batch, None
- Best overall error: Mini-batch, Batch
- Full data slows exponentially on large-scale data
- Evo-batch and mini-batch stay close to baseline speed
Results: Medium datasets - MNIST 24k/6k

[Results chart not reproduced in this transcript.]
Results: Medium datasets - MNIST 24k/6k
- Best error-to-time: Evo-batch, None
- Best overall error: Evo-batch, Batch or Mini-batch, Batch
- Full data was too slow to run on this dataset
- EvoAE with population 30 trains as quickly as a single baseline AE when using Evo-batch
Conclusions: Good for large problems
- Traditional methods are still the preferred choice for small and toy problems
- EvoAE with Evo-batch produces effective and efficient feature reduction given a large volume of data
- EvoAE is robust against poorly chosen hyper-parameters, specifically the learning rate
Future Work
- Immediate goals:
  - Transition to a distributed system, MapReduce-based or otherwise
  - Harness GPU technology for increased speeds (~50% in some cases)
- Long-term goals:
  - Open the system for use by novices and non-programmers
  - Make the system easy to use and transparent to the user for both modification and training purposes
Thank you
Background: Backpropagation with weight decay
- The cost is prone to overfitting, so a weight decay term with coefficient lambda is added:
  J(W,b) = (1/m) * sum_{i=1}^{m} J(W,b; x^(i), y^(i)) + (lambda/2) * sum W^2
- We use this new cost to update the weights and biases given some learning rate alpha:
  W := W - alpha * dJ(W,b)/dW
  b := b - alpha * dJ(W,b)/db
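A one-line sketch of the weight-decay update, using the defaults from the testing tables (alpha = 0.1, lambda = 0.003); the function name is illustrative:

```python
import numpy as np

def decay_step(W, grad_W, alpha=0.1, lam=0.003):
    """One gradient step on the weight-decay cost:
    dJ/dW = (backprop gradient) + lam * W, then W := W - alpha * dJ/dW."""
    return W - alpha * (grad_W + lam * W)

W = np.full((2, 2), 1.0)
W_new = decay_step(W, grad_W=np.zeros((2, 2)))
# with zero error gradient, decay alone shrinks each weight:
# 1 - 0.1 * 0.003 = 0.9997
```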
Background: Conjugate Gradient Descent
- Plain gradient descent can become stuck in a loop, however, so we add a momentum term beta:
  Delta W(t) = -alpha * dJ(W,b)/dW + beta * Delta W(t-1)
  W := W + Delta W(t)
- This adds memory to the equation, as we reuse previous updates
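The momentum update above, sketched in numpy; beta = 0.9 is an illustrative value, not one taken from the thesis:

```python
import numpy as np

def momentum_step(W, grad_W, prev_delta, alpha=0.1, beta=0.9):
    """Update with memory of the previous step:
    delta(t) = -alpha * dJ/dW + beta * delta(t-1); W := W + delta(t)."""
    delta = -alpha * grad_W + beta * prev_delta
    return W + delta, delta

W = np.zeros(3)
delta = np.zeros(3)
g = np.ones(3)
W, delta = momentum_step(W, g, delta)  # delta = -0.1
W, delta = momentum_step(W, g, delta)  # delta = -0.1 + 0.9 * (-0.1) = -0.19
```

Because beta scales the previous update, repeated steps in the same direction accumulate speed, which is what carries the search out of shallow loops.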
Background: Architecture and hyper-parameters
- Architecture and hyper-parameter selection is usually done through trial and error
- Manually optimized and updated by hand
- Dynamic learning rates can be implemented to correct for sub-optimal learning rate selection
Results: Small datasets - UCI Iris
- The UCI Iris dataset has 150 samples with 4 features and 3 classes
- Best error-to-speed: Baseline 1
- Best overall error: Full data, None
Results: Small datasets - UCI Heart Disease
- The UCI Heart Disease dataset has 297 samples with 13 features and 5 classes
- Best error-to-time: Baseline 1
- Best overall error: Full data, None