A generic parallel computational framework of lifting wavelet transform for online engineering surface filtration

Yuanping Xu a
Chaolong Zhang a,b,⁎ ([email protected])
Zhijie Xu b
Jiliu Zhou c
Kaiwei Wang d
Jian Huang a

a School of Software Engineering, Chengdu University of Information Technology, No.24 Block 1, Xuefu Road, Chengdu 610225, People's Republic of China
b School of Computing & Engineering, University of Huddersfield, Queensgate HD1 3DH, Huddersfield, UK
c School of Computer Science, Chengdu University of Information Technology, No.24 Block 1, Xuefu Road, Chengdu 610225, People's Republic of China
d School of Optical Science and Engineering, Zhejiang University, Hangzhou 310027, People's Republic of China
⁎ Corresponding author at: School of Software Engineering, Chengdu University of Information Technology, No.24 Block 1, Xuefu Road, Chengdu 610225, People's Republic of China.

Abstract

Nowadays, complex geometrical surface texture measurement and evaluation require advanced filtration techniques. The discrete wavelet transform (DWT), especially the second-generation wavelet (Lifting Wavelet Transform, LWT), is the most widely adopted one due to its unified and versatile characteristics in measured data processing, geometrical feature extraction, manufacturing process planning, and production monitoring. However, when dealing with varied complex functional surfaces, the computational payload for performing DWT in real time often becomes a core bottleneck in the context of massive measured data and limited computational capacities. The problem is even more prominent for areal surface texture filtration using 2D DWT. To address this issue, this paper presents a generic parallel computational framework for the lifting wavelet transform (GPCF-LWT) based on Graphics Processing Unit (GPU) clusters and the Compute Unified Device Architecture (CUDA). Owing to its cost-effective hardware design and powerful parallel computing capacity, the proposed framework can support online (or near real-time) engineering surface filtration for micro- and nano-scale surface metrology through a novel parallel method named the LBB model, an improved lifting scheme algorithm, and three implementation optimizations on heterogeneous multi-GPU systems. The innovative approach enables optimizations on individual GPU nodes through an overarching framework that is capable of data-oriented dynamic load balancing (DLB) driven by a fuzzy neural network (FNN). The paper concludes with a case study on filtering and extracting manufactured surface topographical characteristics from real surfaces. The experimental results have demonstrated substantial improvements of the GPCF-LWT implementation in terms of computational efficiency, operational robustness, and task generalization.

Keywords: Lifting wavelet; LBB; Surface texture measurement; Multi-GPU; CUDA; FNN

1 Introduction

Engineering workpieces are designed, manufactured and measured to meet tolerance specifications of various geometrical characteristics (e.g., surface texture, size, angle, radius). After a workpiece has been manufactured, it is necessary to verify the derived characterization (tolerance) parameters to ensure it performs its designed functions. Though the designed functions of a workpiece are the comprehensive effect of several characterization parameters, surface texture is often the dominant one. Roughness, waviness and form are the three main characteristics of functional surfaces (e.g., roughness may affect the lifetime, efficiency and fuel consumption of a part) [1,2]. Obtaining the roughness, waviness and form frequency bands relevant to a surface is normally accomplished by using filtration techniques to extract a surface texture signal. In principle, filtration must be carried out before


suitable numerical surface texture parameters can be identified. This means that geometrical characteristic information relating closely to functional surfaces can be extracted precisely from measured datasets by using appropriate filters. This ensures the potential of establishing functional correlations between tolerance specification and verification [1–3].

The rapid progress of filtration techniques has presented metrologists with a set of advanced tools, e.g. Gaussian filters, morphological filters and wavelet filters [4–7]. Gaussian filtering is commonly used in surface texture measurements and has been recommended by the ISO 16610 and ASME B46 standards for establishing a reference surface [8–10]. However, using a Gaussian filter to extract surface topographical characteristics must be based on the presupposition that a raw surface texture signal is a combination of a series of harmonic components [10]. Unfortunately, a real engineering surface texture measurement contains not only regular signals but also nonperiodic and random signals, e.g. deep valleys, high peaks and scratches generated during various manufacturing processes, so that extraction, filtration and analysis of these signals are indispensable for monitoring how process information (e.g. surface texture parameters) relates to variability of the manufacturing processes [2]. Thus, straightforward applications of Gaussian filters struggle to satisfy the challenging requirements of surface texture measurement and analysis. In contrast, the discrete wavelet transform (DWT) has shown great potential in handling these complex issues efficiently. DWT has a great diversity of mother wavelets (e.g. Haar, Daubechies, CDF biorthogonal, Coiflet) that can be applied according to different application purposes. Moreover, DWT has advantages in multi-resolution analysis and lower computational complexity.

Recently, surface texture measurement has entered the era of miniaturization, e.g. various explorations on micro- or nano-machining and micro-electromechanical systems (MEMS) manufacturing [1]. Hence the increasing demands for efficient handling of big (measurement) data, processing accuracy, and real-time performance have been treated as paramount. As the status quo, the computational performance of DWT is a core bottleneck for handling high throughput in an online (or near real-time) manner [11], especially for two-dimensional DWT (2D DWT) in areal surface texture measurement [12].

To overcome this bottleneck, pure algorithmic acceleration solutions have been hotly investigated in the last decade. Sweldens proposed a new DWT model, known as the lifting wavelet transform (LWT) or second-generation wavelet, which improved computational performance and optimized memory consumption compared to the first-generation wavelet, i.e. the Mallat algorithm [13]. Although promising, the lifting scheme still delivers limited performance in accelerating DWT computations for online applications when facing large-scale data. On that basis, hardware acceleration has become a research focus, and Graphics Processing Units (GPUs) enable such solutions and have become the standard accessory for computation-intensive applications. Unfortunately, the current lifting scheme struggles on a single GPU to meet the demands of online processing, especially when dealing with the large-scale and dynamic datasets that define the high precision requirements of surface texture metrology.

To tackle these challenges, this paper presents a generic parallel computational framework for LWT (GPCF-LWT) based on off-the-shelf consumer-grade CUDA multi-GPU systems. The proposed GPCF-LWT can accelerate computation-intensive wavelets, so that it can support online engineering surface filtration for high precision surface metrology through an innovative computational model (named the LBB model) for improving the usage of GPU shared memory and an improved lifting scheme for reducing the inherent synchronizations in the classic lifting scheme. GPCF-LWT also contains three implementation optimizations based on the hardware advancements of an individual GPU node and a novel dynamic load balancing (DLB) model using an improved fuzzy neural network (FNN) for heterogeneous multi-GPUs. Therefore, it supports intelligent wavelet transform selections in accordance with different surface characterization requirements and measurement conditions.

The rest of this paper is organized as follows. Section 2 gives a brief review of the previous related works, which leads to further clarification of the contributions of this research. An in-depth analysis of the feasibility of algorithm improvements for adapting to parallel computation is discussed in Section 3, following a brief introduction of the lifting scheme based second-generation wavelet transform. Section 4 presents the GPCF-LWT framework and its three key components. The LBB model for GPCF-LWT has been explored based on a detailed evaluation of the three classic parallel computation models for lifting scheme implementations, to solve the problem of GPU shared memory shortage during LWT computation. To increase the parallelizability of LWT, an improved lifting scheme for reducing the inherent synchronizations has been defined, and to further accelerate the LWT computation, three complementary implementation optimizations in accordance with the latest hardware advancements of GPU cards have been discussed. Moreover, to maximize the parallelization of GPCF-LWT on a heterogeneous multi-GPU platform, Section 5 introduces a novel dynamic load balancing (DLB) model, in which a more effective FNN has been designed based on the previous work of the authors. A case study is demonstrated and evaluated in Section 6 to test and verify the research ideas. Finally, Section 7 concludes this paper and summarizes the future works.

2 Previous related works

Modern GPUs are not only powerful graphics engines, but also highly parallel arithmetic and programmable processors. A GPU is a SIMD (Single Instruction Multiple Data) parallel device containing several streaming multiprocessors (SMs), and each SM has multiple GPU cores (also named CUDA cores), i.e., stream processors (SPs), which are the computational cores for processing multiple threads simultaneously (each GPU core manages a thread). Besides visualization and rendering, other more general-purpose computations using GPUs (so-called GPGPU) have also prospered since 2002, when consumer-grade graphics cards became truly "programmable", ranging from numeric computational operations [14–16] to computer aided design, tolerance and measurement areas. More specifically, GPGPU has been proposed to solve numerous compute-intensive problems which cannot be solved by using the CPU based serial processing mode, e.g., McCool introduced traditional GPU hardware and GPU programming models for signal processing and summarized some classic signal processing applications based on GPUs [17]. Su et al.


proposed a GPGPU framework to accelerate 2D Gaussian filtering, and it achieved about 4.8 times speedup without reducing the filtering quality [18]. However, traditional GPGPU programming must be coded with graphical programming interfaces such as OpenGL and DirectX, so heavy manual work was involved in the whole procedure of developing GPGPU programs, e.g., texture mappings, manual and static threading, memory management, and shading program execution. Thus, this obsolete GPGPU programming procedure is very complicated, and it also limited the parallel computing ability of GPU cards.

To alleviate these application difficulties, in 2007 NVIDIA introduced CUDA, a programming framework for general-purpose computation that provides a scalable and integrated programming model for allocating and organizing processing threads and mapping them onto SMs and CUDA cores, with dynamic adaptation ability for all mainstream GPU architectures [19]. In addition, CUDA maintains a sophisticated memory hierarchy (e.g. registers, local memory, shared memory, global memory and constant memory) in which different memories can be applied according to particular applications and optimization circumstances, and CUDA GPU cards have been optimized with higher memory bandwidth and fast on-chip performance [19–21]. Better yet, it embeds a series of APIs to program directly on a GPU instead of transferring application codes to various obscure graphics APIs and shading languages manually. In CUDA programs, any function running on a GPU card is called a "kernel", and launching a kernel will generate multiple threads running on the GPU [19]. CUDA threads are organized into a hierarchical structure, i.e., multiple threads compose a block, and multiple blocks compose a grid such as a 2D matrix (see Fig. 11) [22]. When a kernel runs on a GPU, SMs are in charge of creating, managing, scheduling and executing CUDA threads in groups (called warps), and each group handles 32 parallel threads. However, the memory capacity and parallel computing ability of a single GPU are still too limited to integrally process a dataset with tens of gigabytes (GB) in size, which is the normal size required for current large-scale scientific applications [23]. For example, a measured dataset extracted from the micro-scale surface area of a ceramic femoral head articulate is 49,152 × 49,152 in size (i.e. 12 GB of data in total with single precision), whereas a single modern GPU GeForce GTX 1080 selected in this study has only 8 GB of GPU memory capacity, so this 12 GB dataset cannot be loaded and processed at the same time. Although the more expensive high-end GPUs such as the NVIDIA TITAN Xp (12 GB memory) and NVIDIA TITAN RTX (24 GB memory) can store larger datasets, they will still struggle with real-life measurement datasets often coming in at hundreds of GB in size. Moreover, GPU cards have fixed numbers of CUDA cores (e.g., 2560, 3840 and 4608 for the GTX 1080, TITAN Xp and TITAN RTX, respectively), and each one can execute a hardware thread, so the threads are also fixed to support a certain number of concurrent computations. Consequently, a single GPU must divide and process massive smaller chunks of a dataset iteratively, which may greatly increase the latency of intensive computations. In conclusion, scientific computations with big data, especially for applications of online and high precision surface texture measurement, demand distributed computations on multi-GPUs. With complementary optimization strategies, multi-GPUs can seamlessly work together to achieve remarkable computational performance, but the multi-GPU architecture is more complicated than the single GPU, and it brings time consuming and error prone processes of workload partitioning and allocation, the corresponding synchronizations and communications of divided data chunks, recombination of the processed data chunks, and data transfers across PCI-E (Peripheral Component Interconnect Express) buses. It is worth mentioning that the NVLink connections introduced recently have approximately 10 times higher bandwidth than PCI-E [24]. However, NVLink supports high-end GPUs only (e.g., TITAN RTX and TESLA P100), and it is normally equipped only in supercomputers.

Historically, although plentiful and concrete research on wavelet-based filtering theories arose in the last decades, their practical implementations on computers achieved limited success due to the key challenge of the computational cost of wavelet transforms. In general, the intensive computation of DWT through its inherent multi-level data decomposition and reconstruction drastically reduces its performance for online applications when facing large data sizes. With the rapid development of GPU parallelism, several single GPU acceleration solutions have been explored as responses to this challenge. For example, Hopf et al. accelerated 2D DWT on a single GPU card by using OpenGL APIs in 2002, which marked the beginning of GPU based DWT acceleration [25]. This early practice adopted the traditional "one-step" method rather than the separable two steps of 1D DWT, so that the programming model was not flexible or generalizable at all, i.e., each kind of wavelet transform must have a custom implementation. In 2007, Wong et al. optimized Hopf's work by using a shading language (named "JasPer"), and JasPer used shaders to handle multiple data streams in pipelines and had been integrated in the JPEG2000 codec [26]. Notably, JasPer decomposes 2D DWT into two steps of 1D DWT, such that it supports dozens of wavelet types and boundary extension methods in the convolution stage. It also unified the calculation methods of forward and inverse 2D DWT. In Wong's work, due to the limitations on GPU programmability and hardware structure at the time, the convolution, down-sampling and up-sampling operations were executed in sequence on the fragment processors of an early GPU, and each texture coordinate in texture mapping had to be pre-defined in separate CPU programs. These solutions require manual mapping operations on GPU hardware, and task partitioning that decides which part of the work should be conducted on a CPU and which part is fit for a GPU was indispensable in these practices prior to the emergence of unified GPU languages such as CUDA.

Recently, there has been a growing interest in CUDA based GPGPU computations to parallelize the DWT for scientific and big data applications. Franco et al. parallelized a 2D DWT on the CUDA enabled GPU NVIDIA Tesla C870 for processing images and achieved a satisfactory speedup of 26.81 times over an OpenMP method with the fastest CPU setting at the time [27]. Song et al. accelerated the CDF(9,7) based SPIHT (set partitioning in hierarchical trees) algorithm for decoding real-time satellite images on the NVIDIA Tesla C1060, which achieved a speedup of about 83 times compared to a single thread CPU implementation [28]. Later, Zhu et al. improved 2D DWT computation on a GPU and verified it with a case study of ring artifact removal, which achieved a significant performance increase [29]. Compared with previous CUDA enabled GPU implementations of DWT that only parallelized a very specific type of wavelet under a specific circumstance, Zhu's work supports different types of wavelets for adapting to different application conditions. In addition, the latest constant memory technique has been applied in Zhu's work to optimize the overall GPU memory access. However, shared memory access and the CUDA streams based latency hiding mechanism were ignored for further performance optimization in this research [19]. More importantly, the first-generation wavelet theory and its corresponding algorithmic operations applied in the convolution scheme of these researches increased the computational complexity and imposed a great deal of memory and


computational time consumption, e.g., researchers needed to consider the troublesome issue of boundary destruction inherent in the Fourier transform [10]. The second-generation wavelet theory adopts the lifting scheme for wavelet decomposition and reconstruction, which preserves all application characteristics of the first-generation wavelet transform while bringing higher performance and lower memory demand due to its in-place algorithm property [13]. These previous CUDA acceleration studies chose the convolution scheme rather than the lifting scheme because the execution order of the lifting scheme has heavy data dependencies, such that it cannot be fully parallelized. To explore CUDA based LWT implementation, in 2011 Wladimir et al. accelerated the LWT algorithm with three-level decompositions and reconstructions of a 1920 × 1080 image by using CUDA version 2.1 on an NVIDIA GeForce 8800 GTX 768 MB, and this practice achieved a speedup of 10 times compared with the corresponding CPU (AMD Athlon 64 X2 Dual Core Processor 5200+) implementation [4]. Later, Castelar et al. optimized the (5,3) biorthogonal wavelet transform by using CUDA in an application of the parallel decompression of seismic datasets [5]. Unfortunately, further parallel optimization of LWT in these studies has not been considered.

Previous works focus on parallelizing data processing using a single GPU, which has produced a moderate performance gain across the board. However, due to the limitation of data storage format and memory space, as well as the fixed number of CUDA cores available on a single GPU die, previous works still struggle to fulfill the online data processing requirements of many large-scale computational applications, which is especially problematic for processing large measured surface texture data engaging complex filtrations with various tolerance parameters [1,30]. In comparison, multi-GPU based acceleration solutions can be flexible and extensible for achieving higher performance with relatively low hardware costs. Numerous computation-intensive issues that cannot be resolved by using a single GPU model have been making steady progress in the context of multi-GPUs. For instance, FFT (Fast Fourier Transform) based multi-GPU solutions have become the norm in current online large-scale data applications [31,32]. In the meantime, several multi-GPU programming libraries (e.g. MGPU) [33] and multi-GPU based MapReduce libraries (e.g. GPMR and HadoopCL) [34,35] have been developed by researchers. Moreover, deep learning, which is very popular in the current decade, has also benefited from the fast development of multi-GPU techniques [36,37]. These pilot multi-GPU studies are based on the assumption that all GPU nodes equipped in a multi-GPU platform have equal computational capacity. In addition, the task-based load balancing schedulers that these approaches rely upon fall short of supporting applications with huge data throughput but limited processing function(s), since there are very few "tasks" to schedule. To maximize the data parallelism in a heterogeneous multi-GPU system, this study has also devised an innovative data-oriented dynamic load balancing (DLB) model based on an improved FNN for rapid measured data division and allocation on heterogeneous GPU nodes [38].

In summary, combined acceleration optimizations that seamlessly merge the parallelized architecture of multi-GPUs with improvements to LWT and the software routines (e.g., the algorithm optimization of the lifting scheme, comprehensive memory usage optimization based on the advanced hierarchical architecture of CUDA memory, and the latency hiding mechanism based on CUDA multi-streams) have been largely ignored. To address these shortcomings, this study explores and implements a generic parallel computational framework for the lifting wavelet transform (GPCF-LWT) based on a heterogeneous CUDA multi-GPU system. It further optimizes the acceleration of 2D LWT through the improved processing model of the lifting scheme (LBB and the improved lifting scheme algorithm), three devised implementation optimization strategies, and the improved DLB model on a heterogeneous multi-GPU system. Moreover, GPCF-LWT can support all kinds of wavelet transforms, thereby supporting intelligent wavelet transform selections in accordance with different surface texture characterization requirements and measurement strategies.

3 The second-generation wavelet

The fundamental idea behind wavelet analysis is to simplify complex frequency domain analyses into simpler scalar operations. The first-generation wavelet applies the fast wavelet transform to enable decomposition (forward DWT) and reconstruction (inverse DWT) in the form of two-channel filter banks, the so-called Mallat convolution scheme [39]. Mallat wavelet decomposition and reconstruction convolve the input and output data with the corresponding decomposition and reconstruction filter banks respectively, so it has become a popular choice for many engineering applications in recent years. However, the Mallat algorithm has a certain degree of computational complexity (e.g. it has inherent boundary destruction when executing the FFT). Thus, Sweldens proposed a new implementation of DWT, i.e. the lifting scheme or the second-generation wavelet.

The second-generation wavelet, which uses the lifting scheme for wavelet decomposition and reconstruction, preserves all properties of the first-generation wavelet but brings higher computational performance and lower memory demand because of its in-place algorithm attribute [13]. The lifting scheme based 1D forward (i.e. decomposition) DWT (abbreviated as forward LWT or FLWT) contains four operation steps: split, predict, update and scale [4,13,40]:

1) Split: This step splits the raw signal into two subsets of coefficients, i.e. even and odd; the former corresponds to the even index values while the latter corresponds to the odd index values. The split method is expressed in Eq. (1), and it is also called the "lazy wavelet".

2) Predict: For the coefficients of even and odd, the odd coefficients can be predicted from the even ones by using the prediction operator PO, and the old odd values are then replaced by the predicted result as the next new odd coefficients, recursively; this step


can be expressed as Eq. (2). The odd is the same as the detail coefficients D in the Mallat algorithm.

3) Update: Similarly, even can be updated by using the update operator UO, and the old even values are then replaced by the updated result as the next new even coefficients, recursively; this step can be expressed as Eq. (3). The even is the same as A in the Mallat algorithm.

4) Scale: Normalize the even and odd coefficients with the factor K respectively by using Eq. (4) to get the final approximation coefficients (even, App) and the detail coefficients (odd, Det).

All operators are completed in the in-place computational mode based on these four steps. Fig. 1 illustrates the split operator in in-place mode; in this case, the raw data should be split into even and odd subsets which are organized and stored in the first half and the second half of the memory space respectively, i.e. the same space used for raw data storage. Similarly, the prediction, update and scale results are stored in their original memory spaces rather than allocating new memories, such that the lifting scheme requires less memory than the classic Mallat algorithm.

Fig. 1 The computation of split operation with in-place mode.

Furthermore, the 1D inverse lifting scheme (i.e. reconstruction) DWT (abbreviated as inverse LWT or ILWT) just inverts the computational sequence of the operation steps of FLWT and switches the corresponding plus and subtraction operators:

Scale:

Update:

Predict:

Merge: This step is also called the "inverse lazy wavelet".
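To make the four forward steps and their inverse concrete, the following is a minimal host-side C sketch for the Haar wavelet (an illustration only; the prediction/update operators and the scale factor K follow the standard Haar lifting convention, which may differ in sign or normalization from the paper's Eqs. (1)–(8)):

#include <math.h>
#include <stdlib.h>

/* Single-level 1D forward Haar LWT by lifting, computed in place.
   On return, x[0..n/2-1] holds the approximation (App) coefficients and
   x[n/2..n-1] the detail (Det) coefficients; n must be even. */
void haar_flwt_1d(float *x, int n)
{
    int half = n / 2;
    const float K = sqrtf(2.0f);

    /* 1) Split (lazy wavelet): even samples to the first half of the buffer,
          odd samples to the second half (a temporary is used for brevity). */
    float *tmp = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < half; ++i) { tmp[i] = x[2 * i]; tmp[half + i] = x[2 * i + 1]; }
    for (int i = 0; i < n; ++i) x[i] = tmp[i];
    free(tmp);

    float *even = x, *odd = x + half;
    /* 2) Predict: replace odd by the prediction residual d = odd - PO(even). */
    for (int i = 0; i < half; ++i) odd[i] -= even[i];
    /* 3) Update: replace even by s = even + UO(d). */
    for (int i = 0; i < half; ++i) even[i] += 0.5f * odd[i];
    /* 4) Scale: normalize both bands with the factor K. */
    for (int i = 0; i < half; ++i) { even[i] *= K; odd[i] /= K; }
}
/* The inverse transform (ILWT) runs these steps in reverse order with the plus
   and minus (and multiply/divide) operators switched, ending with a merge. */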

The lifting scheme can dynamically update its configuration parameters (i.e. iteration number and wavelet filter coefficients) in each computational step of DWT based on factorizations of the polyphase matrixes of the wavelet filter coefficients, such that the generalization of this framework can be guaranteed to realize the parallel computations for all kinds of wavelet transforms dynamically. Generally, the factorization of the polyphase matrix P(z) for all wavelets can be described as Eq. (9), where the coefficients of ui(z) are the update coefficients in the update step, the coefficients of pi(z) are the prediction coefficients in the prediction step in Figs. 2 and 3, and K is the scale factor. In conclusion, the computational results



of both forward and inverse LWT for an arbitrary wavelet can be obtained by applying several steps of prediction and update operations and imposing the final normalization with K on the outputs of the factorization of P(z). Figs. 2 and 3 illustrate the main computational processes of single-level 1D forward and inverse LWT respectively.

Fig. 2 The main computational procedure of single-level 1D forward LWT.

Fig. 3 The main computational procedure of single-level 1D inverse LWT.

4 CUDA based GPCF-LWT implementation

Based on the fundamental LWT concepts summarized in Section 3, this study proposes GPCF-LWT by using CUDA enabled GPUs for online and high precision surface measurement applications. This study assumes that the factorizations of the polyphase matrixes of the wavelet filter coefficients have been prepared before applying the GPCF-LWT, so that we have a series of update coefficients U = {u0, u1, …, un}, a series of prediction coefficients P = {p0, p1, …, pn}, and the scale factor K.

The traditional parallel computational platforms (e.g. OpenMP, MPI and FPGA) have empowered three classic LWT parallel computational models, namely the RC (Row-Column), LB (Line-Based) and BB (Block-Based) models [41]. Generally speaking, the LB model has the best instruction throughput, followed by the BB and RC models [4]. Unfortunately, except for the shortest wavelets (e.g., the Haar wavelet), both the LB and BB models have failed to be transplanted onto a single CUDA GPU due to the space limitation of current shared memory, and the generalization of LWT (i.e., supporting all kinds of wavelet transforms) also cannot be satisfied. Although the RC model is the only feasible CUDA GPU transplantation that supports the generalization of LWT, the processing time of the CUDA based RC model is approximately 4 times slower than the Mallat algorithm solution running on the same GPU card [42]. Thus, based on an in-depth study of these three models, this study devised a generic LWT parallel computational model for the CUDA enabled GPU architecture.

4.1 The improved parallel computational model

• The LB model. On traditional parallel computational platforms (e.g., OpenMP, MPI and FPGA), the overall framework of LB based LWT is depicted in Fig. 4. The LB model divides a raw dataset (e.g., 1024 × 1024 in size) into N (N equals the number of computational nodes) smaller subsets (e.g., if N is 4, each subset has 256 rows and 1024 columns), and each subset will be allocated to a computational node where LB performs the following operations:

1) Launching m threads, where each one performs horizontal 1D LWT for a row of the target subset concurrently (i.e., each thread processes a row of the target subset), and then LB caches the temporary results in the local memory (e.g., a RAM, random access memory);

2) Once the m rows are completely processed, the m threads perform vertical 1D LWT on m columns of the temporary results concurrently and repeatedly until all columns of the cached temporary results have been processed. In this step, the number of iterations is (width + m − 1)/m, and each iteration processes m columns.

3) Steps 1 and 2 form an integrated iteration that will be repeated until all rows and columns of the target subset are processed.



The last step of LB based LWT is that one node (usually called the master node) merges the results of the N subsets and then obtains the final result. The minimum value of m is determined by the particular type of wavelet, e.g., 3 for Haar and 6 for Deslauriers-Dubuc (13,7). Generally, m can be unified as a fixed value (e.g., 10 or 20) larger than all the wavelet types need. Taking m = 10 for a float dataset having 1024 columns as an example, each computational node requires 40 KB of local memory space, so the LB model can be easily implemented on OpenMP, MPI and FPGA since they have large RAM spaces. However, a CUDA block has only 16 KB of local (shared) memory space, which cannot satisfy the memory demand of the LB model except for the shortest wavelets (e.g., the Haar wavelet requires only 12 KB of shared memory space). In addition, columns of a dataset cannot be divided by LB, so it is a coarse-grained division approach, which cannot operate well with the SIMD model of CUDA.

Fig. 4 The overall framework of LB model.

• The BB model. Unlike LB, the BB model divides a raw dataset into N subsets (data blocks) along both the row and column directions, see Fig. 5. Each subset is allocated to a computational node where an RC routine performs horizontal 1D LWT for all rows concurrently at first, and then vertical 1D LWT for all columns is computed concurrently. Like the LB model, the master node merges all results of the subsets in the last step of the BB model. The BB model is also a coarse-grained division approach, and it also requires a great deal of local memory space, e.g., if a raw float dataset of 1024 × 1024 in size is divided into four subsets (each subset has 512 rows and 512 columns), each computational node requires 1024 KB of local memory space, which considerably exceeds the size of the shared memory space of a CUDA GPU, such that classic RC model based CUDA solutions must use a GPU global memory space as the temporary cache, and this has become the key bottleneck for data access.

Fig. 5 The overall framework of BB model.

• The innovative LBB model for CUDA LWT. To solve the problem of shared memory shortage on CUDA GPUs, this study devised an innovative parallel computational model for 2D LWT, named LBB, by combining both LB and BB. The overall framework of the LBB model is shown in Fig. 6. The LBB model divides a raw dataset into several subsets along the column direction, and each subset can be assigned to a CUDA thread block (e.g., 8 rows and 32 columns in size) where the in-built threads perform the following operations:

1) Each group of concurrent row threads loads a row of the dataset into the shared memory of a GPU in parallel. Accesses to a row of data can be coalesced since all threads in a block access continuous relative addresses [19], so it is very efficient to load a dataset into a shared memory space.

2) Each group of concurrent row threads in a block performs horizontal 1D LWT on a data row in parallel, and then the GPU caches the temporary results in its shared memory space.

3) m groups of row threads obtain m rows of temporary results in parallel by executing steps 1 and 2 (m is a preset value, and it is 8 here).

4) Each group of concurrent column threads in a block performs vertical 1D LWT on a data column of the cached temporary results in parallel, and width/n groups of column threads process width/n columns of the cached temporary results in parallel (width/n is the width of a subset).

5) The corresponding column thread groups save the results into a GPU global memory space and delete the m rows of temporary results in the shared memory space obtained in step 3.

6) Steps 1–5 form a complete iteration that will be repeated until all rows and columns of the dataset are processed.

The two differences of the data division methods for BB and LBB are: 1) in order to adapt to the advanced architecture of CUDA threads, the height of a subset in LBB is larger than in BB whereas the width for LBB is smaller, and the number of subsets of LBB is far larger than in BB, as the number of thread blocks is far larger than the number of computational nodes in those traditional parallel platforms; 2) steps 1–6 of LBB also integrate the LB idea with one major update: instead of providing a thread for a row, LBB constructs the thread block structure based on numerous CUDA threads (each thread block organizes m groups of concurrent row threads, and each group of row threads has width/n CUDA threads, e.g. 32 CUDA threads in each row), such that a thread block can perform both horizontal 1D LWT on multiple rows by using groups of row threads and vertical 1D LWT on multiple columns by using groups of column threads in parallel.

Although LBB divides a large dataset into several smaller subsets, the problem of shared memory shortage is still notable, e.g., it requires 64 KB of shared memory space if a subset has size 128 × 128. Thus, a sliding window structure has been designed in LBB, see Fig. 6. In LBB, the size of a sliding window is equal to a constructed thread block (e.g. 8 × 32 for a 128 × 128 subset), i.e., a sliding window is built on a thread block, and each subset has a sliding window. Once a sliding window is full (e.g., there are 8 rows of temporary results cached in a shared memory space), column-based vertical 1D LWT will be carried out (i.e. steps 4 and 5), and then LBB moves the sliding window down the target subset to compute and cache the remaining data points (or pixels) concurrently by moving the thread block to process the next 8 rows (starting from step 1); a schematic CUDA sketch of this sliding-window organization is given at the end of this subsection.

Fig. 6 The overall framework of LBB model.

• Performance analysis. Taking the CDF(9,7) wavelet as an example to evaluate the CGMA (Compute to Global Memory Access, i.e., an index estimating the ratio of instruction throughput to global memory accesses for performance analysis [22]) of LBB, performing 2D DWT on a data point requires 18 computational operations, including 16 multiplications and 2 additions, and 4 global memory accesses (2 reads and 2 writes), assuming that the size of a sliding window is 8 × 32 and each thread processes a data point in parallel, such that the CGMA of LBB can be evaluated as in Eq. (10). Compared with the RC model (whose CGMA is 0.9), the CGMA of LBB is increased by about 10 times.
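The thread-block/sliding-window organization of LBB can be pictured with the following schematic CUDA sketch (an illustration only, with hypothetical names; it assumes an 8 × 32 window, a block of dim3(32, 8) threads, one data point per thread, and data dimensions that are multiples of the window size; the row and column lifting passes are abbreviated):

#define WIN_ROWS 8   // m: rows cached per sliding window
#define WIN_COLS 32  // width/n: columns owned by one thread block

// One thread block owns one column strip of the dataset; its sliding window
// moves down the strip WIN_ROWS rows at a time.
__global__ void lbb_window_sketch(const float *in, float *out, int height, int width)
{
    __shared__ float tile[WIN_ROWS][WIN_COLS];
    int col0 = blockIdx.x * WIN_COLS;               // left edge of this strip

    for (int row0 = 0; row0 < height; row0 += WIN_ROWS) {
        // Step 1: each group of row threads loads one row of the window;
        // threads of a warp touch consecutive addresses, so the loads coalesce.
        tile[threadIdx.y][threadIdx.x] = in[(row0 + threadIdx.y) * width + col0 + threadIdx.x];
        __syncthreads();

        // Steps 2-3: horizontal 1D LWT on every cached row (abbreviated), e.g.
        // row_lifting(tile[threadIdx.y], WIN_COLS);
        __syncthreads();

        // Step 4: vertical 1D LWT on every cached column (abbreviated), e.g.
        // col_lifting(tile, threadIdx.x, WIN_ROWS);
        __syncthreads();

        // Step 5: write the window back to global memory and reuse the tile.
        out[(row0 + threadIdx.y) * width + col0 + threadIdx.x] = tile[threadIdx.y][threadIdx.x];
        __syncthreads();
    }
}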


4.2 The improved lifting scheme

In a GPU routine, the LWT requires many synchronizations, as every implementation procedure of the lifting scheme has different functions computing in sequence, and this decreases the parallelizability of LWT computations. To solve this problem, this study devised an improved approach to decrease the inherent synchronizations of the lifting scheme. The factorization of the polyphase matrix P(z) (see Eq. (9)) can be converted into the form of Eq. (11), where n = m if u0(z) = 0 and n = m − 1 if u0(z) ≠ 0. According to Eq. (11), the optimized operation steps of the lifting scheme based 1D forward LWT are as follows:



(1) Split;

(2) Preprocessing;

(3) Lifting (predict and update): for i = 1, …, n;

(4) Scale.

Step 3 is performed iteratively. However, synchronization operations still exist in step 3, since the computation of si can only begin after the di computation because si relies on di. Thus, Eq. (14) should be modified so that si no longer relies on di. It can be seen from the modified form, Eq. (16), that si and di only rely on si−1 and di−1 respectively, so that the corresponding computational operations can be performed concurrently to cut the synchronizations required by the lifting scheme in half. The main computational procedure of the improved single-level 1D forward LWT is demonstrated in Fig. 7, which is optimized from Fig. 2. The inverse LWT has a similar optimization approach.

Fig. 7 The improved lifting scheme for single-level 1D forward LWT.

4.3 The construction of the single GPU-based GPCF-LWT

Before realizing parallel computations of the single-level 1D LWT by using the GPCF-LWT, a series of prediction coefficients P = {p0, p1, …, pn} and update coefficients U = {u0, u1, …, un} must be generated successively according to the given name of a wavelet in order to ensure the generalization of the GPCF-LWT. P and U are 2D arrays with the same length n, and P[i] and U[i] correspond to the prediction and update coefficients for the ith prediction and update operations respectively. Within the GPCF-LWT, P and U serve as input parameters reflecting different wavelet types, such that GPCF-LWT can support the whole family of LWT. Based on LBB and the improved lifting scheme, Fig. 8 illustrates the main computational procedure of multi-level 2D forward LWT proposed in this study. It needs to generate a series of P (a 2D array) and U (a 2D array) according to the given name of a wavelet, and then copy these prediction and update coefficients together with the raw 2D signal data from a CPU memory space to a GPU global memory space; a specifically designed CUDA stream overlapping strategy for increasing the concurrency of this step



is presented in Section 4.4. In Fig. 8, the output results cA, cH, cV, and cD correspond to the approximation coefficients and the detail coefficients along the horizontal, vertical, and diagonal orientations respectively. Multi-level 2D LWT applies the approximation coefficients cA as the 2D raw input signal for the next computational level, and then a new 2D forward LWT iteration with the same computational steps and parameters as its previous level is started on this level. In this way, this procedure is executed recursively until the expected level has been reached, and the final results are the cA of the last level and all detail coefficients cH, cV, and cD collected from each computational level. The computational framework of LBB based inverse LWT can be constructed in a similar way to Fig. 8.

Fig. 8 The main computational procedure of improved lifting scheme with LBB for multi-level 2D forward LWT.

In this case, a raw surface measured dataset will be divided into several small data chunks which will be further parallelized by applying the pipeline overlapping strategy (see Section 4.4). A host function (i.e., the forward and inverse scheduler) executing on a CPU is responsible for launching the Predict() and Update() kernels on a GPU iteratively according to the series of prediction and update coefficients. Fig. 9 shows that a 2D LWT on GPCF-LWT has been realized through separating it into two 1D LWTs along the row and column orientations respectively, and multi-level LWT has been implemented by applying the single-level LWT recursively. Once the final results are obtained, inverse LWT can be performed to reconstruct the filtered engineering surfaces (e.g. roughness and waviness).

Fig. 9 An implementation overview of the GPCF-LWT on a single GPU card.

GPCF-LWT supports the whole family of wavelet transforms based on LBB and the improved lifting scheme. This paper presents two examples (the Haar and CDF(9,7) wavelets) to show the realization of LWT with any specific



wavelet by using GPCF-LWT generically. The factorizations of the polyphase matrixes of the wavelet filter coefficients conforming to the improved lifting scheme can be written as Eqs. (17) and (18) for Haar and CDF(9,7) respectively. According to Eqs. (17) and (18), P, U and K for Haar and CDF(9,7) can be acquired from Eqs. (19) and (20) respectively.

The coefficient U0 of both Haar and CDF(9,7) is zero, so the optimized computational procedure of single-level 1D forward LWT can be devised as in Fig. 10, where a host function is responsible for scheduling the executions of the predict and update kernels and providing different configuration and computational parameters for different types of wavelets, and the corresponding predict and update kernels are executed on a GPU card in parallel. The left side of Fig. 10 illustrates the computational procedure of the Haar wavelet transform, and the CDF(9,7) wavelet is illustrated on the right side. The inverse LWT implementations of Haar and CDF(9,7) are similar to the forward implementations; they just invert the corresponding sequences of the operation steps of the forward computations and switch the corresponding plus and subtraction operators.

Fig. 10 The scheduling algorithm of a single-level 1D forward LWT.



4.4 Three optimization strategies for GPCF-LWT implementation on a single GPU card

The GPCF-LWT further explores three implementation optimizations on a single GPU by seamlessly integrating the GPCF-LWT implementation with the latest CUDA GPU hardware advancements to maximize the parallelization of generic 2D LWT computations:

1) GPCF-LWT provides an optimization strategy that enables dynamic configuration of the execution parameters of kernels based on the CUDA thread hierarchy. In CUDA programs, a thread block must be loaded into an SM inseparably, and threads in a block will be further encapsulated into several warps. An SM executes all its warps (e.g., 4 warps in Fig. 11) conforming to the SIMD (Single Instruction, Multiple Data) model, such that all threads in a warp share the same program instructions but process different data at the same time. An inactive warp will be activated when an active warp in the same SM is pending (e.g., a warp will be pending when it needs to access data from the global memory, as this generally requires 400–600 cycles) in order to hide memory latency [20,21]. Thus, it is of vital importance to configure the number of CUDA threads in a block and organize them in an appropriate hierarchical structure (i.e. set applicable block and grid sizes) for each kernel, such that SMs can have just enough warps to schedule while limiting the complexity of thread management. Ideally, the number of threads in a block should be an integral multiple of 32 to match the fact that each warp has 32 threads for 32 CUDA cores in CUDA GPUs, so each SM can be arranged with just enough warps. In practice, an SM may reduce the number of blocks at runtime when too many CUDA threads are organized in a block and each thread occupies too many registers or a large shared memory space, and this often leads to the vital problem that there are not enough warps to be scheduled. With these considerations, this study tested and evaluated the underlying relationship between different numbers of threads in a block and the hardware consumption of GPCF-LWT, and found that the occupancy of CUDA cores in an SM can come up to 100% when 256 threads per block are configured in the GPCF-LWT. Thus, the block number (grid size, bn) of LWT is bn = (sn + 256 − 1)/256, where sn is the length of the input data, such that this bn with each block having 256 threads can keep all SMs of a GPU card busy all the time and hide the delay caused by data transfers (see the combined sketch after this list). To test and evaluate the validity and practicality of this implementation, this study compared the effects of different numbers of threads in a block to find the inherent correlation between the number of configured threads in a block and the processing time. Fig. 12 lists the processing times for different numbers of threads in a block for 4-level 2D LWT with a data size of 4096 × 4096 running on a single GTX 1080 GPU card (the hardware specification can be seen in Table 3). Fig. 12 shows that GPCF-LWT gains the minimum processing time when the number of threads in a block is set to 256.

2) GPCF-LWT has developed a hierarchical memory usage scheme for generic LWT computations based on the CUDA GPU memory hierarchy. CUDA adopts an SPMD (Single Program, Multiple Data) programming model and provides a sophisticated memory hierarchy (i.e. registers, local memory, shared memory, global memory, texture memory and constant memory, etc.). Hence, a GPU can achieve high data access rates through elaborately designed CUDA codes empowered by efficient usage of the different memories according to the respective data features, including access mode, size and format. GPCF-LWT calls the CUDA API function cudaHostAlloc() rather than malloc() to allocate CPU host memory in page-locked (pinned) memory mode, which can support DMA (Direct Memory Access) without requiring additional buffers for caching data. Compared with the page-able memory mode, the page-locked mode transfers data directly from a CPU host memory space to a GPU global memory space, so its write speed is 1.4 times faster and its read speed 1.8 times faster than the page-able mode [21]. However, pinned memory should not be overused, since its scarcity leads to reduced performance. With this in mind, this study pre-allocates a fixed size of pinned memory in the initialization stage and reuses it during the whole computational life-cycle, such that high memory performance is maintained throughout. Moreover, GPCF-LWT uses the shared memory of a GPU for executing kernels: instead of direct data accesses from a GPU global memory, a data point is loaded into a shared memory space of an SM once and only once from the global memory, and any loaded data point can then be accessed by several CUDA threads simultaneously at a very fast speed (generally it only requires 1–32 cycles). In comparison, the data access speed of a global memory will be dramatically reduced when several threads access the same data point in a global memory space repeatedly, and registers were also not applied here for a similar reason [20]. Another major reason for using the shared memory is that row addresses in a global memory are aligned, such that the operation of 32 CUDA threads in a warp loading data continuously from the global memory can be coalesced, i.e., loading 32 data points into a warp needs only one global memory access rather than 32 accesses in sequence, so this coalescing can enhance the data transfer bandwidth. Another important issue is where to store and access P and U: the constant memory or the global memory. Although both the constant memory and the global memory are in the same physical location of the GPU hardware architecture, the time consumption for accessing a data point from constant memory can be much lower than from global memory due to its inherent cache and broadcast mechanisms. For instance, a data point can be loaded from the cache of the constant memory when it is a cache hit, which in general costs just one cycle. Broadcast is another very useful tool to reduce data access delay, which means that a data point loaded by a thread will be broadcast to all the other threads of a warp. Thus, accessing a data point from constant memory can be as fast as a register when all threads of a warp access the same data point [21,43]. In this study, all CUDA threads use the same values of P and U, and they are constant and will not be changed while performing PO and UO, such that fast data accesses can be achieved by storing and accessing the P and U sets in a constant memory space. All in all, the memory hierarchy of GPCF-LWT for generic LWT computations can be organized as in Fig. 13 (see also the combined sketch after this list).

To test and evaluate the validity and practicality of this hierarchical memory usage scheme of GPCF-LWT, this study analyzed the performance factors influencing the 2D LWT computation on different memory spaces. Firstly, the experiment tested and compared the bandwidth of the pinned and page-able memory modes respectively by transferring 1 GB of measured data in both the H2D (from a CPU host memory space to a GPU global memory space) and D2H (from a GPU global memory space to a CPU host memory space) directions. In Fig. 14, the bandwidth of the pinned memory mode gains approximately 55% improvement compared with the page-able memory mode. Then, the experiment tested and compared the computational performance between the proposed hierarchical memory usage scheme and pure global memory usage. In the pure global memory solution, all data (raw measured datasets, P, U, intermediate results and the filtration output results) are stored in the global memory. In contrast, the proposed scheme stores different datasets in different memory spaces, i.e., it applies the global memory space to store the raw measured datasets and the filtration results, and the shared memory space to cache subsets and the temporary results, while the P and U coefficient instances are put in the constant memory space. Fig. 15 illustrates the processing times of these two solutions, and it can be seen from the experimental results that the proposed scheme gains about 50% computational performance enhancement.

3) GPCF-LWT enables a pipeline overlapping strategy by using CUDA streams. A CUDA GPU provides a non-blocking asynchronous execution mode for GPU kernels, such that the host codes on a CPU and CUDA codes on a GPU can be executed concurrently. Based on this


consideration, this study derives a CUDA stream overlapping strategy that can effectively alleviate the delay imposed by scheduling the sequential execution order of the lifting scheme. Each pipeline manages a CUDA stream, and a pipeline can organize n kernels running on it in different time slices, i.e., a pipeline will activate and run another kernel automatically when a kernel is pending [19]. Pure CUDA programs based on the non-overlapping model are variants of the following fundamental three stages: (1) transferring a raw dataset from a CPU host memory space to a GPU global memory space through PCI-E buses (i.e., H2D data transfers); (2) processing the raw dataset via a specific algorithm, i.e. GPU kernel executions (e.g., 2D LWT); (3) transferring the output results from the GPU global memory space to the CPU host memory space (D2H data transfers). Obviously, both the data transfer stages (i.e. H2D and D2H) and the kernel executions for data processing are time-consuming. More significantly, these three time-consuming stages are performed in sequence in the non-overlapping mode, and the sequential timelines for the three pipeline stages are illustrated in Fig. 16. The data transfers in Fig. 16 are blocking transfers and the host function running on a CPU can only launch GPU kernels after completion of the corresponding data transfers, so that both the PCI-E buses and the GPU waste a lot of time in the idle state during the entire workflow, which sharply decreases the processing performance of CUDA programs. To reduce these idle time slices, GPCF-LWT overlaps the three major time-consuming stages by using CUDA streams in the pipelines. In accordance with the hardware development that both GPUs and PCI-E support bidirectional data transfers, each pipeline in GPCF-LWT handles a repeated process of transferring a small data chunk to the GPU memory first and then launching LWT kernels to process this data chunk, until the whole dataset is processed. Fig. 17 shows the timelines of the two data transfer stages and the LWT kernel execution stage with the overlapping model, in which LWT kernels can be performed during the data transfer stages (i.e., the data transfers are hidden), such that a GPU can always stay in the busy state during the entire data processing workflow to gain higher performance than the non-overlapping model. This strategy can be extended to hide data transfer delays in multi-GPU systems gracefully. The CUDA API cudaMemcpyAsync() has been applied in this study to asynchronously transfer data from a CPU host memory to a GPU global memory in order to realize the overlapping between data transfers and kernel executions. The pseudocode of the devised overlapping strategy is listed in Algorithm 1, and the LBB based 2D LWT kernel is listed in Algorithm 2. Moreover, this aggregated strategy still applies on systems using NVLink technology, with two features: (1) both H2D and D2H data transfers are efficient, so stages 1 and 3 consume far less time; (2) these two stages can transfer a relatively large data chunk each time, such that stage 2 maintains high concurrency. With these new features, the GPCF-LWT can achieve better performance without extra effort.

Fig. 16 The timelines of the data transfers and LWT kernel executions in the non-overlapping model.

Fig. 17 The timelines for overlapping of three pipelines.
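The first two optimizations can be pictured together in a short host-side sketch (hypothetical names; it assumes a pinned host staging buffer allocated once, the P and U coefficient sets placed in constant memory, 256 threads per block and the grid size bn = (sn + 256 − 1)/256 as above):

#include <cuda_runtime.h>
#include <cstring>

#define MAX_COEFFS 16

// Prediction/update coefficients are read-only and identical for every thread,
// so constant memory serves them through its cache and broadcast mechanisms.
__constant__ float c_P[MAX_COEFFS];
__constant__ float c_U[MAX_COEFFS];

void setup_and_launch(const float *h_src, float *d_data, size_t sn,
                      const float *P, const float *U, int nPairs)
{
    // Pinned (page-locked) host buffer, allocated once in the initialization
    // stage and reused, so H2D copies can use DMA at full PCI-E bandwidth.
    static float *h_pinned = NULL;
    if (h_pinned == NULL)
        cudaHostAlloc((void **)&h_pinned, sn * sizeof(float), cudaHostAllocDefault);
    memcpy(h_pinned, h_src, sn * sizeof(float));

    cudaMemcpyToSymbol(c_P, P, nPairs * sizeof(float));
    cudaMemcpyToSymbol(c_U, U, nPairs * sizeof(float));
    cudaMemcpy(d_data, h_pinned, sn * sizeof(float), cudaMemcpyHostToDevice);

    // 256 threads per block keeps the SMs fully occupied for this workload.
    int threads = 256;
    int bn = (int)((sn + threads - 1) / threads);   // grid size bn = (sn+256-1)/256
    // LBB_LWT<<<bn, threads>>>(d_data, ...);       // kernel launch (see Algorithm 2)
    (void)bn;                                       // unused while the launch is elided
}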

Fig. 11 The hierarchy structure of CUDA threads.

Fig. 12 Processing times of 2D LWT on GPCF-LWT for different numbers of threads in a block.


Fig. 13 The hierarchical memory usage of GPCF-LWT.

Fig. 14 The bandwidth of the pinned and page-able memory modes respectively.

Fig. 15 The performance comparison of the two solutions.


Algorithm 1 Overlapping of three stages — two data transfers and kernel executions.

Inputs: a 2D raw dataset Data[][] with rows rows and cols columns.

// create nStreams CUDA streams
for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&stream[i]);
a_h = &Data[0][0];                          // start address of the host data
size = rows * cols / nStreams;              // number of elements handled per stream
for (int i = 0; i < nStreams; ++i) {
    offset = i * size;                      // offset of this stream's data chunk
    // stage 1: copy the chunk to the GPU in the ith stream (H2D)
    cudaMemcpyAsync(a_d + offset, a_h + offset, size * sizeof(float),
                    cudaMemcpyHostToDevice, stream[i]);
    // stage 2: launch the LBB based LWT kernel on this chunk in the ith stream
    LBB_LWT<<<grid, block, 0, stream[i]>>>(a_d + offset);
    // stage 3: copy the results back to the CPU in the ith stream (D2H)
    cudaMemcpyAsync(a_h + offset, a_d + offset, size * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[i]);
}

Algorithm 2 LBB based 2D LWT kernel.

Inputs: a 2D raw dataset Data[][] with rows rows and cols columns.
Outputs: cA, cH, cV, and cD.

// split, preprocessing, predict, update, load_window and save are device helper
// functions implementing the improved lifting scheme of Section 4.2.
__global__ void LBB_LWT(float *data, int height, int width)
{
    __shared__ float sData[SLIDE_HEIGHT][SLIDE_WIDTH];
    float *even, *odd;

    // for each sliding window of this thread block's column strip
    for (int row = 0; row < height; row += SLIDE_HEIGHT) {
        // load the raw data of the window from GPU global memory into shared memory
        load_window(sData, data, row, width);
        __syncthreads();

        // each group of row threads performs horizontal 1D LWT in parallel on a
        // cached row and keeps the temporary results in shared memory
        split(sData, &even, &odd);
        preprocessing(even, odd);
        __syncthreads();
        predict(even, odd);
        update(even, odd);
        __syncthreads();

        // each group of column threads performs vertical 1D LWT in parallel on
        // the cached temporary results
        split(sData, &even, &odd);
        preprocessing(even, odd);
        __syncthreads();
        predict(even, odd);
        update(even, odd);
        __syncthreads();

        // save the output results (cA, cH, cV, cD of this window) back to a GPU
        // global memory space
        save(even, odd, data, row, width);
        __syncthreads();
    }
}

To test and evaluate the validity and practicality of this pipeline overlapping strategy for LWT computations, this study compared the performance of the overlapping and non-overlapping implementations by using a single GTX 1080 GPU card and performing 4-level 2D LWT with the D4 wavelet (data size 4096 × 4096). This experiment recorded the processing time of each step, i.e., H2D, data processing on the GPU card (CUDA kernel executions), and D2H. Based on the experimental results listed in Table 1, it can be seen that the overall computational time of the non-overlapping implementation is the sum of the time consumptions of these three steps, since the steps are performed in a sequential order. By contrast, the overall computational times of the overlapping implementations are steadily less than the sum of these three steps as they are performed concurrently. Thus, the overlapping optimization strategy can reduce the overall computational time (Table 1).

Table 1 The performance comparison between the overlapping and non-overlapping implementations (ms).

Program routines   Strategies        H2D   Kernel executions   D2H   Overall processing time
FLWT               Non-overlapping   35    193                 32    260
                   Overlapping       33    197                 32    213
ILWT               Non-overlapping   33    215                 38    286
                   Overlapping       34    206                 35    225

5 The multi-GPU solution of GPCF-LWT

The powerful computing capability of a single GPU satisfies the computational demands of many applications. However, the limited memory space and CUDA cores of a single GPU card make it impossible to process a complete dataset in online surface texture measurement applications. Thus, dealing with large-scale datasets requires the multi-GPU processing mode. At present, there are two categories of commonly used multi-GPU systems, i.e., standalone computers (a single CPU node with multiple GPU cards) and cluster distributed systems (multiple CPU nodes, each accompanied by one or more GPU cards). Both can accelerate data processing greatly and gain better computational performance. However, limited by the low-speed PCI-E buses and network connections, big data transfers between a CPU host memory space and a GPU global memory space become the key bottleneck of multi-GPU systems. Based on this consideration, reducing or hiding data transfer delays is the most important optimization strategy for obtaining high performance. More importantly, since the major cost of LWT lies in multiplications, additions and subtractions, and it shares only nearby data (i.e., GPCF-LWT does not need to share datasets across the network), the communication requirements among distributed computer nodes in LWT applications are very low. From this point of view, a standalone heterogeneous multi-GPU system has been adopted for GPCF-LWT. This study tries to reduce the idle time slices of the PCI-E bus and GPU kernel executions, and to improve the computational load balancing over the multiple GPU cards for the multi-GPU GPCF-LWT implementation. To reduce idle time slices, this study made two improvements on GPCF-LWT: (1) maximizing the utilization rate of the PCI-E bus bandwidth; (2) hiding data transfer delays by keeping every GPU card busy during the whole data processing workflow. For the first point, GPCF-LWT applies bidirectional data transfers on the PCI-E bus, whose bandwidth can support both H2D and D2H at the same time. For the second point, the pipeline overlapping strategy presented in Section 4.4 has been applied on every GPU card in the heterogeneous multi-GPU system. Furthermore, in GPCF-LWT, data transfers between a CPU and a GPU only need to read the data once, and no intermediate computational results are transferred between a CPU and a GPU or among different GPUs. Thus, this multi-GPU based GPCF-LWT does not require any extra communication and incurs no extra data transfer delay during the whole computational procedure. Nevertheless, traditional dataset division approaches are likely to cause a load unbalancing problem and a low rate of GPU hardware utilization.
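The following host-side sketch shows one plausible structure for this per-card arrangement; it is an assumption for illustration, not the framework's actual code. Each GPU receives its own group of rows (decided by the scheduler described in Section 5.2) and runs its copy/compute pipeline on its own stream, so no inter-GPU communication is required for the LWT itself.

#include <cuda_runtime.h>
#include <vector>

// h_pinned must be pinned (page-locked) host memory; rowsPerGpu comes from the DLB scheduler.
void dispatch_groups(float *h_pinned, int rows, int cols,
                     const std::vector<int> &rowsPerGpu) {
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);

    int rowOffset = 0;
    std::vector<float*> d_bufs(nGpus, nullptr);
    std::vector<cudaStream_t> streams(nGpus);

    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);                              // all following calls target GPU g
        const size_t bytes = (size_t)rowsPerGpu[g] * cols * sizeof(float);
        cudaMalloc((void**)&d_bufs[g], bytes);
        cudaStreamCreate(&streams[g]);
        // The H2D copy (and a kernel launch, omitted because the kernel is defined elsewhere)
        // are asynchronous, so the loop immediately moves on to feed the next GPU.
        cudaMemcpyAsync(d_bufs[g], h_pinned + (size_t)rowOffset * cols, bytes,
                        cudaMemcpyHostToDevice, streams[g]);
        // lbb_lwt_kernel<<<grid, block, 0, streams[g]>>>(d_bufs[g], rowsPerGpu[g], cols);
        cudaMemcpyAsync(h_pinned + (size_t)rowOffset * cols, d_bufs[g], bytes,
                        cudaMemcpyDeviceToHost, streams[g]);
        rowOffset += rowsPerGpu[g];
    }
    for (int g = 0; g < nGpus; ++g) {                  // wait for every card to finish
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
        cudaStreamDestroy(streams[g]);
        cudaFree(d_bufs[g]);
    }
}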

5.1 Load unbalancing problem

Basically, due to the separable property of 2D LWT, a simple load balancing model based on the pure dataset division method can be derived [23]. This solution is simple and useful, but it may lead to a load unbalancing problem when a multi-GPU system contains heterogeneous GPU cards with unequal computing capabilities. As a result, the overall performance of a multi-GPU system depends on the GPU node that has the lowest performance. Moreover, 2D LWT applications are data-oriented computations, i.e., huge data processing with a limited number of CUDA kernels. Thus, this study devised a data-oriented dynamic load balancing (DLB) model for rapid and dynamic dataset division and allocation on heterogeneous GPU nodes by using a fuzzy neural network (FNN) framework [38]. The data-oriented DLB model can minimize the overall processing time by dynamically adjusting the number of data subsets in a group for each GPU node according to runtime feedback. Here, two important improvements on the FNN of the authors' previous work [38] are presented to enhance the effectiveness of the DLB model.

5.2 Improved FNN model

The overall framework of the FNN based DLB model devised in the previous work [38] is illustrated in Fig. 18. In this model, five state feedback parameters (fuzzy sets) relating closely to the overall computational performance are defined, i.e., floating-point operation performance (F), global memory size (M), parallel ability (P), the occupancy rate of the computational resources of a GPU (UF) and the occupancy rate of the global memory space of a GPU (UM). The predictor in the FNN is devised to predict the relative computational ability under different workload scenarios, and the scheduler reorganizes the data groups for each GPU card according to the dynamic feedback of the predictor. In the beginning, a raw dataset is divided into several equal-sized data chunks (the chunk size is relatively small, and the number of data chunks should be far greater than the number of GPU nodes) and these chunks are organized into n groups (n is equal to the number of GPU cards in the multi-GPU system) by the scheduler. At runtime, the number of data chunks in the group assigned to each GPU card is different, and it is determined dynamically by the real-time feedback (i.e., the five state feedback parameters) of that GPU card. Thus, the DLB model minimizes the overall processing time by dynamically adjusting the number of data chunks in a group for each GPU at runtime.
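As a simple illustration of the grouping step only (an assumed sketch; the real scheduler is driven by the FNN predictor and the eight feedback parameters described below), the number of chunks handed to each GPU can be made proportional to its predicted relative computational ability:

#include <cmath>
#include <vector>

// cp[i] is the predicted relative computational ability of GPU i; totalChunks is the
// number of equal-sized data chunks. Returns how many chunks go into each GPU's group.
std::vector<int> groupChunks(const std::vector<double> &cp, int totalChunks) {
    double sum = 0.0;
    for (double c : cp) sum += c;

    std::vector<int> counts(cp.size(), 0);
    int assigned = 0;
    for (std::size_t i = 0; i < cp.size(); ++i) {
        counts[i] = static_cast<int>(std::floor(totalChunks * cp[i] / sum));
        assigned += counts[i];
    }
    // Hand any remainder left by rounding down to the most capable GPU.
    std::size_t best = 0;
    for (std::size_t i = 1; i < cp.size(); ++i) if (cp[i] > cp[best]) best = i;
    counts[best] += totalChunks - assigned;
    return counts;
}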

Based on the earlier outputs, this study further explored two improvements on the FNN to increase the efficiency and prediction accuracy of the devised DLB model:

1) This study constructed three further state feedback parameters to improve the accuracy of predicting the nth relative computational ability CPni of the ith GPU node, i.e., memory speed (MS), boost clock (BC) and the last relative computational ability (CPn-1i, or LCPni). Both memory speed and boost clock are core hardware parameters affecting the computing capability of CUDA GPU cards. For example, the memory speed of a GeForce GTX 1080 GPU card is 10 Gbps (gigabits per second) and its boost clock is 1607 MHz, whereas the GeForce 750 Ti GPU card runs at 1.02 Gbps and 1085 MHz, respectively. There is also a strong correlation between CPn-1i and CPni: when the current CPn-1i is very high, more subsets will be organized into the data group of GPUi, which keeps GPUi busy with high UM and UF values and in turn leads to a low CPni. Thus, the balance between CPn-1i and CPni needs to be considered in predictions. The improved DLB model now has eight feedback parameters, and they are all fuzzified into "high" and "low" fuzzy subsets; the three new kinds of fuzzy sets and subsets are listed in Table 2.

2) This study improves the fuzzy subset descriptions by providing a comprehensive membership function adaption mechanism. In the former DLB model, all fuzzy sets were fuzzified by one membership function, the sigmoid. Here, the static parameters (i.e., F, M, P, MS and BC) still use the sigmoid membership function, whereas the dynamic feedback parameters (i.e., UF, UM, CP and LCP) adopt an innovative membership function. This membership function has been derived from the softplus, processed by shift and scalar operations so as to satisfy the fundamental properties of a fuzzy membership function. The softplus has become a mainstream activation function in deep learning networks [44]; it can be formulated as Eq. (21) and its curve is illustrated in Fig. 19. It grows very slowly when x is relatively small, and it increases rapidly once x exceeds a threshold value (e.g., 1 in Fig. 19). This behaviour can model the change of computational performance of a GPU node: when UF and UM are relatively low, the computing capability is almost unchanged, and it drops sharply when UF and UM are larger than the threshold values. However, the softplus cannot be used as a membership function directly, because Eq. (21) fails to satisfy f(x) ∈ [0,1] and x ∈ [0,1].

Fig. 18 The overall framework of the FNN based DLB model.

Table 2 Three kinds of fuzzy sets and subsets.

Sets   Descriptions                               Fuzzy subsets   Descriptions of fuzzy subsets
MS     Memory speed                               MSL             Low
                                                  MSH             High
BC     Boost clock                                BCL             Low
                                                  BCH             High
LCP    The last relative computational ability    LCPL            Low
                                                  LCPH            High

Table 3 The specification of the heterogeneous multi-GPU system.

Property   Description
CPU        Intel Core i7-4790, 3.6 GHz
Memory     16 GB
GPU        GPU1: NVIDIA GeForce GTX 750 Ti, 2 GB, 5 × SM, 128 SP/SM, 1.02 Gbps memory speed, 1085 MHz boost clock
           3 × GPU2: NVIDIA GeForce GTX 1080, 8 GB, 20 × SM, 128 SP/SM, 10 Gbps memory speed, 1607 MHz boost clock
OS         Windows 10, 64-bit
CUDA       Version 8.0

f(x) = ln(1 + e^x)    (21)


To remedy this defect, this study performs shift and scalar operations on the softplus function, so that the softplus can be extended as Eq. (22),

where a is a scalar factor and b is a shift factor, and Eq. (22) must satisfy the constraint in Eq. (23).

By adding a normalization factor, Eq. (22) can be redefined as Eq. (24), where a and b can take the values of 5 and −3, respectively. Fig. 20 shows the curve of the improved softplus.
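The typeset equations (22)-(24) are not reproduced in this text. The following LaTeX fragment is a hedged reconstruction based purely on the description above (a scaled and shifted softplus, the membership constraint, and a normalized form with a = 5 and b = -3); the exact published formulas may differ.

% Hedged reconstruction of Eqs. (22)-(24); an assumption, not the published equations.
\begin{align}
  f(x)   &= \ln\left(1 + e^{a x + b}\right) \tag{22}\\
  f(x)   &\in [0, 1], \quad x \in [0, 1] \tag{23}\\
  \mu(x) &= \frac{\ln\left(1 + e^{a x + b}\right)}{\ln\left(1 + e^{a + b}\right)},
            \quad a = 5,\ b = -3 \tag{24}
\end{align}

With these values, the reconstructed membership function stays near zero for small occupancy values and rises steeply towards 1 as x approaches 1, matching the behaviour described for Fig. 20.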

Based on the three new state feedback parameters and the improved softplus, this study constructed a more effective DLB model that integrates fuzzy theory, artificial neural networks (ANN) and the back propagation algorithm. The comparison experiment between the improved DLB model and the conventional DLB model of [38] is presented in Section 6.2 (the fifth experiment).

Fig. 19 The curve of the softplus function.

(22)

(23)

(24)

Fig. 20 The curve of the improved softplus function.


6 Test and performance evaluation

6.1 Hardware and testbed

Table 3 specifies the heterogeneous multi-GPU system constructed for testing and evaluating GPCF-LWT through the following experiments and the applicable case study. It contains four GPU cards of two different types: a middle-low range GPU (NVIDIA GeForce GTX 750 Ti) and three high-end GPUs (NVIDIA GeForce GTX 1080). The GTX 750 Ti and GTX 1080 contain 640 and 2560 CUDA cores, respectively.

6.2 Benchmarking experiments

This section tests and evaluates the computational performance and practicability of GPCF-LWT under both a heterogeneous multi-GPU setting and a high-end single GPU setting in five experiments:

1) The first experiment compares the computational performances of the LBB, RC and RC&LB [4] models based on the classic lifting scheme by using the single GPU setting (a GTX 1080 GPU card). The LBB, RC and RC&LB have been tested on the 4-level CDF(9,7) forward and inverse 2D LWT, with data sizes ranging from 1024 × 1024 to 5120 × 5120. Fig. 21 shows that, with increasing data size, the performance of RC plunges quickly due to its intrinsic synchronization operations. RC&LB shows high computational performance when the data sizes are relatively small (e.g., equal to or smaller than 2048 × 2048), but its processing time increases drastically once the data size grows because of extra synchronization calls. In contrast, the devised LBB model can maintain high performance over a larger data range. This experiment demonstrated that the LBB model in the GPCF-LWT framework is superior and gains up to 14 times speedup in comparison with RC, and up to 7 times when compared with RC&LB, under a single GPU configuration.

2) The second experiment tests the effectiveness of the improved lifting scheme by using the single GPU setting. It also performs the 4-level CDF(9,7) forward and inverse 2D LWT. As shown in Fig. 22, the improved lifting scheme gains higher computational performance since it reduces half of the required synchronizations in the lifting step (step 3) (see Section 4.2). It substantially increases the parallelizability of LWT computations.

3) To evaluate the generalization feature of GPCF-LWT, the third experiment tests the computational performance of six types of wavelets, namely Haar, D4, D8, D20, CDF(7,5) and CDF(9,7), using the classic lifting scheme with RC, the classic lifting scheme with RC&LB, and the improved lifting scheme with LBB. A single GTX 1080 GPU card is used for a 4-level 2D LWT with data size 4096 × 4096. In this experiment, different wavelet transforms have been performed in a generic manner with different U, P and K for different wavelets. The experimental results highlight that the computational performance of GPCF-LWT is significantly higher than that of the corresponding RC and RC&LB based implementations, see Fig. 23.

4) The fourth experiment evaluates the effectiveness of the three optimization strategies equipped in GPCF-LWT, again on a single GTX 1080 GPU card. Since each strategy has been tested separately in Section 4.4, this experiment focuses on the comparison between the LBB with the hybrid strategy (the combination of the three optimizations) and the LBB without any optimization. Fig. 24 shows that the hybrid approach gains noticeable improvements when compared to the pure LBB.

5) The fifth experiment further examined the computational performance of GPCF-LWT when applied to two single GPU settings and three multi-GPU settings by using the heterogeneous GPU system listed in Table 3. This experiment adopted a CDF(9,7) wavelet to perform the 4-level 2D forward and inverse LWT on large data sizes ranging from 10,240 × 10,240 to 16,384 × 16,384. Fig. 25 lists the processing times of the five test settings, i.e., the single GTX 750 Ti GPU setting (setting1), the single GTX 1080 GPU setting (setting2), the pure dataset division based multi-GPU setting (setting3), the original DLB based multi-GPU setting (setting4) [38] and the improved DLB based multi-GPU setting (setting5). It can be seen from Fig. 25 that the setting1 implementation has the lowest computational performance, followed by setting2, since the GTX 1080 GPU contains more CUDA cores (2560), a larger global memory space (8 GB) and a higher data transfer speed (10 Gbps) than the GTX 750 Ti GPU (640, 2 GB and 5.4 Gbps, respectively). However, the computational performance improvement of setting2 is very limited, which proves that simply switching to a better GPU card will not yield a linear computational performance improvement for large-scale computational applications with extremely large input volumes and highly repetitive simple operations or iterations, specifically for 2D LWT. The setting3 implementation gains a higher computational performance than the two single GPU implementations, by about 2 and 3 times. In contrast, the DLB-driven GPCF-LWT multi-GPU implementations (setting4 and setting5) achieve significant computational performance enhancements, since the FNN based DLB can allocate data tasks dynamically at runtime according to the relative computational ability across all GPU nodes. Moreover, the improved DLB (setting5) has gained better performance than the original version (setting4) for all data sizes due to the softplus based membership function, which is more suitable for adapting to the dynamic change of computational performance of heterogeneous GPU nodes. The peak performance gain (speedup on the data size of 16,384 × 16,384) of setting5 has reached approximately 10 times more than setting3; and when compared with the two single GPU implementations, the speedup from setting5 has reached 15 and 20 times over setting1 and setting2, respectively.


Fig. 21 The performance comparison among the LBB, RC and RC&LB models.

Fig. 22 The performance comparison between the classic and improved lifting schemes.


Fig. 23 The generalization feature evaluation of GPCF-LWT.


6.3 A case study on online engineering surface filtration

This section presents a case study on online engineering surface filtration to test the usability and efficiency of the devised multi-GPU GPCF-LWT in comparison with other state-of-the-art multi-GPU approaches. In the field of areal surface texture characterization extraction, geometrical characteristic signals relating closely to functional requirements can be extracted precisely from wavelet coefficients by using different transmission bands. The 2D forward LWT computations are performed on the raw 2D measured surface datasets to obtain the detail coefficients Di at all levels and the approximation coefficients cA from the last level, and the cAi of each level can be obtained while performing the 2D inverse LWT. Based on the Di and cAi instances, the geometrical characteristics (i.e., roughness, waviness and form) of a surface texture can be obtained by calculating the inverse LWT defined in Eq. (25) [10,40].

In order to extract the roughness η(x,y) and waviness η′(x,y) from a surface texture profile, the cAi must be set to zero. Similarly, the Di must be set to zero for extracting the form characteristic.
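A host-side sketch of this coefficient manipulation is given below. The forward_lwt2d and inverse_lwt2d interfaces and the Decomposition container are assumptions introduced only for illustration (the framework's real entry points are not named in the text); the zeroing of cA before reconstruction is the step described above for isolating the fine-scale (roughness) content.

#include <algorithm>
#include <vector>

// Assumed coefficient container: detail sets D1..DL plus the last-level approximation cA.
struct Decomposition {
    std::vector<std::vector<float>> details;   // flattened cH/cV/cD per level
    std::vector<float> approximation;          // cA at the deepest level
};

// Assumed interfaces to the (GPU-accelerated) forward and inverse 2D LWT, provided elsewhere.
Decomposition forward_lwt2d(const std::vector<float> &surface, int rows, int cols, int levels);
std::vector<float> inverse_lwt2d(const Decomposition &coeffs, int rows, int cols);

// Roughness extraction: zero cA so that only the detail (fine-scale) content is reconstructed.
std::vector<float> extract_roughness(const std::vector<float> &surface, int rows, int cols, int levels) {
    Decomposition coeffs = forward_lwt2d(surface, rows, cols, levels);
    std::fill(coeffs.approximation.begin(), coeffs.approximation.end(), 0.0f);
    return inverse_lwt2d(coeffs, rows, cols);
}
// Extracting the form component would instead zero every detail set Di and keep cA.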

Fig. 24 The effectiveness experiment of the three optimization strategies in GPCF-LWT.

Fig. 25 The comparison of processing times of the setting1, setting2, setting3, setting4 and setting5 implementations.

(25)


This study selected two kinds of raw engineering surfaces (surfaces on ceramic femoral heads and on milled metallic workpieces) to test and evaluate GPCF-LWT. The GPCF-LWT integrates the devised LBB, the improved lifting scheme, the three implementation optimizations, and the improved data-oriented DLB. This analysis also evaluates SURFSTAND, a surface characterization system employing a modern CPU core [10], and two other multi-GPU implementations: RC&LB and the method proposed by Ikuzawa et al. [45]. It should be noted that the pure dataset division method was applied for the two other multi-GPU implementations. The heterogeneous multi-GPU system listed in Table 3 is used for all three multi-GPU implementations.

Figs. 26a and 27a show raw measured datasets from the functional surface of a ceramic femoral head and a milled metallic workpiece respectively, each with a sampling size of 4096 × 4096. In Fig. 26a, the measured surface texture has two different types of scratches produced by the grinding manufacturing process: one consists of regular and shallow scratches while the other is a random, deeper scratch. The six-level D4 wavelet has been used in this case to perform both the forward and inverse 2D LWT. Fig. 26b-d demonstrate the roughness, waviness and form characteristics extracted from the femoral head surface by using the inverse LWT defined in Eq. (25). In the case of the milled metallic surface (see Fig. 27a), the six-level CDF(9,7) wavelet has been used to perform the 2D LWT, and Fig. 27b-d demonstrate the roughness, waviness and form characteristics extracted from the milled metallic surface by using the inverse LWT defined in Eq. (25).

It is worth mentioning that CUDA supports the single float (32-bit, abbreviated as FP32) and the half float (16-bit, FP16) precisions [19,46]. This experiment tested both precisions. It was found that float precision settings do not affect the computational performance of 2D LWT on the CPU, because the Intel Core i7-4790 CPU uses 64-bit instructions and both FP16 and FP32 operations are executed in the 64-bit mode with negligible difference. In contrast, the FP16 based multi-GPU implementations show higher computational performance than the FP32 implementations. However, the lower precision of FP16 fails to fulfil the high precision requirement of micro- and nano-surface texture metrology, so this case study only applies FP32 for evaluation. The results show that the multi-GPU GPCF-LWT and pure CPU (Intel Core i7-4790, 3.6 GHz) implementations have the same filtration effects in both precision and accuracy aspects. Table 4 shows that the multi-GPU implementations can gain over 20 times speedup over the CPU implementations.

Fig. 26 The functional surface texture of a ceramic femoral head (panels: the rough surface, the wavy surface and the form surface).

Fig. 27 The functional surface texture of a milled metallic workpiece (panels: the rough surface, the wavy surface and the form surface).


Table 4 The performance comparison between the CPU and the multi-GPU implementations (ms).

Datasets                                       LWT                    ILWT
                                               CPU       GPCF-LWT     CPU       GPCF-LWT
Ceramic femoral head (4096 × 4096)             10,165    415          12,590    450
Ceramic femoral head (12,288 × 12,288)         28,569    756          34,681    832
Milled metallic workpiece (4096 × 4096)        12,850    510          12,685    511
Milled metallic workpiece (12,288 × 12,288)    30,256    748          35,647    786

Fig. 28 shows the performance comparison among GPCF-LWT, RC&LB and Ikuzawa's solution on the heterogeneous multi-GPU system detailed in Table 3. In this case, the testing dataset has been expanded to the size of 12,288 × 12,288 to evaluate whether GPCF-LWT can deal with large-scale datasets in the online manner. The experimental results show that RC&LB has the lowest computational performance due to its heavy synchronization calls. Ikuzawa's solution gains better performance owing to its more efficient memory usage, though both still suffer from the heavy synchronization, pipeline cluttering and data overload problems. In contrast, GPCF-LWT with the advanced optimizations gains up to 12 times improvement in processing speed over the other two multi-GPU approaches.

In summary, the multi-GPU based GPCF-LWT has gained a significant improvement in computational performance for 2D LWT. It can process a very large dataset (e.g., 12,288 × 12,288) in less than one second, so that the performance requirements of online (or near real-time), large-scale and data intensive surface texture filtration applications can be satisfied gracefully.

Fig. 28 The performance comparisons of the three multi-GPU implementations (ms), grouped by dataset (ceramic femoral head and milled metallic workpiece).


7 Conclusions and future work

To fulfil the online and big data requirements in micro- and nano-surface metrology, the parallel computational power of modern GPUs has been explored. This research devised a generic parallel computational framework for lifting wavelet transform (GPCF-LWT) acceleration based on a heterogeneous multi-GPU system. One of the outcomes of this innovative infrastructure is a new parallel computational model named LBB, which alleviates the vital problem of shared memory shortage in a single GPU card while ensuring the generalization of LWT in the context of the CUDA memory hierarchy. To increase the parallelizability of LWT, an improved lifting scheme has also been developed. Three optimization strategies for the single GPU configuration were validated and then integrated. To further improve the computational performance of GPCF-LWT, it has been extended to compute on a heterogeneous multi-GPU system with a fuzzy theory based load balancing model. This study further explored two improvements on the previously devised FNN DLB model to increase its efficiency and prediction accuracy. Experiments show that the proposed GPCF-LWT can achieve superior and substantial computational performance gains when compared with other state-of-the-art techniques. Over 20 times speedup has been observed over the pure CPU solutions, and up to 12 times over other benchmarking multi-GPU solutions with large-scale surface measurement datasets. The innovative framework and its corresponding techniques have addressed the key challenges of 2D LWT applications that are vital for big surface dataset processing.

One avenue opened up during the study for future exploration is to introduce the deep learning paradigm across the GPU and CPU boundary. Especially when facing cell-CPUs or CPU clusters, a hybrid and efficient data distribution scheme, as well as an online surface texture analysis platform with smart recommendation functions for filter selection, will be of great value for practitioners.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments
This work is supported by the NSFC (61203172), the Sichuan Science and Technology Programs (2018JY0146 and 2019YFH0187), and "598649-EPP-1-2018-1-FR-EPPKA2-CBHE-JP" from the European Commission.

References

[1] X.J. Jiang and D.J. Whitehouse, Technological shifts in surface metrology, CIRP Ann. 61 (2), 2012, 815-836, https://doi.org/10.1016/j.cirp.2012.05.009.
[2] D.A. Axinte, N. Gindy, et al., Process monitoring to assist the workpiece surface quality in machining, Int. J. Mach. Tools Manuf. 44 (10), 2004, 1091-1108, https://doi.org/10.1016/j.ijmachtools.2004.02.020.
[3] X. Jiang, P.J. Scott, et al., Paradigm shifts in surface metrology. Part I. Historical philosophy, Proc. R. Soc. A Math. Phys. Eng. Sci. 463 (2085), 2007, 2049-2070, https://doi.org/10.1098/rspa.2007.1874.
[4] W.J. van der Laan, A.C. Jalba, et al., Accelerating wavelet lifting on graphics hardware using CUDA, IEEE Trans. Parallel Distrib. Syst. 22 (1), 2011, 132-146, https://doi.org/10.1109/TPDS.2010.143.
[5] J.A. Castelar, C.A. Angulo, et al., Parallel decompression of seismic data on GPU using a lifting wavelet algorithm, Signal Process. Images Comput. Vis., 2015, IEEE.
[6] S. Lou, W.H. Zeng, et al., Robust filtration techniques in geometrical metrology and their comparison, Int. J. Autom. Comput. 10 (1), 2013, 1-8, https://doi.org/10.1007/s11633-013-0690-4.
[7] H.S. Abdul-Rahman, S. Lou, et al., Freeform texture representation and characterisation based on triangular mesh projection techniques, Measurement 92, 2016, 172-182, https://doi.org/10.1016/j.measurement.2016.06.005.
[8] ISO 16610-61:2015, Geometrical product specification (GPS) - filtration - part 61: linear areal filters - Gaussian filters, ISO, 2015, https://www.iso.org/standard/60813.html.
[9] ASME B46.1-2009, Surface texture: surface roughness, waviness, and lay, ASME, 2009.
[10] P. Nejedly, F. Plesinger, et al., CudaFilters: a SignalPlant library for GPU-accelerated FFT and FIR filtering, Softw. Pract. Exp. 48 (1), 2018, 3-9, https://doi.org/10.1002/spe.2507.
[11] X. Jiang, P.J. Scott, et al., Paradigm shifts in surface metrology. Part II. The current shift, Proc. R. Soc. A Math. Phys. Eng. Sci. 463 (2085), 2007, 2071-2099, https://doi.org/10.1098/rspa.2007.1873.
[12] W. Sweldens, The lifting scheme: a construction of second generation wavelets, SIAM J. Math. Anal. 29 (2), 1998, 511-546.
[13] S. Xu, W. Xue, et al., Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform, J. Supercomput. 63 (3), 2013, 710-721, https://doi.org/10.1007/s11227-011-0626-0.


[14] H. Wang, H. Peng, et al., A survey of GPU-based acceleration techniques in MRI reconstructions, Quant. Imaging Med. Surg. 8 (2), 2018, 196-208, https://doi.org/10.21037/qims.2018.03.07.
[15] S. Mittal and J.S. Vetter, A survey of CPU-GPU heterogeneous computing techniques, ACM Comput. Surv. 47 (4), 2015, 1-35, https://doi.org/10.1145/2788396.
[16] M. McCool, Signal processing and general-purpose computing and GPUs, IEEE Signal Process. Mag. 24 (3), 2007, 109-114, https://doi.org/10.1109/MSP.2007.361608.
[17] Y. Su, Z. Xu, et al., GPGPU-based Gaussian filtering for surface metrological data processing, In: 2008 12th Int. Conf. Inf. Vis., 2008, IEEE, 94-99, https://doi.org/10.1109/IV.2008.14.
[18] NVIDIA, CUDA C Programming Guide v9.0, http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[19] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, 2012, Newnes.
[20] NVIDIA, CUDA C Best Practices Guide v9.0, NVIDIA, http://docs.nvidia.com/cuda/cuda-c-best-practices-guide.
[21] D.B. Kirk and W.H. Wen-Mei, Programming Massively Parallel Processors: A Hands-on Approach, 2012, Morgan Kaufmann.
[22] R. Couturier, Designing Scientific Applications on GPUs, 2013, CRC Press.
[23] D. Foley and J. Danskin, Ultra-performance Pascal GPU and NVLink interconnect, IEEE Micro 37 (2), 2017, 7-17, https://doi.org/10.1109/MM.2017.37.
[24] M. Hopf and T. Ertl, Hardware accelerated wavelet transformations, Data Vis., 2000, Springer, 93-103, https://doi.org/10.1007/978-3-7091-6783-0_10.
[25] Tien-Tsin Wong, Chi-Sing Leung, et al., Discrete wavelet transform on consumer-level graphics hardware, IEEE Trans. Multimed. 9 (3), 2007, 668-673, https://doi.org/10.1109/TMM.2006.887994.
[26] J. Franco, G. Bernabé, et al., The 2D wavelet transform on emerging architectures: GPUs and multicores, J. Real-Time Image Process. 7 (3), 2012, 145-152, https://doi.org/10.1007/s11554-011-0224-7.
[27] C. Song, Y. Li, et al., A GPU-accelerated wavelet decompression system with SPIHT and Reed-Solomon decoding for satellite images, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 4 (3), 2011, 683-690, https://doi.org/10.1109/JSTARS.2011.2159962.
[28] L. Zhu, Y. Zhou, et al., Parallel multi-level 2D-DWT on CUDA GPUs and its application in ring artifact removal, Concurr. Comput. Pract. Exp. 27 (17), 2015, 5188-5202, https://doi.org/10.1002/cpe.3559.
[29] L. Blunt and X. Jiang, Advanced Techniques for Assessment Surface Topography: Development of a Basis for 3D Surface Texture Standards "Surfstand", 2003, Elsevier.
[30] Wang Jianjun, Lu Wenlong, et al., High-speed parallel wavelet algorithm based on CUDA and its application in three-dimensional surface texture analysis, In: Int. Conf. Electr. Inf. Control Eng., 2011, IEEE, 2249-2252, https://doi.org/10.1109/ICEICE.2011.5778225.
[31] Y. Chen, X. Cui, et al., Large-scale FFT on GPU clusters, In: Proc. 24th ACM Int. Conf. Supercomput., New York, USA, 2010, ACM Press, https://doi.org/10.1145/1810085.1810128.
[32] A. Nukada, Y. Maruyama, et al., High performance 3-D FFT using multiple CUDA GPUs, In: Proc. 5th Annu. Work. Gen. Purp. Process. with Graph. Process. Units - GPGPU, New York, USA, 2012, ACM Press, 57-63, https://doi.org/10.1145/2159430.2159437.
[33] S. Schaetz and M. Uecker, A multi-GPU programming library for real-time applications, In: Int. Conf. Algorithms Archit. Parallel Process., 2012, Springer, 114-128.
[34] J.A. Stuart and J.D. Owens, Multi-GPU MapReduce on GPU clusters, In: 2011 IEEE Int. Parallel Distrib. Process. Symp., 2011, IEEE, 1068-1079, https://doi.org/10.1109/IPDPS.2011.102.
[35] M. Grossman, M. Breternitz, et al., HadoopCL: MapReduce on distributed heterogeneous platforms through seamless integration of Hadoop and OpenCL, In: 2013 IEEE Int. Symp. Parallel Distrib. Process., 2013, IEEE, 1918-1927, https://doi.org/10.1109/IPDPSW.2013.246.
[36] M. Abadi, A. Agarwal, et al., TensorFlow: large-scale machine learning on heterogeneous distributed systems, TensorFlow whitepaper, 2016, 1-19.
[37] S. Chintala, An overview of deep learning frameworks and an introduction to PyTorch, 2017, https://smartech.gatech.edu/handle/1853/58786 (accessed October 5, 2017).



[38] C. Zhang, Y. Xu, et al., A fuzzy neural network based dynamic data allocation model on heterogeneous multi-GPUs for large-scale computations, Int. J. Autom. Comput. 15 (2), 2018, 181-193, https://doi.org/10.1007/s11633-018-1120-4.
[39] K.K. Shukla and A.K. Tiwari, Filter banks and DWT, SpringerBriefs Comput. Sci., 2013, Springer, 21-36, https://doi.org/10.1007/978-1-4471-4941-5_2.
[40] X.Q. Jiang, L. Blunt, et al., Development of a lifting wavelet representation for surface characterization, Proc. R. Soc. A Math. Phys. Eng. Sci. 456 (2001), 2000, 2283-2313, https://doi.org/10.1098/rspa.2000.0613.
[41] M.E. Angelopoulou, K. Masselos, et al., Implementation and comparison of the 5/3 lifting 2D discrete wavelet transform computation schedules on FPGAs, J. Signal Process. Syst. 51 (1), 2008, 3-21, https://doi.org/10.1007/s11265-007-0139-5.
[42] C. Tenllado, J. Setoain, et al., Parallel implementation of the 2D discrete wavelet transform on graphics processing units: filter bank versus lifting, IEEE Trans. Parallel Distrib. Syst. 19 (3), 2008, 299-310, https://doi.org/10.1109/TPDS.2007.70716.
[43] W. Nicholas, The CUDA Handbook: A Comprehensive Guide to GPU Programming, 2013.
[44] Y. LeCun, Y. Bengio, et al., Deep learning, Nature 521 (7553), 2015, 436-444, https://doi.org/10.1038/nature14539.
[45] T. Ikuzawa, F. Ino, et al., Reducing memory usage by the lifting-based discrete wavelet transform with a unified buffer on a GPU, J. Parallel Distrib. Comput. 93-94, 2016, 44-55, https://doi.org/10.1016/j.jpdc.2016.03.010.
[46] NVIDIA, NVIDIA GeForce GTX 750 Ti, Whitepaper, NVIDIA Corporation, 2014.

Highlights

• A new parallel computation model for LWT named LBB was devised to tackle the limitation of the current shared memory bottleneck on modern GPUs.
• The improved lifting scheme has reduced approximately half of the synchronizations needed when computing the lifting wavelets in a single CUDA thread.
• Three optimization strategies were benchmarked for the GPCF-LWT implementation.
• The multi-GPU based GPCF-LWT supports online large-scale measured data filtration through an improved FNN based dynamic load balancing (DLB) model.

