


Sandia National Laboratories is a multi-mission laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Large-Scale Data Analytics and Its Relationship to Simulation

CMSE Frontiers in Data Science and Computing Workshop
Michigan State University
October 4, 2016

Rob Leland
Vice President, Science & Technology
Chief Technology Officer
Sandia National Laboratories

SAND2016-9893

Outline


§ Some necessary background

§ A charge from the National Strategic Computing Initiative

§ Answers to three key questions
    § Why is increasing coherence between simulation and analytics important?
    § What is really meant by “increasing coherence” between the two?
    § How might coherence be furthered in practice?

§ A unifying vision

Terms and context


§ Simulation
    § Computations to understand physical phenomena or conduct engineering

§ Large Scale Data Analytics (LSDA)
    § Data Analytics = Discovering meaningful patterns in data
    § Large Scale = Requiring leading-edge processing and storage capabilities

§ LSDA is increasing in importance
    § Pervasive
        § Commerce, finance, healthcare, science, engineering, national security, ...
    § Lasting societal significance
        § Internet search, genomics, climate modeling, Higgs particle, ...

§ LSDA is getting “harder”
    § Captured data growing exponentially with time
    § Individual analysis becoming more sophisticated
    § More people examining more data more frequently
    § Aggregate work growing much faster than Moore’s Law

The Economist:

National Strategic Computing Initiative (NSCI)


NSCI Strategic Objectives


§ (1) Accelerating delivery of a capable exascale computing system that integrates hardware and software capability to deliver approximately 100 times the performance of current 10 petaflop systems across a range of applications representing government needs.

§ (2) Increasing coherence between the technology base used for modeling and simulation and that used for data analytic computing.

§ (3) Establishing, over the next 15 years, a viable path forward for future HPC systems even after the limits of current semiconductor technology are reached (the "post-Moore's Law era").

§ (4) Increasing the capacity and capability of an enduring national HPC ecosystem by employing a holistic approach that addresses relevant factors such as networking technology, workflow, downward scaling, foundational algorithms and software, accessibility, and workforce development.

§ (5) Developing an enduring public-private collaboration to ensure that the benefits of the research and development advances are, to the greatest extent, shared between the United States Government and industrial and academic sectors.

Q1: Why is increasing coherence between simulation and analytics important?


§ For simulation
    § HPC simulation must ride on some commodity curve
    § Larger market forces behind analytics
    § Can exploit commodity component technology from analytics

§ For analytics
    § Large Scale Data Analytics problems becoming ever more sophisticated
    § Requiring more coupled methods
    § Can exploit architectural lessons from HPC simulation

§ For both: Integration of simulation and analytics in the same workflow
    § Automation of analysis of data from simulation
    § Creation of synthetic data via simulation to augment analysis
    § Automated generation and testing of hypotheses
    § Exploration of new scientific and technical scenarios
    § ...

Mutual inspiration, technical synergy, and economies of scale in the creation, deployment, and use of HPC resources


A challenge because simulation and analytics differ in many respects…

Data structures describing simulation and analytics differ

Graphs from simulations may be irregular, but have more locality than those derived from analytics.

[Figure: example graphs. Computational simulation of physical phenomena: climate modeling, car crash. Large Scale Data Analytics: Internet connectivity, yeast protein interactions.]

Figures from Leland et al., courtesy of Yelick, LBNL.
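The locality contrast on this slide can be illustrated with a small sketch. It is not taken from the talk: the two graphs are synthetic stand-ins (a 2D mesh for a simulation-style graph, uniformly random edges for an analytics-style graph), and `mean_index_distance` is a hypothetical, deliberately crude proxy for locality that assumes vertices are laid out in memory in index order.

```python
import random


def grid_edges(rows, cols):
    """Edges of a 2D mesh whose vertices are numbered row by row."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if c + 1 < cols:
                edges.append((v, v + 1))      # right-hand neighbour
            if r + 1 < rows:
                edges.append((v, v + cols))   # neighbour in the next row
    return edges


def random_edges(n, m, rng):
    """m edges between uniformly random pairs of distinct vertices.

    Duplicate edges are allowed; this is only a stand-in for a graph
    with no spatial structure.
    """
    edges = []
    while len(edges) < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:
            edges.append((u, v))
    return edges


def mean_index_distance(edges):
    """Average |u - v| over all edges (assumed proxy: smaller = more local)."""
    return sum(abs(u - v) for u, v in edges) / len(edges)


if __name__ == "__main__":
    rng = random.Random(0)
    mesh = grid_edges(100, 100)
    scattered = random_edges(100 * 100, len(mesh), rng)
    print("mesh:  ", mean_index_distance(mesh))
    print("random:", mean_index_distance(scattered))
```

In the mesh, every edge joins vertices at most one row apart in index space, whereas a random edge typically spans a large fraction of the vertex range. That gap is the kind of locality difference the slide describes, under the stated layout assumption.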

Computation and communication patterns differ

[Figure legend: Black = time spent computing; Green = time spent communicating; White = time spent waiting for data to be communicated. Panels are labeled Simulation and Analytics.]

§ The U.S. road map, which has spatial locality and is thus the most similar of the three in structure to computational patterns that would arise in typical physical simulations.
§ The Erdős-Rényi graph, a well-studied example in graph theory work.
§ A scale-free graph, an example more reflective of real-world networks.

Figure from Leland et al., courtesy of Johnson, PNNL.
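To make the difference between the Erdős-Rényi and scale-free examples concrete, here is a minimal sketch that generates one graph of each kind and compares their degree spread. The generators are assumptions for illustration: `erdos_renyi` is the standard G(n, p) model, and `preferential_attachment` is a Barabási-Albert-style growth process used here as one common way to obtain a scale-free-like graph. The slide does not say how its graphs were produced.

```python
import random
from collections import Counter


def erdos_renyi(n, p, rng):
    """G(n, p): each possible undirected edge appears independently with probability p."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


def preferential_attachment(n, m, rng):
    """Grow a graph in which each new vertex links to m distinct existing
    vertices, chosen with probability proportional to current degree."""
    # Seed with a clique on m + 1 vertices so every vertex has nonzero degree.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Each vertex appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over vertices.
    endpoints = [x for e in edges for x in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


def degrees(n, edges):
    """Degree of every vertex 0..n-1."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return [deg[v] for v in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    n, m = 1000, 3
    sf = preferential_attachment(n, m, rng)
    mean_degree = 2 * len(sf) / n
    er = erdos_renyi(n, mean_degree / (n - 1), rng)  # matched expected mean degree
    print("max degree, Erdős-Rényi:", max(degrees(n, er)))
    print("max degree, scale-free: ", max(degrees(n, sf)))
```

With matched mean degree, the preferential-attachment graph develops a few very high-degree hubs while the Erdős-Rényi degrees stay close to the mean. Hubs of this kind are one plausible source of the load imbalance and waiting time shown in the figure, though the talk itself does not state the cause.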

Memory performance demands differ

A key differentiator in the performance of simulation and analytics

[Figure: Area of the circle = relative data intensiveness (i.e., total amount of unique data accessed over a fixed interval of instructions). Points are grouped as Simulation and Analytics.]

Standard benchmarks include:
• LINPACK (smallest data intensiveness; barely visible on graph)
• STREAM
• SPECfp
• SPECint

Figure from Murphy & Kogge, with adjustment to double the radius of the LINPACK data point to make it visible.
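The data intensiveness metric defined above (unique data accessed over a fixed interval of instructions) can be sketched as follows. This is an illustration under stated assumptions, not the measurement procedure of Murphy & Kogge: the trace format (one word address per instruction), the 8-byte word size, the window length, and both synthetic traces are hypothetical.

```python
import random


def data_intensiveness(trace, window, word_bytes=8):
    """Mean number of unique bytes touched per `window` consecutive accesses.

    `trace` is a list of word addresses; each entry is assumed to stand for
    one instruction that touches one word. A trailing partial window is ignored.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    per_window = []
    for start in range(0, len(trace) - window + 1, window):
        unique_words = set(trace[start:start + window])
        per_window.append(len(unique_words) * word_bytes)
    return sum(per_window) / len(per_window) if per_window else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical trace 1: a kernel that keeps reusing a 64-word block.
    blocked = [i % 64 for i in range(10_000)]
    # Hypothetical trace 2: scattered lookups into a million-word table.
    scattered = [rng.randrange(10**6) for _ in range(10_000)]
    print("blocked:  ", data_intensiveness(blocked, window=1000))
    print("scattered:", data_intensiveness(scattered, window=1000))
```

A trace that keeps revisiting a small working set scores low, and one that touches new data on almost every access scores high. That is the sense in which the circle areas in the figure separate cache-friendly benchmarks from data-intensive codes.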

Application code characteristics differ

Contrasting properties:

Application code property | Simulation | Analytics
Spatial locality | High | Low
Temporal locality | Moderate | Low
Memory footprint | Moderate | High
Computation type | May be floating-point dominated* | Integer intensive
Input-output orientation | Output dominated | Input dominated

* Increasingly, simulation work has become less floating-point dominated

Q2: So what is meant by “increasing coherence” between simulation and analytics?


§ NOT one system ostensibly optimized for both simulation and analytics

§ Greater commonality in underlying componentry and design principles

§ Greater interoperability, allowing interleaving of both types of computations

…A more common hardware and software roadmap between simulation and analytics


And yet, there is hope…

Simulation and analytics are evolving to become more similar in their architectural needs


§ Current challenges for the LSDA community (…similar to HPC simulation)
    § Data movement
    § Power consumption
    § Memory/interconnect bandwidth
    § Scaling efficiency

§ Instruction mix for Sandia’s HPC engineering codes (…similar to LSDA)
    § Memory operations: 40%
    § Integer operations: 40%
    § Floating point: 10%
    § Other: 10%

§ Common design impacts of energy cost trends
    § Increased concurrency (processing threads, cores, memory depth)
    § Increased complexity and burden on system software, languages, tools, runtime support, codes

Energy cost of moving data is becoming dominant

[Figure: Energy cost for various common operations. Axis labels: “Energy cost, in picojoules (pJ), per 64-bit floating-point operation” and “Cost estimates for technology year”.]

From Dan McMorrow, Technical Challenges of Exascale Computing, JSR-12-310, JASON, MITRE Corporation, April 2013.

Emerging architectural and system software synergies

Similar needs:

Architectural characteristic | Simulation | Analytics
Computation | Memory address generation dominated | Same
Primary memory | Low power, high bandwidth, semi-random access | Same
Secondary memory | Emerging technologies may offset cost, allowing much more memory | …require extremely large memory spaces
Storage | Integration of another layer of memory hierarchy to support checkpoint/restart | …to support out-of-core data set access
Interconnect technology | High bisection bandwidth (for relatively coarse-grained access) | …(for fine-grained access)
System software (node-level) | Low dependence on system services, increasingly adaptive, resource management for structured parallelism | …highly adaptive, resource management for unstructured parallelism
System software (system-level) | Increasingly irregular workflows | Irregular workflows

Q3: How might coherence be furthered in practice?


§ Making it an element of national strategy
    § Check via the NSCI

§ Building this into exascale computing efforts
    § Also a component of the NSCI

§ Communicating with and enlisting the technical communities concerned
    § This forum and similar events

§ Further developing the vision
    § Today’s dialogue session!

A unifying vision for simulation and analytics

From The Fourth Paradigm: Data-Intensive Scientific Discovery by Jim Gray

Data analysis complements theory, experiment, and computation

Acknowledgements


Additional references


§ The Economist, “Data, Data, Everywhere,” Feb 25th, 2010.

§ R. C. Murphy and P. M. Kogge, “On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications,” IEEE Transactions on Computers 56(7), July 2007, 937–945.

§ R. Murphy, “Power Issues,” presentation to JASON 2012, June 2012.

§ Peter Kogge (editor) et al., ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. DARPA, 2008.

§ Dan McMorrow, Technical Challenges of Exascale Computing, JSR-12-310, JASON, MITRE Corporation, April 2013.

§ Tony Hey, Stewart Tansley, and Kristin Tolle (editors), The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009.

§ Jim Gray, The Fourth Paradigm: Data-Intensive Scientific Discovery.

Suggested questions for breakout dialogue


§ Why would increasing the coherence between the technology base used for simulation and that for analytics bring value in the context of your work?

§ What research and development would best support development of a more common component roadmap and design principles bridging simulation and analytics?

§ How would this research be best organized?


Supplementary Material

Graph matching example of data analytics

A key analytic primitive: used to find a specific instance of an abstract pattern of interest

From Coffman, Greenblatt, and Marcus, Graph-Based Technologies for Intelligence Analysis, Communications of the ACM, 47, March 2004.
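As a rough illustration of the graph matching primitive, the sketch below searches a small labeled data graph for instances of an abstract pattern using plain backtracking. It is not the method of Coffman, Greenblatt, and Marcus. The function `find_matches`, the node labels, and the toy "two people sharing one account" pattern are all hypothetical, and a search this simple would not scale to the graph sizes discussed in the talk.

```python
def find_matches(pattern_labels, pattern_edges, data_labels, data_edges):
    """Yield every mapping of pattern nodes onto distinct data nodes such that
    labels agree and each pattern edge is present in the data graph.

    Graphs are undirected: labels are {node: label} dicts, edges are pairs.
    """
    data_adj = {v: set() for v in data_labels}
    for u, v in data_edges:
        data_adj[u].add(v)
        data_adj[v].add(u)
    pat_adj = {v: set() for v in pattern_labels}
    for u, v in pattern_edges:
        pat_adj[u].add(v)
        pat_adj[v].add(u)
    order = list(pattern_labels)  # pattern nodes are assigned in this order

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)  # complete assignment: report a copy
            return
        p = order[len(mapping)]
        for d in data_labels:
            if d in mapping.values() or data_labels[d] != pattern_labels[p]:
                continue
            # Every already-assigned pattern neighbour of p must map to a
            # data neighbour of d; otherwise a pattern edge would be missing.
            if all(mapping[q] in data_adj[d] for q in pat_adj[p] if q in mapping):
                mapping[p] = d
                yield from extend(mapping)
                del mapping[p]  # backtrack

    yield from extend({})


# Hypothetical pattern: two people linked to the same account.
pattern_labels = {"p1": "person", "p2": "person", "a": "account"}
pattern_edges = [("p1", "a"), ("p2", "a")]

# Hypothetical data graph.
data_labels = {"alice": "person", "bob": "person", "carol": "person",
               "acct1": "account", "acct2": "account"}
data_edges = [("alice", "acct1"), ("bob", "acct1"), ("carol", "acct2")]

matches = list(find_matches(pattern_labels, pattern_edges, data_labels, data_edges))
```

Because the pattern is symmetric in its two person nodes, each real instance is reported once per symmetric assignment. A practical matcher would normally deduplicate these and prune the search far more aggressively.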
