Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

BigDatasoftwareinfrastructuresInstitutefor ComputerScienceandControl,Hungarian Academy ofSciences (MTASZTAKI)

GáborHermannZoltánZvaraAndrásBenczúrInformatics Laboratory„BigData– Momemtum”research group

MTASZTAKI,10/11/2016

2

Outline

n Motivationfor“BigData”n “BigData”isaboutsoftwareinfrastructuren Batchandstreamingapproachesn WhatwedoatSZTAKIn Solvingaproblem:handlingdataskew

3

Somedataaboutdata

n Google was guesstimated to have over1Mmachines in2012

n LargeHadronCollider(LHC)collisionsgeneratedabout75petabytesinpast3 years

n Facebook31.25millionmessageseveryminute(2015)

4

TheProjectTriangle

n Smallmachinetostoreandprocessdata:slown Largeserversarefastbutcostlyn Distributedcomputing?

5

Problemwithdistributeddataprocessing

n Whenfailureshappen…n Leftside will not know aboutnew data on right

n Immediate response fromleftsidemightgiveincorrectanswer

n Iffailurescanhappen(partitioned)wecaneitherchoosecorrectness(consistence)orfastresponse(availability)

6

CAP(Fox&Brewer)Theorem

C

A P

Theorem: You may choose two of C-A-PConsistency(Good)

Availability(Fast)

Partition-resilience(Cheap)AP: some replicas may give

erroneous answer

7

Failuresdohappenindistributeddataprocessing!

n WemusthaveP,needtochoosebetweenAandC

n Fastresponsevs.correctresultsn Mostapplicationsneedfastresponsen Bestwecando:eventualconsistency

(ifconnectionresumesanddatacanbeexchanged)

n „BigData” today ismostly about softwareinfrastructuren Tryingtodothebest

8

Approach1:batchprocessing

n Processthewholedatasetn Consistent,buttakestime(hours,days)n Iffailurehappens,waitforrecovery(choosingCP)

n ApacheHadoopn MapReduce,HDFS(distributedfilesystem)n Datainchunksacrossmanymachines,replicatedn Bringcomputationclosetothedata

9

BeyondHadoop andMapReduce (batch)

n MapReduce hasthefirstopensourcedistributedsoftware,Hadoopn Limitationsn Joinandmorecomplexprimitivesn Graphs,machinelearning

n Alternativesn Graphprocessing:ApacheGiraph,ApacheHAMA,…n Inmemorybased:ApacheSparkn Streamingdataflowengine:ApacheFlink

10

Approach2:streamprocessing

n Continuouslyprocessallincomingdatan Fasterresponsetime(low-latency,within1sec)n Iffailure:waitforrecoveryn StillchoosingPC,butsophisticatedrecoverymechanismsgive

lowerdowntime

n Hardertoimplementandreasonaboutn Streamprocessingframeworksn ApacheFlink,ApacheStorm, ApacheSpark

11

STREAMLINEH2020

NewinitiativeontopofApacheFlinkA general data processing framework to unify batch and stream processing

At SZTAKI: Machine Learning

n DFKI(DE)n SICS(SE)n PortugalTelecom(PT)n InternetMemory(FR)n Rovio(FI)n SZTAKI(HU) B.– Volker Markl (TUBerlin)

12

Flink

Historic data

Kafka, RabbitMQ, ...

HDFS, JDBC, ...

ETL, Graphs,Machine LearningRelational, …

Low latencywindowing, aggregations, ...

Event logs

Real-time data streams

Batchandstream:same execution engineAn engine that puts equal emphasis to

streaming and batch

13

Whatwedofor“BigData”atSZTAKI

n DevelopingApacheFlink (STREAMLINEH2020)n MachineLearningalgorithms,experimenting

n Projectswithindustrialpartnersn UsingSpark,Flink,Cassandra,Hadoop etc.

n Researchn Improvementsoncurrentsystemsn Ongoingproject:handlingdataskew

14

Solvingaproblem:handlingdataskew

n Wehavedevelopedanapplicationaggregatingtelco datan Afterawhile,onrealdatasetitcouldbecomesloworeven

crashn Investigatedtheproblem:dataskewn 80%ofthetrafficgeneratedby20%ofthecommunication

towers

15

Theproblem

n Defaulthashingisnotgoingtodistributethedatauniformly

n Datadistributionisnotknowninadvancen Theheavykeysmightevenchange

stageboundary & shuffle

even partitioning skewed data

slow task

slow task

16

Oursolution:DynamicRepartitioning

n Monitoringtasks,repartitioningbasedonthatdatan Systemawaren Nosignificantoverhead

n Canhandlearbitrarydatadistributionsn Doeseverythingon-the-flyn Worksforstreamingandbatch

n Pluggablen InitiallyonSparkbatchandstreamingn PluggedintoFlink streaming

17

Execution visualization ofSpark jobs

18

Futureinhandlingdataskew

n Generalizingtheproblemn Balancingloadin aprocessingsystemn Balancingresourcesbetween processingsystemsona

cluster(YARN)

19

Conclusion

n Wecantame“BigData”withbettersoftwaren Lottodo…n Connectionbetweenbatchandstreamingn Highlyscalablemachinelearning(batchandonlineboth)n Optimizations(e.g.handlingdataskew)n Betterunderstandingourtools(e.g.visualization)n ...

n Evolvingfast,butwecantakepartinit

20

Questions?

n Youcanreachus!n [email protected] [email protected] [email protected]

21

References

n STREAMLINEH2020n https://streamline.sics.se/

n DynamicRepartitioningn https://spark-summit.org/2016/events/handling-data-skew-

adaptively-in-spark-using-dynamic-repartitioning/

n Visualizationsn http://flink-forward.org/kb_sessions/advanced-visualization-of-

flink-and-spark-jobs/

22

References(continued)

n ApacheHadoopn https://hadoop.apache.org/

n ApacheFlinkn https://flink.apache.org/

n ApacheSparkn https://spark.apache.org/

23

References(continued)

n E.A.Brewer.Towardsrobustdistributedsystems,2000n J.DeanandS.Ghemawat.MapReduce:SimplifiedData

ProcessingonLargeClusters,2004n N.Marz.HowtobeattheCAPtheorem,2011n http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

Documents

Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache