Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
BigDatasoftwareinfrastructuresInstitutefor ComputerScienceandControl,Hungarian Academy ofSciences (MTASZTAKI)
GáborHermannZoltánZvaraAndrásBenczúrInformatics Laboratory„BigData– Momemtum”research group
MTASZTAKI,10/11/2016
2
Outline
n Motivationfor“BigData”n “BigData”isaboutsoftwareinfrastructuren Batchandstreamingapproachesn WhatwedoatSZTAKIn Solvingaproblem:handlingdataskew
3
Somedataaboutdata
n Google was guesstimated to have over1Mmachines in2012
n LargeHadronCollider(LHC)collisionsgeneratedabout75petabytesinpast3 years
n Facebook31.25millionmessageseveryminute(2015)
4
TheProjectTriangle
n Smallmachinetostoreandprocessdata:slown Largeserversarefastbutcostlyn Distributedcomputing?
5
Problemwithdistributeddataprocessing
n Whenfailureshappen…n Leftside will not know aboutnew data on right
n Immediate response fromleftsidemightgiveincorrectanswer
n Iffailurescanhappen(partitioned)wecaneitherchoosecorrectness(consistence)orfastresponse(availability)
6
CAP(Fox&Brewer)Theorem
C
A P
Theorem: You may choose two of C-A-PConsistency(Good)
Availability(Fast)
Partition-resilience(Cheap)AP: some replicas may give
erroneous answer
7
Failuresdohappenindistributeddataprocessing!
n WemusthaveP,needtochoosebetweenAandC
n Fastresponsevs.correctresultsn Mostapplicationsneedfastresponsen Bestwecando:eventualconsistency
(ifconnectionresumesanddatacanbeexchanged)
n „BigData” today ismostly about softwareinfrastructuren Tryingtodothebest
8
Approach1:batchprocessing
n Processthewholedatasetn Consistent,buttakestime(hours,days)n Iffailurehappens,waitforrecovery(choosingCP)
n ApacheHadoopn MapReduce,HDFS(distributedfilesystem)n Datainchunksacrossmanymachines,replicatedn Bringcomputationclosetothedata
9
BeyondHadoop andMapReduce (batch)
n MapReduce hasthefirstopensourcedistributedsoftware,Hadoopn Limitationsn Joinandmorecomplexprimitivesn Graphs,machinelearning
n Alternativesn Graphprocessing:ApacheGiraph,ApacheHAMA,…n Inmemorybased:ApacheSparkn Streamingdataflowengine:ApacheFlink
10
Approach2:streamprocessing
n Continuouslyprocessallincomingdatan Fasterresponsetime(low-latency,within1sec)n Iffailure:waitforrecoveryn StillchoosingPC,butsophisticatedrecoverymechanismsgive
lowerdowntime
n Hardertoimplementandreasonaboutn Streamprocessingframeworksn ApacheFlink,ApacheStorm, ApacheSpark
11
STREAMLINEH2020
NewinitiativeontopofApacheFlinkA general data processing framework to unify batch and stream processing
At SZTAKI: Machine Learning
n DFKI(DE)n SICS(SE)n PortugalTelecom(PT)n InternetMemory(FR)n Rovio(FI)n SZTAKI(HU) B.– Volker Markl (TUBerlin)
12
Flink
Historic data
Kafka, RabbitMQ, ...
HDFS, JDBC, ...
ETL, Graphs,Machine LearningRelational, …
Low latencywindowing, aggregations, ...
Event logs
Real-time data streams
Batchandstream:same execution engineAn engine that puts equal emphasis to
streaming and batch
13
Whatwedofor“BigData”atSZTAKI
n DevelopingApacheFlink (STREAMLINEH2020)n MachineLearningalgorithms,experimenting
n Projectswithindustrialpartnersn UsingSpark,Flink,Cassandra,Hadoop etc.
n Researchn Improvementsoncurrentsystemsn Ongoingproject:handlingdataskew
14
Solvingaproblem:handlingdataskew
n Wehavedevelopedanapplicationaggregatingtelco datan Afterawhile,onrealdatasetitcouldbecomesloworeven
crashn Investigatedtheproblem:dataskewn 80%ofthetrafficgeneratedby20%ofthecommunication
towers
15
Theproblem
n Defaulthashingisnotgoingtodistributethedatauniformly
n Datadistributionisnotknowninadvancen Theheavykeysmightevenchange
stageboundary & shuffle
even partitioning skewed data
slow task
slow task
16
Oursolution:DynamicRepartitioning
n Monitoringtasks,repartitioningbasedonthatdatan Systemawaren Nosignificantoverhead
n Canhandlearbitrarydatadistributionsn Doeseverythingon-the-flyn Worksforstreamingandbatch
n Pluggablen InitiallyonSparkbatchandstreamingn PluggedintoFlink streaming
17
Execution visualization ofSpark jobs
18
Futureinhandlingdataskew
n Generalizingtheproblemn Balancingloadin aprocessingsystemn Balancingresourcesbetween processingsystemsona
cluster(YARN)
19
Conclusion
n Wecantame“BigData”withbettersoftwaren Lottodo…n Connectionbetweenbatchandstreamingn Highlyscalablemachinelearning(batchandonlineboth)n Optimizations(e.g.handlingdataskew)n Betterunderstandingourtools(e.g.visualization)n ...
n Evolvingfast,butwecantakepartinit
21
References
n STREAMLINEH2020n https://streamline.sics.se/
n DynamicRepartitioningn https://spark-summit.org/2016/events/handling-data-skew-
adaptively-in-spark-using-dynamic-repartitioning/
n Visualizationsn http://flink-forward.org/kb_sessions/advanced-visualization-of-
flink-and-spark-jobs/
22
References(continued)
n ApacheHadoopn https://hadoop.apache.org/
n ApacheFlinkn https://flink.apache.org/
n ApacheSparkn https://spark.apache.org/
23
References(continued)
n E.A.Brewer.Towardsrobustdistributedsystems,2000n J.DeanandS.Ghemawat.MapReduce:SimplifiedData
ProcessingonLargeClusters,2004n N.Marz.HowtobeattheCAPtheorem,2011n http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html