Upload
others
View
11
Download
0
Embed Size (px)
Citation preview
Data Management for Data Science
Master of Science in Data Science
Facoltà di Ing. dell'Informazione, Informatica e Statistica Sapienza Università di Roma
AA 2018/2019
Domenico Lembo Dipartimento di Ingegneria Informatica,
Automatica e Gestionale A. Ruberti
An Introduction to Big Data
AvailabilityofMassiveData
• Digitaldataarenowadayscollectedatanunprecedentscaleandinverymanyformatsinavarietyofdomains(e-commerce,socialnetworks,sensornetworks,astronomy,genomics,medicalrecords,etc.)
• Thisishasbeenmadepossiblebytheincrediblegrowthinrecentyearsofthecapacityofdatastoragetoolsandofthecomputingpowerofelectronicdevices,aswellasbytheadventofmobileandpervasivecomputing,cloudcomputing,andcloudstorage.
ExploitabilityofMassiveData
• Howtotransformavailabledataintoinformation,andhowtomakeorganizations’businesstotakeadvantagesofsuchinformationarelong-standingproblemsinIT,andinparticularininformationmanagementandanalysis.
• Theseissueshavebecomemoreandmorechallengingandcomplexinthe“BigData”era
• Atthesametime,facingthechallengecanbeevenmoreworthythaninthepast,sincethemassiveamountofdatathatisnowavailablemayallowforanalyticalresultsneverachievedbefore
Becareful!• “Bigdataisavaguetermforamassive
phenomenonthathasrapidlybecomeanobsessionwithentrepreneurs,scientists,governmentsandthemedia”(TimHarford,journalistandeconomist,March,2014)*
*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu
Moore'sLawfor#BigData:Theamountofnonsensepackedintotheterm"BigData"doublesapproximatelyeverytwoyears(MikePluta,DataArchitect,onTwitterAugust2014).https://twitter.com/mikepluta/status/502878691740090369
TheGoogleFluTrends*
• 2008:Googlepeoplepublish“Detectinginfluenzaepidemicsusingsearchenginequerydata”onnature(https://www.nature.com/articles/nature07634).
• TheywereabletotrackthespreadofinfluenzaacrosstheUSmorequicklythantheUSCentersforDiseaseControlandPrevention(CDC).
• Thetrackingwasessentiallybasedoncorrelationbetweenwhatpeoplesearchedforonlineandwhethertheyhadflusymptoms.
• Fouryearslater,withasimilarexperimentsGooglepeopleoverstimatedthespreadofinfluenzabyalmostafactoroftwo!
• “theory-freeanalysisofmerecorrelationsisinevitablyfragileifyouhavenoideawhatisbehindacorrelation”.
*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu
CorrelationvsCausality!
UnderstandingBigDataisinfactdifficult!
“Therearealotofsmalldataproblemsthatoccurinbigdata.Theydon’tdisappearbecauseyou’vegotlotsofthestuff.Theygetworse!”(DavidSpiegelhalter,CambridgeUniversity)
ThinkingBigData*
"BigData"hasleaptrapidlyintooneofthemosthypedtermsinourindustry,yetthehypeshouldnotblindpeopletothefactthatthisisagenuinelyimportantshiftabouttheroleofdataintheworld.Theamount,speed,andvalueofdatasourcesisrapidlyincreasing.Datamanagementhastochangeinfivebroadareas:extractionofdatafromawiderrangeofsources,changestothelogisticsofdatamanagementwithnewdatabaseandintegrationapproaches,theuseofagileprinciplesinrunninganalyticsprojects,anemphasisontechniquesfordatainterpretationtoseparatesignalfromnoise,andtheimportanceofwell-designedvisualizationtomakethatsignalmorecomprehensible.Summingupthismeanswedon'tneedbiganalyticsprojects,insteadwewantthenewdatathinkingtopermeateourregularwork.”
MartinFowler
*http://martinfowler.com/articles/bigData/
ThinkingBigData
• Thus,roughly,BigDataisdatathatexceedstheprocessingcapacityofconventionaldatabasesystems
• ButalsoBigDataisunderstoodasacapabilitythatallowscompaniestoextractvaluefromlargevolumesofdata
• but,notice,thisdoesnotmeanonlyextremelylarge,massivedatabases
• Besidesdatadimension,whatcharacterizesBigDataarealsotheheterogeneityinthewayinwhichinformationisstructured,thedynamicitywithwhichdatachanges,andtheabilityofquicklyprocessingit
• Thiscallsfornewcomputingparadigmsorframeworks,notonlyadvanceddatastoragemechanisms
TheThreeVs
TocharacterizeBigData,threeVsareused,whicharetheVsof
– Volume
– Velocity– Variety
Volume• Bigdataapplicationsarecharacterizedofcoursebybigamountsofdata,
wherebigmeansextremelylarge,e.g.,morethanaterabyte(TB)orpetabyte(PB),ormore.
• Therearevariouscontextsinwhichthesedimensionscanbeeasilyreached:chattersfromsocialnetworks,webserverlogs,trafficflowsensors,satelliteimagery,broadcastaudiostreams,bankingtransactions,GPStrails,financialmarketdata,biologicaldata,etc.
• Somemoreconcreteexamples:– DespitesomeYoutubestatisticsareavailable1thetotalstoragecapacity
ofYoutubeit’snotknown,butrealisticallyitshouldbenolessthan1EB(2016)
– NSAdatacenter:estimatedstoragecapacityofatleast2,000PBs(2013)2
– Facebook:300PBdatawarehouse(2014)31http://web.archive.org/web/20150217015601/http://www.youtube.com/yt/press/statistics.html2http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/#1b66cc7c3cd23https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
Volume• Howmanydataintheworld?
AccordingtoIDC(InternationalDataCorporation):– 800Terabytes,2000– 160Exabytes,2006(1EB=1018B)– 500Exabytes,2009– 1.8Zettabytes,2011(1ZB=1021B)1– 2.8Zettabytes,20121– 4.4Zettabytes,2013– 175Zettabytesby20252(estimate)
1http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html2https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf3https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
Around90%ofworld’sdatageneratedinthelast4years.
Thedigitaluniverseisdoublinginsize
everytwoyears3
Multipleofbitsorbytes
Symbol Name Decimalvalue
BinaryValue
k kilo 1000 1024
M mega 10002 10242
G giga 10003 10243
T tera 10004 10244
P peta 10005 10245
E exa 10006 10246
Z zetta 10007 10247
Y yotta 10008 10248
Volume
• Thesheervolumeofdataisenoughtodefeatmanylong-followedapproachestodatamanagement
• Traditionalcentralizeddatabasesystemscannothandlemanyofthedatavolumes,forcingtheuseofclusters
• Datahavetobenecessarilydistributed,andthenumberofsourcesprovidinginformationcanbehuge,muchhigherthanthenumberconsideredintraditionaldataintegrationandvirtualizationsystems
Velocity
• Data’svelocity(i.e.,therateatwhichdataiscollectedandmadeavailableintoanorganization)hasfollowedasimilarpatterntothatofvolume
• Manydatasourcesaccessedbyorganizationsfortheirbusinessareextremelydynamic
• Mobiledevicesincreasetherateofdatainflow:data“everywhere”,collectedandconsumedcontinuously
Velocity• Someexamples:
– Walmart:1milliontransactionperhour(2010)1– eBay:datathroughputreaches100PBsperday(2013)2– Googleprocesses100PBsperday(2013-14)3– Facebook:600TBaddedtothewarehouseeveryday(2014)4– 6000-8000tweetspersecondeveryday(in2019)5
• In2013,ithasbeenestimatedthateveryminuteofeverydaywecreated6:- Morethan204millionemailmessages- 571newWebsitesand347blogpostscreated- 72hoursofnewYouTubevideos- 1.8millionsoflikeonFacebook- 216.000newphotosoninstagram- $83.000spentonAmazon
1http://martinfowler.com/articles/bigData/2http://www.v3.co.uk/v3-uk/news/2302017/ebay-using-big-data-analytics-to-drive-up-price-listings3http://www.slideshare.net/kmstechnology/big-data-overview-2013-20144https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/5http://www.internetlivestats.com/twitter-statistics/6http://www.dailymail.co.uk/sciencetech/article-2381188/Revealed-happens-just-ONE-minute-internet-216-000-photos-posted-278-000-Tweets-1-8m-Facebook-likes.html
Velocity• Processinginformationassoonasitisavailable,thusspeedingthe
“feedbackloop”,canprovidecompetitiveadvantages• SomeexamplesofFastDataProcessing:
– CustomerExperience/Retail:onlineretailsthatareabletosuggestadditionalproductstoacustomerateverynewinformationinsertedduringanon-linepurchase(Click-streamanalysis)
– FinancialServicesIndustry:Algorithmictradingusingeventprocesstechnologybutalsoreal-timedataintegrationandanalytics
– Telecommunication:understandallocationofnetworkresourcesbasedontrafficandapplicationrequirements,networkusagepatterns
– Energy:real-timeprocessofhighvolumeofeventstomakeimportantdecisionsinordertoeffectivelyandefficientlymanagepossiblefaultsonthedistributionnetwork
– Manufacturing:analyzereal-timemetricstotakecorrectiveactionbeforeafailureoccurs
Velocity
• Streamprocessingisanewchallengingcomputingparadigm,whereinformationisnotstoredforlaterbatchprocessing,butisconsumedonthefly
• Thisisparticularlyusefulwhendataaretoofasttostorethementirely(forexamplebecausetheyneedsomeprocessingtobestoredproperly),asinscientificapplications,orwhentheapplicationrequiresanimmediateanswer
Variety
• Dataisextremelyheterogeneous:e.g.,intheformatinwhicharerepresented,butalsoandinthewaytheyrepresentinformation,bothattheintensionalandextensionallevel
• E.g.,textfromsocialnetworks,sensordata,logsfromwebapplications,databases,XMLdocuments,RDFdata,etc.
• Dataformatrangesthereforefromstructured(e.g,relationaldatabases)tosemistructured(e.g.,XMLdocuments),tounstructured(e.g.,textdocuments)
Variety
• Asforunstructureddata,forexample,thechallengeistoextractmeaningforconsumptionbothbyhumansormachines
• Entityresolution,whichistheprocessthatresolves,i.e.,identifies,entitiesanddetectsrelationships,thenplaysanimportantrole
• Infact,thesearewell-knownissuesstudiedsinceseveralyearsinthefieldsofdataintegration,dataexchange,anddataquality.IntheBigDatascenario,however,theybecomeevenmorechallenging
AfourthV:Veracity*
• Dataareofwidelydifferentquality
• Traditionallydataisthoughtofascomingfromwellorganizeddatabaseswithcontrolledschemas
• Instead,in“BigData”thereisoftenlittleornoschematocontroltheirstructure
• Theresultisthatthereareseriousproblemswiththequalityofthedata
*TheliteratureoftenmentionsonlythreeVsanddoesnotincludeveracity.
HoweversomeauthorstendtoincludeveracityasacorecharacteristcofBigData(alternatively,veracityisconsideredanaspectofvariety)
BigData:V3+Value
BigDatacangeneratehugecompetitiveadvantages!
ThevalueofDatafororganizations
• Althoughit'sdifficulttogethardfiguresonthevalueofmakingfulluseofyourdata,muchofthesuccessofcompaniessuchasAmazonandGoogleiscreditedtotheireffectiveuseofdata1
• Thuscompaniesspendlargeamountsofmoneytoreachthiseffectiveuse:AccordingtoIDC,in2017bigdataandanalyticssoftwaremarketreached$54.1billionwordlwide,anditisexpectedtogrowatafive-yearCAGR(compoundannualgrowthrate)of11.2%.(analysis2018-2022)2
• ThusvariousBigDatasolutionsarenowpromotedbyallmajorvendorsindatamanagementsystems
1http://martinfowler.com/articles/bigData/2https://www.idc.com/getdoc.jsp?containerId=US44243318
Potentialvalue
Demandfornewdatamanagementsolutions*
• Inthescenarioswedepicteditisnotsurprisingthatnewdatamangementsolutionsaredemanded
• Indeed,despitethepopularityandwellunderstoodnatureofrelationaldatabases,itisnotthecasethattheyshouldalwaysbethedestinationfordata
• Dependingonthecharacteristicofdata,certainclassesofdatabasesaremoresuitedthanothersfortheirmanagement
• XMLdocumentsaremoreversatilewhenstoredindedicatedXMLstoragesystems(e.g.,MarkLogic)
• SocialnetworkrelationsaregraphbynatureandgraphdatabasessuchasNeo4Jcanmakeoperationsonthemsimplerandmoreefficient
*From:EddDumbill.WhatisBigdata.InPlanningforBigData.O’ReillyRadarTeam
Demandfornewdatamanagementsolutions*
• Adisadvantageoftherelationaldatabaseisthestaticnatureofitsschema
• Inanagileenvironment,theresultsofcomputationwillevolvewiththedetectionandextractionofnewinformation
• Semi-structuredNoSQLdatabasesmeetthisneedforflexibility:theyprovidesomestructuretoorganizedata(enoughforcertainapplications),butdonotrequiretheexactschemaofthedatabeforestoringit
*From:EddDumbill.WhatisBigdata.InPlanningforBigData.O’ReillyRadarTeam
NoSQLdatabases*
Orbetter…notonlySQL• Theterm"NoSQL"isveryill-defined.It'sgenerallyappliedtoa
numberofnon-relationaldatabasessuchasCassandra,Mongo,Dynamo,Neo4J,Riak,andmanyothers
• Theyembraceschemalessdata,runonclusters,andhavetheabilitytotradeofftraditionalconsistencyforotherusefulproperties
• AdvocatesofNoSQLdatabasesclaimthattheycanbuildsystemsthataremoreperformant,scalemuchbetter,andareeasiertoprogramwith
*From:MartinFowler.NoSQLDistilled.Preface.(http://martinfowler.com/books/nosql.html)
Graphdatabases
Key-valuesdatabases
Documentdatabases
ColumnFamilyDatabases
NoSQLdatabases*
• Isthisthefirstrattleofthedeathknellforrelationaldatabases,oryetanotherpretendertothethrone?Ouranswertothatis"neither"
• Relationaldatabasesareapowerfultoolthatweexpecttobeusingformanymoredecades,butwedoseeaprofoundchangeinthatrelationaldatabaseswon'tbetheonlydatabasesinuse
• OurviewisthatweareenteringaworldofPolyglotPersistencewhereenterprises,andevenindividualapplications,usemultipletechnologiesfordatamanagement
*From:MartinFowler.NoSQLDistilled.Preface.(http://martinfowler.com/books/nosql.html)
Multipletechnologiesfordatamanagement
Asanexercise,letusaskgooglewhichisthedatabaseengineusedbyFacebook.Wegetthefollowingtools1:• MySQLascoredatabaseengine(infactacustomizedversion
ofMySQL,highlyoptimizedanddistributed)2• Cassandra(anApacheopensourcefaulttolerantdistributed
NoSQLDBMS,originallydevelopedatFacebookitself)asdatabasefortheInobxmailsearch
• Memcached,amemorycachingsystemtospeedupdynamicdatabasedrivenwebsites
• HayStack,forstorageandmanagementofphotos• Hive,anopensource,peta-bytescaledatawarehousing
frameworkbasedonHadoop,foranalytics,andalsoPresto,anexabytescaledatawarehouse3
1https://www.techworm.net/2013/05/what-database-actually-facebook-uses.html
2http://www.datacenterknowledge.com/data-center-faqs/facebook-data-center-faq-page-23http://prestodb.io/
DataWarehouse• Adatawarehouseisadatabaseusedforreportinganddata
analysis.Itisacentralrepositoryofdatawhichiscreatedbyintegratingdatafromoneormoredisparatesources
• AccordingtoInmon*,adatawarehouseis:– Subject-oriented:Thedatainthedatawarehouseisorganizedsothat
allthedataelementsrelatingtothesamereal-worldeventorobjectarelinkedtogether
– Non-volatile:Datainthedatawarehouseareneverover-writtenordeletedoncecommitted,thedataarestatic,read-only,andretainedforfuturereporting
– Integrated:Thedatawarehousecontainsdatafrommostorallofanorganization'soperationalsystemsandthesedataaremadeconsistent
– Time-variant:Foranoperationalsystem,thestoreddatacontainsthecurrentvalue.Thedatawarehouse,however,containsthehistoryofdatavalues
*Inmon,Bill(1992).BuildingtheDataWarehouse.Wiley
DataWarehousevs.BigData• AreDataWarehouses(DWs)underthehatofBigData?
• Thenotionofdatawarehousingdatesbacktotheendof80s,andverymanydatawarehouseandbusinessintelligencesolutionshavebeenproposedsincethen
• BTW,BigDataandDWshavemanypointsincommon,atleastw.r.t.– Volume:datawarehousesstorelargeamountsofdata,– Variety:atleastinprinciple,datawarehousesintegrate
heterogeneousinformation– Veracity:datawarehosesusuallyareequippedwithdatacleaning
solutions,appliedintheso-calledextract-transformation-load(ETL)phase
DataWarehousevs.BigData
• Existingenterprisedatawarehousesandrelationaldatabasesexcelatprocessingstructureddata,andcanstoremassiveamountsofdata,thoughatcost
• However,thisrequirementforstructureimposesaninertiathatmakesdatawarehousesunsuitedforagileexplorationofmassiveheterogenousdata
• Theamountofeffortrequiredtowarehousedataoftenmeansthatvaluabledatasourcesinorganizationsarenevermined
• Therefore,newcomputingmodelsandframeworksareneededtomakenewDWsolutionscompliantwiththeBigDataecosystem.
MapReduce
• MapReduceisaprogrammingframeworkforparallelizingcomputation
• OriginallydefinedatGoogle
• Next,therehavebeenvariousimplementations
• Awell-knownopensourcedistributionisApacheHadoop
MapReduce
AMapReduceprogramisconstitutedbytwocomponents• Map()procedure(themapper)thatperformsfilteringand
sorting(itdecomposestheproblemintoparallelizablesubproblems)
• Reduce()procedure(thereducer)devotedtosolvesubproblems
TheMapReduceFrameworkmanagesdistributedservers,whichexecutethevarioussubtasksinparallel,andcontrolscommunicationanddatatransfersbetweenthevariousservers,aswellasguaranteesfaulttoleranceanddisasterrecovery.