37
Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione, Informatica e Statistica Sapienza Università di Roma AA 2018/2019 Domenico Lembo Dipartimento di Ingegneria Informatica, Automatica e Gestionale A. Ruberti An Introduction to Big Data

An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Data Management for Data Science

Master of Science in Data Science

Facoltà di Ing. dell'Informazione, Informatica e Statistica Sapienza Università di Roma

AA 2018/2019

Domenico Lembo Dipartimento di Ingegneria Informatica,

Automatica e Gestionale A. Ruberti

An Introduction to Big Data

Page 2: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

AvailabilityofMassiveData

•  Digitaldataarenowadayscollectedatanunprecedentscaleandinverymanyformatsinavarietyofdomains(e-commerce,socialnetworks,sensornetworks,astronomy,genomics,medicalrecords,etc.)

•  Thisishasbeenmadepossiblebytheincrediblegrowthinrecentyearsofthecapacityofdatastoragetoolsandofthecomputingpowerofelectronicdevices,aswellasbytheadventofmobileandpervasivecomputing,cloudcomputing,andcloudstorage.

Page 3: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

ExploitabilityofMassiveData

•  Howtotransformavailabledataintoinformation,andhowtomakeorganizations’businesstotakeadvantagesofsuchinformationarelong-standingproblemsinIT,andinparticularininformationmanagementandanalysis.

•  Theseissueshavebecomemoreandmorechallengingandcomplexinthe“BigData”era

•  Atthesametime,facingthechallengecanbeevenmoreworthythaninthepast,sincethemassiveamountofdatathatisnowavailablemayallowforanalyticalresultsneverachievedbefore

Page 4: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Becareful!•  “Bigdataisavaguetermforamassive

phenomenonthathasrapidlybecomeanobsessionwithentrepreneurs,scientists,governmentsandthemedia”(TimHarford,journalistandeconomist,March,2014)*

*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu

Moore'sLawfor#BigData:Theamountofnonsensepackedintotheterm"BigData"doublesapproximatelyeverytwoyears(MikePluta,DataArchitect,onTwitterAugust2014).https://twitter.com/mikepluta/status/502878691740090369

Page 5: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

TheGoogleFluTrends*

•  2008:Googlepeoplepublish“Detectinginfluenzaepidemicsusingsearchenginequerydata”onnature(https://www.nature.com/articles/nature07634).

•  TheywereabletotrackthespreadofinfluenzaacrosstheUSmorequicklythantheUSCentersforDiseaseControlandPrevention(CDC).

•  Thetrackingwasessentiallybasedoncorrelationbetweenwhatpeoplesearchedforonlineandwhethertheyhadflusymptoms.

•  Fouryearslater,withasimilarexperimentsGooglepeopleoverstimatedthespreadofinfluenzabyalmostafactoroftwo!

•  “theory-freeanalysisofmerecorrelationsisinevitablyfragileifyouhavenoideawhatisbehindacorrelation”.

*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu

Page 6: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

CorrelationvsCausality!

Page 7: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

UnderstandingBigDataisinfactdifficult!

“Therearealotofsmalldataproblemsthatoccurinbigdata.Theydon’tdisappearbecauseyou’vegotlotsofthestuff.Theygetworse!”(DavidSpiegelhalter,CambridgeUniversity)

Page 8: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

ThinkingBigData*

"BigData"hasleaptrapidlyintooneofthemosthypedtermsinourindustry,yetthehypeshouldnotblindpeopletothefactthatthisisagenuinelyimportantshiftabouttheroleofdataintheworld.Theamount,speed,andvalueofdatasourcesisrapidlyincreasing.Datamanagementhastochangeinfivebroadareas:extractionofdatafromawiderrangeofsources,changestothelogisticsofdatamanagementwithnewdatabaseandintegrationapproaches,theuseofagileprinciplesinrunninganalyticsprojects,anemphasisontechniquesfordatainterpretationtoseparatesignalfromnoise,andtheimportanceofwell-designedvisualizationtomakethatsignalmorecomprehensible.Summingupthismeanswedon'tneedbiganalyticsprojects,insteadwewantthenewdatathinkingtopermeateourregularwork.”

MartinFowler

*http://martinfowler.com/articles/bigData/

Page 9: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

ThinkingBigData

•  Thus,roughly,BigDataisdatathatexceedstheprocessingcapacityofconventionaldatabasesystems

•  ButalsoBigDataisunderstoodasacapabilitythatallowscompaniestoextractvaluefromlargevolumesofdata

•  but,notice,thisdoesnotmeanonlyextremelylarge,massivedatabases

•  Besidesdatadimension,whatcharacterizesBigDataarealsotheheterogeneityinthewayinwhichinformationisstructured,thedynamicitywithwhichdatachanges,andtheabilityofquicklyprocessingit

•  Thiscallsfornewcomputingparadigmsorframeworks,notonlyadvanceddatastoragemechanisms

Page 10: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

TheThreeVs

TocharacterizeBigData,threeVsareused,whicharetheVsof

–  Volume

–  Velocity–  Variety

Page 11: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Volume•  Bigdataapplicationsarecharacterizedofcoursebybigamountsofdata,

wherebigmeansextremelylarge,e.g.,morethanaterabyte(TB)orpetabyte(PB),ormore.

•  Therearevariouscontextsinwhichthesedimensionscanbeeasilyreached:chattersfromsocialnetworks,webserverlogs,trafficflowsensors,satelliteimagery,broadcastaudiostreams,bankingtransactions,GPStrails,financialmarketdata,biologicaldata,etc.

•  Somemoreconcreteexamples:–  DespitesomeYoutubestatisticsareavailable1thetotalstoragecapacity

ofYoutubeit’snotknown,butrealisticallyitshouldbenolessthan1EB(2016)

–  NSAdatacenter:estimatedstoragecapacityofatleast2,000PBs(2013)2

–  Facebook:300PBdatawarehouse(2014)31http://web.archive.org/web/20150217015601/http://www.youtube.com/yt/press/statistics.html2http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/#1b66cc7c3cd23https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/

Page 12: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Volume•  Howmanydataintheworld?

AccordingtoIDC(InternationalDataCorporation):–  800Terabytes,2000–  160Exabytes,2006(1EB=1018B)–  500Exabytes,2009–  1.8Zettabytes,2011(1ZB=1021B)1–  2.8Zettabytes,20121–  4.4Zettabytes,2013–  175Zettabytesby20252(estimate)

1http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html2https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf3https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm

Around90%ofworld’sdatageneratedinthelast4years.

Thedigitaluniverseisdoublinginsize

everytwoyears3

Multipleofbitsorbytes

Symbol Name Decimalvalue

BinaryValue

k kilo 1000 1024

M mega 10002 10242

G giga 10003 10243

T tera 10004 10244

P peta 10005 10245

E exa 10006 10246

Z zetta 10007 10247

Y yotta 10008 10248

Page 13: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Volume

•  Thesheervolumeofdataisenoughtodefeatmanylong-followedapproachestodatamanagement

•  Traditionalcentralizeddatabasesystemscannothandlemanyofthedatavolumes,forcingtheuseofclusters

•  Datahavetobenecessarilydistributed,andthenumberofsourcesprovidinginformationcanbehuge,muchhigherthanthenumberconsideredintraditionaldataintegrationandvirtualizationsystems

Page 14: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Velocity

•  Data’svelocity(i.e.,therateatwhichdataiscollectedandmadeavailableintoanorganization)hasfollowedasimilarpatterntothatofvolume

•  Manydatasourcesaccessedbyorganizationsfortheirbusinessareextremelydynamic

•  Mobiledevicesincreasetherateofdatainflow:data“everywhere”,collectedandconsumedcontinuously

Page 15: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Velocity•  Someexamples:

–  Walmart:1milliontransactionperhour(2010)1–  eBay:datathroughputreaches100PBsperday(2013)2–  Googleprocesses100PBsperday(2013-14)3–  Facebook:600TBaddedtothewarehouseeveryday(2014)4–  6000-8000tweetspersecondeveryday(in2019)5

•  In2013,ithasbeenestimatedthateveryminuteofeverydaywecreated6:-  Morethan204millionemailmessages-  571newWebsitesand347blogpostscreated-  72hoursofnewYouTubevideos-  1.8millionsoflikeonFacebook-  216.000newphotosoninstagram-  $83.000spentonAmazon

1http://martinfowler.com/articles/bigData/2http://www.v3.co.uk/v3-uk/news/2302017/ebay-using-big-data-analytics-to-drive-up-price-listings3http://www.slideshare.net/kmstechnology/big-data-overview-2013-20144https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/5http://www.internetlivestats.com/twitter-statistics/6http://www.dailymail.co.uk/sciencetech/article-2381188/Revealed-happens-just-ONE-minute-internet-216-000-photos-posted-278-000-Tweets-1-8m-Facebook-likes.html

Page 16: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Velocity•  Processinginformationassoonasitisavailable,thusspeedingthe

“feedbackloop”,canprovidecompetitiveadvantages•  SomeexamplesofFastDataProcessing:

–  CustomerExperience/Retail:onlineretailsthatareabletosuggestadditionalproductstoacustomerateverynewinformationinsertedduringanon-linepurchase(Click-streamanalysis)

–  FinancialServicesIndustry:Algorithmictradingusingeventprocesstechnologybutalsoreal-timedataintegrationandanalytics

–  Telecommunication:understandallocationofnetworkresourcesbasedontrafficandapplicationrequirements,networkusagepatterns

–  Energy:real-timeprocessofhighvolumeofeventstomakeimportantdecisionsinordertoeffectivelyandefficientlymanagepossiblefaultsonthedistributionnetwork

–  Manufacturing:analyzereal-timemetricstotakecorrectiveactionbeforeafailureoccurs

Page 17: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Velocity

•  Streamprocessingisanewchallengingcomputingparadigm,whereinformationisnotstoredforlaterbatchprocessing,butisconsumedonthefly

•  Thisisparticularlyusefulwhendataaretoofasttostorethementirely(forexamplebecausetheyneedsomeprocessingtobestoredproperly),asinscientificapplications,orwhentheapplicationrequiresanimmediateanswer

Page 18: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Variety

•  Dataisextremelyheterogeneous:e.g.,intheformatinwhicharerepresented,butalsoandinthewaytheyrepresentinformation,bothattheintensionalandextensionallevel

•  E.g.,textfromsocialnetworks,sensordata,logsfromwebapplications,databases,XMLdocuments,RDFdata,etc.

•  Dataformatrangesthereforefromstructured(e.g,relationaldatabases)tosemistructured(e.g.,XMLdocuments),tounstructured(e.g.,textdocuments)

Page 19: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Variety

•  Asforunstructureddata,forexample,thechallengeistoextractmeaningforconsumptionbothbyhumansormachines

•  Entityresolution,whichistheprocessthatresolves,i.e.,identifies,entitiesanddetectsrelationships,thenplaysanimportantrole

•  Infact,thesearewell-knownissuesstudiedsinceseveralyearsinthefieldsofdataintegration,dataexchange,anddataquality.IntheBigDatascenario,however,theybecomeevenmorechallenging

Page 20: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

AfourthV:Veracity*

•  Dataareofwidelydifferentquality

•  Traditionallydataisthoughtofascomingfromwellorganizeddatabaseswithcontrolledschemas

•  Instead,in“BigData”thereisoftenlittleornoschematocontroltheirstructure

•  Theresultisthatthereareseriousproblemswiththequalityofthedata

*TheliteratureoftenmentionsonlythreeVsanddoesnotincludeveracity.

HoweversomeauthorstendtoincludeveracityasacorecharacteristcofBigData(alternatively,veracityisconsideredanaspectofvariety)

Page 21: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

BigData:V3+Value

BigDatacangeneratehugecompetitiveadvantages!

Page 22: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

ThevalueofDatafororganizations

•  Althoughit'sdifficulttogethardfiguresonthevalueofmakingfulluseofyourdata,muchofthesuccessofcompaniessuchasAmazonandGoogleiscreditedtotheireffectiveuseofdata1

•  Thuscompaniesspendlargeamountsofmoneytoreachthiseffectiveuse:AccordingtoIDC,in2017bigdataandanalyticssoftwaremarketreached$54.1billionwordlwide,anditisexpectedtogrowatafive-yearCAGR(compoundannualgrowthrate)of11.2%.(analysis2018-2022)2

•  ThusvariousBigDatasolutionsarenowpromotedbyallmajorvendorsindatamanagementsystems

1http://martinfowler.com/articles/bigData/2https://www.idc.com/getdoc.jsp?containerId=US44243318

Page 23: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Potentialvalue

Page 24: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Demandfornewdatamanagementsolutions*

•  Inthescenarioswedepicteditisnotsurprisingthatnewdatamangementsolutionsaredemanded

•  Indeed,despitethepopularityandwellunderstoodnatureofrelationaldatabases,itisnotthecasethattheyshouldalwaysbethedestinationfordata

•  Dependingonthecharacteristicofdata,certainclassesofdatabasesaremoresuitedthanothersfortheirmanagement

•  XMLdocumentsaremoreversatilewhenstoredindedicatedXMLstoragesystems(e.g.,MarkLogic)

•  SocialnetworkrelationsaregraphbynatureandgraphdatabasessuchasNeo4Jcanmakeoperationsonthemsimplerandmoreefficient

*From:EddDumbill.WhatisBigdata.InPlanningforBigData.O’ReillyRadarTeam

Page 25: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Demandfornewdatamanagementsolutions*

•  Adisadvantageoftherelationaldatabaseisthestaticnatureofitsschema

•  Inanagileenvironment,theresultsofcomputationwillevolvewiththedetectionandextractionofnewinformation

•  Semi-structuredNoSQLdatabasesmeetthisneedforflexibility:theyprovidesomestructuretoorganizedata(enoughforcertainapplications),butdonotrequiretheexactschemaofthedatabeforestoringit

*From:EddDumbill.WhatisBigdata.InPlanningforBigData.O’ReillyRadarTeam

Page 26: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

NoSQLdatabases*

Orbetter…notonlySQL•  Theterm"NoSQL"isveryill-defined.It'sgenerallyappliedtoa

numberofnon-relationaldatabasessuchasCassandra,Mongo,Dynamo,Neo4J,Riak,andmanyothers

•  Theyembraceschemalessdata,runonclusters,andhavetheabilitytotradeofftraditionalconsistencyforotherusefulproperties

•  AdvocatesofNoSQLdatabasesclaimthattheycanbuildsystemsthataremoreperformant,scalemuchbetter,andareeasiertoprogramwith

*From:MartinFowler.NoSQLDistilled.Preface.(http://martinfowler.com/books/nosql.html)

Page 27: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Graphdatabases

Page 28: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Key-valuesdatabases

Page 29: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Documentdatabases

Page 30: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

ColumnFamilyDatabases

Page 31: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

NoSQLdatabases*

•  Isthisthefirstrattleofthedeathknellforrelationaldatabases,oryetanotherpretendertothethrone?Ouranswertothatis"neither"

•  Relationaldatabasesareapowerfultoolthatweexpecttobeusingformanymoredecades,butwedoseeaprofoundchangeinthatrelationaldatabaseswon'tbetheonlydatabasesinuse

•  OurviewisthatweareenteringaworldofPolyglotPersistencewhereenterprises,andevenindividualapplications,usemultipletechnologiesfordatamanagement

*From:MartinFowler.NoSQLDistilled.Preface.(http://martinfowler.com/books/nosql.html)

Page 32: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

Multipletechnologiesfordatamanagement

Asanexercise,letusaskgooglewhichisthedatabaseengineusedbyFacebook.Wegetthefollowingtools1:•  MySQLascoredatabaseengine(infactacustomizedversion

ofMySQL,highlyoptimizedanddistributed)2•  Cassandra(anApacheopensourcefaulttolerantdistributed

NoSQLDBMS,originallydevelopedatFacebookitself)asdatabasefortheInobxmailsearch

•  Memcached,amemorycachingsystemtospeedupdynamicdatabasedrivenwebsites

•  HayStack,forstorageandmanagementofphotos•  Hive,anopensource,peta-bytescaledatawarehousing

frameworkbasedonHadoop,foranalytics,andalsoPresto,anexabytescaledatawarehouse3

1https://www.techworm.net/2013/05/what-database-actually-facebook-uses.html

2http://www.datacenterknowledge.com/data-center-faqs/facebook-data-center-faq-page-23http://prestodb.io/

Page 33: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

DataWarehouse•  Adatawarehouseisadatabaseusedforreportinganddata

analysis.Itisacentralrepositoryofdatawhichiscreatedbyintegratingdatafromoneormoredisparatesources

•  AccordingtoInmon*,adatawarehouseis:–  Subject-oriented:Thedatainthedatawarehouseisorganizedsothat

allthedataelementsrelatingtothesamereal-worldeventorobjectarelinkedtogether

–  Non-volatile:Datainthedatawarehouseareneverover-writtenordeletedoncecommitted,thedataarestatic,read-only,andretainedforfuturereporting

–  Integrated:Thedatawarehousecontainsdatafrommostorallofanorganization'soperationalsystemsandthesedataaremadeconsistent

–  Time-variant:Foranoperationalsystem,thestoreddatacontainsthecurrentvalue.Thedatawarehouse,however,containsthehistoryofdatavalues

*Inmon,Bill(1992).BuildingtheDataWarehouse.Wiley

Page 34: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

DataWarehousevs.BigData•  AreDataWarehouses(DWs)underthehatofBigData?

•  Thenotionofdatawarehousingdatesbacktotheendof80s,andverymanydatawarehouseandbusinessintelligencesolutionshavebeenproposedsincethen

•  BTW,BigDataandDWshavemanypointsincommon,atleastw.r.t.–  Volume:datawarehousesstorelargeamountsofdata,–  Variety:atleastinprinciple,datawarehousesintegrate

heterogeneousinformation–  Veracity:datawarehosesusuallyareequippedwithdatacleaning

solutions,appliedintheso-calledextract-transformation-load(ETL)phase

Page 35: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

DataWarehousevs.BigData

•  Existingenterprisedatawarehousesandrelationaldatabasesexcelatprocessingstructureddata,andcanstoremassiveamountsofdata,thoughatcost

•  However,thisrequirementforstructureimposesaninertiathatmakesdatawarehousesunsuitedforagileexplorationofmassiveheterogenousdata

•  Theamountofeffortrequiredtowarehousedataoftenmeansthatvaluabledatasourcesinorganizationsarenevermined

•  Therefore,newcomputingmodelsandframeworksareneededtomakenewDWsolutionscompliantwiththeBigDataecosystem.

Page 36: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

MapReduce

•  MapReduceisaprogrammingframeworkforparallelizingcomputation

•  OriginallydefinedatGoogle

•  Next,therehavebeenvariousimplementations

•  Awell-knownopensourcedistributionisApacheHadoop

Page 37: An Introduction to Big Data - uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdfThinking Big Data* "Big Data" has leapt rapidly into one of the most hyped terms in our industry,

MapReduce

AMapReduceprogramisconstitutedbytwocomponents•  Map()procedure(themapper)thatperformsfilteringand

sorting(itdecomposestheproblemintoparallelizablesubproblems)

•  Reduce()procedure(thereducer)devotedtosolvesubproblems

TheMapReduceFrameworkmanagesdistributedservers,whichexecutethevarioussubtasksinparallel,andcontrolscommunicationanddatatransfersbetweenthevariousservers,aswellasguaranteesfaulttoleranceanddisasterrecovery.