Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Projectacronym: EDSA
Projectfullname: EuropeanDataScienceAcademy
Grantagreementno: 643937
D2.6Learningresources3
DeliverableEditor: AlexanderMikroyannidis(OU)
Othercontributors:
HuwFryer(SOTON),ShathaJaradat(KTH),MihhailMatskin(KTH),AngiVoss(Fraunhofer),RyanGoodman(ODI),DavidTarrant(ODI),EmilyVacher(ODI),GillianWhitworth(ODI)
DeliverableReviewers:
InnaKoval(JSI),AngiVoss(Fraunhofer)
Deliverableduedate: 31/01/2018
Submissiondate: 24/01/2018
Distributionlevel: P
Version: 1.0
ThisdocumentispartofaresearchprojectfundedbytheHorizon2020FrameworkProgrammeoftheEuropeanUnion
ChangeLog
Version Date Amendedby Changes
0.1 14/11/2017 AlexanderMikroyannidis
Outlineandresponsibilitiesofcontributors.
0.2 10/01/2018 AlexanderMikroyannidis
Versionforinternalreview.
0.3 23/01/2018 AlexanderMikroyannidis
Revisedversion.
1.0 24/01/2018 AlexanderMikroyannidis
FinalQA.
D2.6Learningresources3Page3of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
TableofContents
ChangeLog............................................................................................................................................................................................2 TableofContents...............................................................................................................................................................................3 ListofTables........................................................................................................................................................................................4 ListofFigures......................................................................................................................................................................................4 1. Executivesummary...............................................................................................................................................................5 2. Introduction..............................................................................................................................................................................6 3. Programming/ComputationalThinking(RandPython)................................................................................8 3.1 Moduleoverview.........................................................................................................................................................8 3.1.1 LearningObjectives....................................................................................................................................................8 3.1.2 SyllabusandTopicDescriptions.........................................................................................................................8 3.1.3 AlgorithmicThinking................................................................................................................................................8 3.1.4 Computingasanabstraction.................................................................................................................................9 3.1.5 TheWebandtheCloud............................................................................................................................................9 3.1.6 DataManagement....................................................................................................................................................10 3.1.7 PythonProgramming.............................................................................................................................................10 3.2 Learningmaterials&deliverymethods.......................................................................................................10 3.3 Relevancetocurriculum.......................................................................................................................................12 3.4 Relevancetodemandanalysis..........................................................................................................................12 3.5 Furtherdevelopmentplans................................................................................................................................12
4. DataIntensiveComputing..............................................................................................................................................13 4.1 Moduleoverview......................................................................................................................................................13 4.1.1 PartI(Data-IntensiveComputing)SyllabusDescription....................................................................13 4.1.2 PartII(AdvancedTopicsinDistributedSystems)SyllabusDescription....................................15 4.1.3 PartIII(ScalableMachineLearningandDeepLearning)SyllabusDescription......................17 4.2 Learningmaterials&deliverymethods.......................................................................................................19 4.3 Relevancetocurriculum.......................................................................................................................................22 4.4 Relevancetodemandanalysis..........................................................................................................................22 4.5 Furtherdevelopmentplans................................................................................................................................22
5. SocialMediaAnalytics......................................................................................................................................................23 5.1 Moduleoverview......................................................................................................................................................23 5.2 Learningmaterials&deliverymethods.......................................................................................................26 5.3 Relevancetocurriculum.......................................................................................................................................27 5.4 Relevancetodemandanalysis..........................................................................................................................27 5.5 Furtherdevelopmentplans................................................................................................................................28
6. DataExploitationincludingdatamarketsandlicensing................................................................................29
Page4of36EDSAGrantAgreementno.643937
6.1 Moduleoverview......................................................................................................................................................29 6.1.1 Course1:Exploitingdatasciencetocreatevalue...................................................................................29 6.1.2 Course2:Datasciencemeansbusiness........................................................................................................30 6.1.3 Course3:Datascienceecosystems.................................................................................................................30 6.2 Learningmaterials&deliverymethods.......................................................................................................31 6.3 Relevancetocurriculum.......................................................................................................................................31 6.4 Relevancetodemandanalysis..........................................................................................................................32 6.5 Furtherdevelopmentplans................................................................................................................................35
7. Conclusion...............................................................................................................................................................................36
ListofTablesTable1:ThefinalEDSAcurriculum.---------------------------------------------------------------------------------6
ListofFiguresFigure1:SampleofthelearningmaterialsoftheProgramming/ComputationalThinkingmodule.11 Figure2:ScreenshotoftheProgramming/ComputationalThinkingmodulehostedbytheFutureLearnplatform.-------------------------------------------------------------------------------------------------11 Figure3:ScreenshotfromtheIntroductionpresentationofPartI(Data-IntensiveComputing).----19 Figure4:ScreenshotfromtheBigDataStoragepresentationofPartI(Data-IntensiveComputing).20 Figure5:ScreenshotfromthelearningmaterialsofPartII(AdvancedTopicsinDistributedSystems).------------------------------------------------------------------------------------------------------------------------------20 Figure6:ScreenshotfromthelearningmaterialsofPartII(AdvancedTopicsinDistributedSystems).------------------------------------------------------------------------------------------------------------------------------21 Figure7:ScreenshotfromthelearningmaterialsofPartIII(ScalableMachineLearningandDeepLearning).-----------------------------------------------------------------------------------------------------------------21 Figure8:ScreenshotfromthelearningmaterialsofPartIII(ScalableMachineLearningandDeepLearning).-----------------------------------------------------------------------------------------------------------------22 Figure9:ScreenshotfromthelearningmaterialsoftheSocialMediaAnalyticsmodule.--------------26 Figure10:ScreenshotfromthelearningmaterialsoftheSocialMediaAnalyticsmodule.------------27 Figure11:ScreenshotfromthelearningmaterialsoftheSocialMediaAnalyticsmodule.------------27 Figure12:GoogletrendanalysisforBigdataandDataScience.---------------------------------------------32 Figure13:TheEDSAdashboardskillschart.---------------------------------------------------------------------33
D2.6Learningresources3Page5of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
1. ExecutivesummaryThisdeliverablepresentsthemodulesthathavebeenaddedtotheEDSAcoursesportfolioduringthethird year (Y3)of theproject.According to the finalEDSA curriculumpresented inD2.3(M30), thefollowing4moduleswerescheduledforreleaseduringY3:
• Programming/ComputationalThinking(RandPython)• DataIntensiveComputing• SocialMediaAnalytics• DataExploitationincludingdatamarketsandlicensing
Theprojecthasbeenfocusedonbridgingthedatascienceskillsgapviaasupplyoftrainingmaterials.As a result, the EDSA courses portfolio includes a wide range of courses offered by renownededucational institutions both inside and outside the project consortium. These courses are selectedbasedontheirrelevancetotheEDSAcurriculumandtheEDSAdemandanalysis.
Thisdeliverablepresentsthelearningresourcesthathavebeenaddedtotheproject’scoursesportfolioinY3, inorder tocover the topicsof the finalEDSAcurriculumandaddress thecurrentdemandasidentifiedbytheEDSAdemandanalysis.
Page6of36EDSAGrantAgreementno.643937
2. IntroductionTheEDSAcoursesportfolioincorporatesawiderangeofhighqualitylearningresources,eitherofferedbyprojectpartnersorbythirdparties.Thesecoursesareavailableas:
• MassiveOpenOnlineCourses(MOOCs)• Face-to-facecourses• Onlinecourses• Blendedcourses(deliveredface-to-faceandonline)
ThemaincriteriafortheselectionofcoursesforinclusionintheEDSAcoursesportfolioaretheEDSAcurriculumandtheEDSAdemandanalysis.CoursesareselectedbasedontheirpotentialofaddressingtheEDSAcurriculumtopicsaswellasthetrainingneedsofdatascientistsasidentifiedbytheEDSAdemandanalysis.WithregardstocompliancewiththeEDSAdemandanalysis,coursesareevaluatedagainsttherecommendationsoftheStudyEvaluationReport(D1.4).WithregardstocompliancewiththeEDSAcurriculum,coursesareevaluatedagainstthelatestversionofthecurriculumandthetopicsitaddresses.
ThefinalversionoftheEDSAcurriculum,togetherwithadiscussionaboutitsobjectivesandhowthesehavebeenachieved,hasbeenpresented inD2.3and isshown inTable1. In this table,modulesaregroupedbythestagetheybelongto.ThemodulesscheduledforreleaseinY3arehighlightedandarethefollowing:
1. Programming/ComputationalThinking(RandPython)2. DataIntensiveComputing3. SocialMediaAnalytics4. DataExploitationincludingdatamarketsandlicensing
Themodulespresentedinthisdeliverablearethereforecentredaroundtheabove4topics.
Table1:ThefinalEDSAcurriculum.
Topic Stage Schedule AllocatedPartner
FoundationsofDataScience Foundations M6 SOTON
FoundationsofBigData Foundations M6 JSI
Statistical/MathematicalFoundations Foundations M18 JSI
Programming/ComputationalThinking(RandPython)
Foundations M30 SOTON
BigDataArchitecture StorageandProcessing
M6 Fraunhofer
DistributedComputing StorageandProcessing
M6 KTH
D2.6Learningresources3Page7of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
DataManagementandCuration StorageandProcessing
M18 TU/e
LinkedDataandtheSemanticWeb StorageandProcessing
M18 SOTON
DataIntensiveComputing StorageandProcessing
M30 KTH
MachineLearning,DataMiningandBasicAnalytics
Analysis M6 Persontyle
ProcessMining Analysis M6 TU/e
BigDataAnalytics Analysis M18 Fraunhofer
DataVisualisationandStorytelling InterpretationandUse
M18 ODI
SocialMediaAnalytics InterpretationandUse
M30 Fraunhofer
DataExploitationincludingdatamarketsandlicensing
InterpretationandUse
M30 ODI
The remainderof thisdeliverablepresents themodules thathavebeen incorporated into theEDSAcoursesportfolioinY3,basedontheirrelevancetotheEDSAcurriculumandtheEDSAdemandanalysis.Foreachmodule,wepresentanoverviewofitsobjectivesandlearningoutcomes,weidentifythetypesoflearningmaterialsusedandhowthesearedelivered,andwediscusshoweachmoduleisrelatedtothe EDSA curriculum and the demand analysis, as well as the plans in place for their furtherdevelopment.
Page8of36EDSAGrantAgreementno.643937
3. Programming/ComputationalThinking(RandPython)
3.1 ModuleoverviewThisisacourseaboutgettingpeopletothinklikecomputerscientists. Manypeoplethinkcomputerscienceisonlyaboutwritingcode.Whilstthisisapartofit,inthiscourseweaimtoshowmoreoftheprocessesbehindwritingthecode,addingthecodeasanafterthought.WebasethecourseononeofourmodulesattheUniversityofSouthampton,whichweusetogetpeoplefrombackgroundsotherthancomputerscienceuptospeedonourWebScienceprogrammes.WealsousetheworkofWing(2006)1aboutcomputationalthinking.
In addition to being able to focus on thinking like a computer scientist,we alsowish to introducelearnerstothepracticalskillstheyneedtoputthisthinkingintopractice.Assuch,eachweekwehaveincludedinstructionsaboutusingthePythonprogramminglanguagetoapplycorecomputationalandprogrammingconcepts.
ThiscourseisanexceptionwhencomparedtotheothercoursesintheEDSAcurriculum,inthatitisafoundationalcoursewhichprovidesbackgroundtocompletetheremainderofthedatasciencecourses.Assuch,itisslightlyfurtherawayfromwhatmightbeconsidereda“core”ofdatascience.
3.1.1 LearningObjectivesAftercompletionofthiscourse,participantswillbeableto:
1. Describehowcomputingispossiblethroughaseriesofabstractions2. Assessthemosteffectivealgorithmforaparticulartask3. Designsimplealgorithmsforcommoncomputingtasks4. RetrievedatastoredfromaWebAPI,andstoreitinarelationaldatabase5. DevelopsimpleapplicationsusingthePythonprogramminglanguage6. Approacheverydayproblemswithacomputationalfocus
3.1.2 SyllabusandTopicDescriptionsThisisanintroductorycourse,whichintroducestheconceptsrequiredtobeabletoworkasacomputerscientist.Thelayoutofthecourseisasfollows:
3.1.3 AlgorithmicThinkingInthissection,studentsareintroducedtotheconceptthattherearedifferentwaysofperformingtasks,someofwhicharebetterthanothers.Inaddition,withthesedifferentways,thestudentswilllearnhowanalgorithmmaybeevaluated.
Algorithms:
● Introduction-Introductiontoalgorithms,andwhytheyareuseful.Comparesacomputationalalgorithmtoafoodrecipe
● DifferentMethods-Therearemanywaysofdoingthesamething,eachwithadvantagesanddisadvantages.Showsthatabetteralgorithmcanleadtoagreaterimprovementinperformancethanfastercomputers
● Evaluation-Havingestablishedthatmorethanonealgorithmcanperformthesametask,weintroducesomeformalmeansofevaluatingtheaveragecase,bestcase,andworstcasescenarios.
1http://dl.acm.org/citation.cfm?id=1118215
D2.6Learningresources3Page9of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
● Tractability-Notallalgorithmscancompleteinareasonabletime,sometimesanapproximatesolutionisabetteroption
DataStructures:
● Introduction -Datastructureshavedifferentforms,tooptimiseaparticulartypeoffunctiongivencertainconstraints
● Simpledatastructures-Introducesthestackandqueueasdatastructures,andhowefficientlytheyperformcertaintasks
3.1.4 ComputingasanabstractionIn this section, the course looks at ways in which computers make use of black boxes and otherabstractionstobuilduponwhathasbeencompletedbefore:
● Computerarchitecture-Acomputerisaclearexampleofanabstraction,discusseshowitgetsfromswitchesandgatestoanoperatingsystem
○ Binary/machinecode,logicgates,hex,assembly● Programmingasanabstraction-Computersaregoodatautomation,hereweseehowthe
automationofsomeprinciplesleadsto○ Compiledvsinterpreted○ Differenttypesofprogramming,e.g.,procedural,functional,objectoriented,eventbased○ Higherlevellanguages
● Objectorientedprogramming-Aclassisanabstractionofathing,containingrepresentationsonlyofthethingswewantitto.
● Dataasanabstraction○ Howdataorinformationareabstractedfrombinarytorepresenttext,statistics,orother
mediacontent
3.1.5 TheWebandtheCloudTheWorldWideWeb(theWeb)iscentraltomoderndataprocesses.Inthissection,weintroducetheWebasaninstanceofacomputernetwork. WiththeubiquityofInternetconnections,aninevitableextensionwastheconceptoftheCloud,whereonlineoperatorsprovideandmanageserviceswhichwouldpreviouslyhavebeendoneonabusiness’sowninfrastructure.
● Whatarecomputernetworks?Anintroductiontobasicconceptsofhowdataaretransportedacrossanetwork
● TheWWW-Themostwidelyknownnetwork○ Client/servermodel-ExplainsthemodelwhichisusedbytheWeb○ HTTP-Explainsthebackgroundtohowtheprotocolhandlesrequestandresponseto
connecttotheWeb,andtheuseofdifferentRESTverbs○ DisplayingpagesontheWeb
■ HTML&CSS-separationofconcernsbetweencontentandpresentation● TheCloud
○ Introductiontothecloud-Servicesprovidedbycloudcompaniesvaryfromprovisionofhardwaretofullyrunningservices.Thissectionexplainsthecloudbusinessmodel.
○ Types of cloud services - This section introduces the different types of servicesprovidedbycloudcompanies,suchasIaas,PaaS,SaaS
Page10of36EDSAGrantAgreementno.643937
3.1.6 DataManagementThissectionuses“datamanagement”inabroadway,astodefinealldifferentmeansofmanipulatingandrepresentingdata.
● Introductiontodatamanagement○ Storedatasomewhere,whilstbeingabletohaveaccesstoit○ Differenttrade-offswithdifferentsituations○ Typesofdata:Unstructured,semi-structured,structured○ Differentscenariosfordatamanagement
● Data Representation - Provides examples of a variety of scenarios of how data may berepresented,suchthatitcanbemanipulatedforanalysisinthefuture.
○ Text files - The simplest way of representing data. Introduces the following dataformats:csv/tsv/json/xml,andgoodpracticesforlogfiles
○ Images,videos-Followinganalysisofdataitmayberepresentedinanimageorvideo;orthesecouldthemselvesbeusedasdatatobeanalysed.Discussionofcommonimageandvideoencoding.
○ Relationaldatabase-Themostcommonwayofstoringdata,usingSQLtomanipulateandretrieve.
○ NoSQL-IntroducestheCAPtheorem,anddiscussestrade-offsbetweenrelationalandnon-relationaldata.
○ APIs-DatastoredinsomemannercanbemadeavailablefromanAPI.BuildsonthediscussionfromHTTPandintroducesRESTAPIs
3.1.7 PythonProgrammingThisisdonealongsidetheothersections,asameansofapplyingthetheoreticalconceptsintroducedintheotherweek. Itwill not give expertise inPythonprogramming, butwill provide a foundation todevelopexpertiseinthefuture.TheintentionisnottoteachhowtoprograminPythonsomuch,butrathertoapplytheprinciplesintheremainderofthecourse.
PythonischosenaboveRforthiscurriculum,sinceitismoregeneralpurpose,andmorerelevanttotheprinciples of programming generally as opposed to R which fits better as a language for themathematicalsideofprogramming.Inaddition,thereareothercourseswithinthiscurriculumwhichadditionallycoverR,suchas‘StatisticalandMathematicalFoundations’.TeachingPythonwillcoveronlytheessentialssufficienttoperformthesetasks,withstudentsreferredtoothercoursesshouldtheywishtogainamoredetailedunderstanding.
● Settingupadevelopmentenvironment● Primitivetypes● Controlflowandloops● Classesandfunctions
3.2 Learningmaterials&deliverymethodsThecourseisofferedbytheUniversityofSouthamptonandwillbedeliveredviatheFutureLearnMOOCplatform.Thematerialsaredeliveredthroughacombinationofvideos/slides,aswellastextwritteninmarkdownintheplatformitself.Participantsgetanopportunitytopracticetheskillstheyhavelearntthroughexercises,andquizzesintheFutureLearnsystem.
D2.6Learningresources3Page11of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
Figure1:SampleofthelearningmaterialsoftheProgramming/ComputationalThinking
module.
Figure2:ScreenshotoftheProgramming/ComputationalThinkingmodulehostedbythe
FutureLearnplatform.
Page12of36EDSAGrantAgreementno.643937
3.3 RelevancetocurriculumThis course is a foundational module in the EDSA curriculum, scheduled to be released inM36. Itaddresses the “Computational thinking” module, as this was defined in the final EDSA curriculumpresentedinD2.3.
3.4 RelevancetodemandanalysisInD1.4,itwasrecommendedthatEDSAshouldcreateadatascienceskillsframework,whichwasdonebasedon theEDISONskills areas. Inaddition, thedemand fordifferent skills basedon thedemanddashboardwasusedinordertoidentifyrelevantskills.AsdiscussedinD2.3,therewasadistinctionbetweenwhatshouldcomprisecore“datascience”,andskillswhichwouldberequiredtoobtainajobindatascience.Thiscoursefillsthegapasabasiccourse,whichprovidestheprerequisitesforbeingabletoperformthemoreadvancedtechnicalskillswhilstallowingothercoursestoretainthefocusoncoredatascienceskills.
3.5 FurtherdevelopmentplansThiscoursewillbereleasedontheFutureLearnplatforminearly2018.Followinginitialrun,feedbackwillbesoughtwithaviewtodevelopingbasedonthis.Inaddition,extraareaswillbeinvestigatedforpossibleinclusioninfutureversionsofthecourse.
D2.6Learningresources3Page13of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
4. DataIntensiveComputing
4.1 ModuleoverviewThedataintensivecomputingmoduleiscomposedofthreeparts.PartI(Data-IntensiveComputing)provides a solid foundation for understanding large scale distributed systems used for storing andprocessingmassivedata.Awidevarietyofadvancedtopicsindataintensivecomputingareintroduced.Thesetopicsincludedistributedfilesystems,NoSQLdatabases,processingdata-at-rest(batchdata)anddata-in-motion (streaming data), graph processing, and resource management. Part II (AdvancedTopicsinDistributedSystems)complementsthefirstpart,byfocusingontheconceptsofgraphtheorywhichallowtoexplaintheconnectivityanddynamicsofmanyreal-worldnetworks.TheobjectiveofPartIIistoprovideadeeperunderstandingandstudyofthebehaviourofthenetworksarisingwithinDistributedSystems.ThecourseinPartIIwillcoverthetopicsofDistributedDataManagement,LargeGraphProcessing,Publish/SubscribeSystemsandNavigableSmall-WorldOverlays.PartIII(ScalableMachineLearningandDeepLearning)providesthestudentswithasolidfoundationforunderstandinglarge-scale machine learning algorithms, in particular, Deep Learning, and their application areas.Techniques for efficient parallelization of machine learning algorithms are explained to show theintersectionbetweendistributedsystemsandmachinelearningfields.
4.1.1 PartI(Data-IntensiveComputing)SyllabusDescriptionDetailedContent
Syllabus Description
Introduction Conceptsandprinciplesofcloudcomputinganddataintensivecomputing.CloudcomputingandBigData(maintrends,definitionsandcharacteristics).CloudComputingModels:IaaS,PaaSandSaaS.Clouddeploymentmodels.DimensionsofBigData:Volume,Velocity,Variety,Vacillation.BigDataStack:Processing,Storage,ResourceManagement.
Storage DistributedFileSystems.GoogleFileSystem(GFS).MasterOperations.SystemInterations.FaultTolerance.FlatDatacenterStorage(FDS).DatabasesandDatabasemanagement.NoSQL.Dynamo.DataConsistency.BigTable.Cassandra
ParallelProcessing
ProgrammingLanguages:CrashcourseinScala.MapReduce.FlumeJava.Dryad.SparkandSparkSQL.ProjectTungsten
StreamProcessing
Introductiontostreamprocessing.SPSprogrammingmodel.DataFlowCompositionandManipulation.Parallelization.FaultTolerance.DistributedmessagingSystem.Kafka.Storm.SEEP.Naiad.SparkStreaming.StructuredStreaming.FlinkStream.MillWheel.GoogleCloudDataflow
GraphProcessing
GraphAlgorithmsCharacteristics.Data-parallelvs.Graph-ParallelComputation.Pregel.GraphLab.PowerGraph(GrpahLab2).GraphX.X-Stream.Edge-CentricprogrammingModel.Streamingpartitions.Chaos.Storageandcomputationmodel
Machine DataandKnowledge.KnowledgeDiscoveryfromData.Data
Page14of36EDSAGrantAgreementno.643937
learningwithMllibandTensorflow
MiningFunctionalities.ClassificationandRegression(SupervisedLearning).Clustering(Unsupervisedlearning).MLlib.Tensorflow
ResourceManagement
Mesos.YARN
ExistingMaterials
• DataIntensiveComputing,RoyalInstituteoftechnology,KTH• Data-intensiveComputingSystems,DukeUniversity,
https://www.cs.duke.edu/courses/spring15/compsci516/• Data-IntensiveComputing,IllinoisInstituteofTechnology,
http://www.cs.iit.edu/~iraicu/teaching/CS554-F13/• Data-IntensiveScalableComputing,BrownUniversity,http://cs.brown.edu/courses/csci2950-
u/f11/• Data-IntensiveComputing,TheCityUniversityofHongKong,
https://www.cityu.edu.hk/ug/201415/course/CS4480.htm• DataIntensiveComputingandClouds,UniversityofCentralFlorida,
http://www.eecs.ucf.edu/~jwang/Teaching/EEL6938-s12/• DataIntensiveComputing,UniversityofBuffalo,
http://www.cse.buffalo.edu/shared/course.php?e=CSE&n=587
FurtherReading
• P.Mell,andT.Grance.TheNISTDefinitionofCloudComputing• 800-145.NationalInstituteofStandardsandTechnology(NIST),Gaithersburg,MD,(September2011)• Armbrust,M.,Fox,A.,Griffith,R.,Joseph,A.D.,Katz,R.H.andKonwinski,A.,Lee,G.,Patterson,D.A.
,Rabkin,A.,Stoica,I.,andZaharia,M.AbovetheClouds:ABerkeleyViewofCloudComputing.EECSDepartment,UniversityofCalifornia,Berkeley,TechnicalReportNo.UCB/EECS-2009-28,February10,2009
• S.Ghemawat,H.dGobioff,andShun-TakLeung.TheGoogleFileSystem.19thACMSymposiumonOperatingSystemsPrinciples,LakeGeorge,NY,October,2003
• E.B.Nightingale,J.Elson,J.DeanandS.Ghemawat.FlatDatacenterStorage.USENIXAssociation10thUSENIXSymposiumonOperatingSystemsDesignandImplementation(OSDI’12).
• DeCandia,G.,Hastorun,D.,Jampani,M.,Kakulapati,G.,Lakshman,A.,Pilchin,A.,Sivasubramanian,S.,Vosshall, P., andVogels,W.Dynamo:Amazon'sHighlyAvailableKey-value Store. ProceedingsofTwenty-firstACMSIGOPSSymposiumonOperatingSystemsPrinciples,SOSP’07
• Chang,F.,Dean, J., Ghemawat,S.,Hsieh,W.C.,Wallach,D.A., Burrows,M., Chandra,T., Fikes,A.,Gruber,R.E.Bigtable:ADistributedStorageSystemforStructuredData.ACMTrans.Comput.Syst.,4:1--4:26,2008
• Lakshman,andP.Malik.Cassandra:adecentralizedstructuredstoragesystem.OperatingSystemsReview44(2):35-40(2010)
• J. Dean , S. Ghemawat. MapReduce: simplified data processing on large clusters. OSDI’04:PROCEEDINGSOFTHE6THCONFERENCEONSYMPOSIUMONOPERATINGSYSTEMSDESIGNANDIMPLEMENTATION
D2.6Learningresources3Page15of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
• Chambers,A.Raniwala, F. Perry, S.Adams,R.Henry,R.BradshawandNathan. FlumeJava:Easy,EfficientData-ParallelPipelines.ACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation(PLDI),ACMNewYork,NY2010,2PennPlaza,Suite701NewYork,NY10121-0701(2010),pp.363-375
• Zaharia,M.,Chowdhury,M.,Das,T.,Dave,A.,Ma,J.,McCauley,M.,Franklin,M.J.,Shenker,S,Stoica,I.Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing.Proceedings of the 9th USENIX Conference onNetworked SystemsDesign and Implementation.2012
• G.Cugola,AlessandroMargara.Processingflowsofinformation:Fromdatastreamtocomplexeventprocessing.ACMComput.Surv.44(3):15:1-15:62(2012)
• J.-H.Hwang,M.Balazinska,A.Rasin,U.Çetintemel,M.Stonebraker,S.B.Zdonik:• High-AvailabilityAlgorithmsforDistributedStreamProcessing.ICDE2005:779-790• R.C.Fernandez,M.Migliavacca,E.Kalyvianaki,andP.Pietzuch.2013.Integratingscaleoutandfault
toleranceinstreamprocessingusingoperatorstatemanagement.InProceedingsofthe2013ACMSIGMODInternationalConferenceonManagementofData(SIGMOD'13).ACM,NewYork,NY,USA,725-736.
• M.Zaharia,T.Das,H.Li,S.Shenker,andI.Stoica.2012.Discretizedstreams:anefficientandfault-tolerantmodelforstreamprocessingonlargeclusters.InProceedingsofthe4thUSENIXconferenceonHotTopicsinCloudCcomputing(HotCloud'12).USENIXAssociation,Berkeley,CA,USA,10-10.
• T.Akidau,R.Bradshaw,C.Chambers,S.Chernyak,R.J.Fernández-Moctezuma,R.Lax,S.McVeety,D.Mills, F. Perry, E. Schmidt, and S. Whittle. 2015. The dataflow model: a practical approach tobalancingcorrectness,latency,andcostinmassive-scale,unbounded,out-of-orderdataprocessing.Proc.VLDBEndow.8,12(August2015),1792-1803.
• X.Meng,J.Bradley,B.Yavuz,E.Sparks,S.Venkataraman,D.Liu,J.Freeman,DBTsai,M.Amde,S.Owen,D.Xin,R.Xin,M. J.Franklin,R.Zadeh,M.Zaharia,andA.Talwalkar.2016.MLlib:machinelearninginapachespark.J.Mach.Learn.Res.17,1(January2016),1235-1241.
4.1.2 PartII(AdvancedTopicsinDistributedSystems)SyllabusDescriptionDetailedContent
Syllabus Description
Introduction Networks:behaviouranddynamics.AnalysisofNetworks.Networksversusgraphs.Complexityofnetworks.
MainConcepts Basicdefinitions.Paths.Cycles.Connectivity.(Giant)Components.Distance.
Networkmodels
G(n,m)model.Erdos-Renyirandomgraph.Randomgraphsandrealworld.Watts-Strogatzmodel.Small-Worldsmodel.Preferentialattachmentmodel.Randomwalks.
PageRank, ConvergenceofRandomWalk.RelationtoWebsearch.Google
Page16of36EDSAGrantAgreementno.643937
GraphSpectra
pagerank.Topicspecificpagerank.Graphspectrum.Spectralgraphpartitioning.
GraphExploration,NavigableSmall-WorldNetworks
Howtoexplorebignetworks?Influenceofnodedegree.Milgram’s(Small-worlds)experiment.ImplicationsforP2Psystems.NavigationinWatts-StrogatzSmall-Worlds.Kleinberg’smodelofSmall-Worlds.
NavigableStructuredOverlays,GossipingAlgorithms,Cyclon
Small-WorldsbasedP2Poverlays.ApproximationofKleinberg’smodel.TraditionalDHTsandKleinberg’smodel.Basicnavigationprinciples.Gossipingalgorithms.Cyclon.
Topologyconstructionbygossiping.Publish/SubscribeSystems
Proactivegossipframework.Topologycreation.T-man.Publish/Subscribesystems.Topicbasedpublish/subscribe.ContentTopicbasedpublish/subscribe.Filterbasedrouting.Overlay-Per-Topicpublish/subscribe.SpiderCast.Interest-Aware(greedy)Links.Tera.Rendezvousbasedpublish/subscribe.
HybridPub/SubSystemsScribe
Decentralizedpublish/subscribe.Vitis.BuildingNavigableStructure.
ExistingMaterials
• AdvancedTopicsinDistributedComputing,atRoyalInstituteofTechnology,KTH• NetworkAnalysis,UniversityofMichigan,
https://www.icpsr.umich.edu/icpsrweb/sumprog/courses/0131• SocialandInformationNetworkAnalysis,StanfordcenterforProfessionalDevelopment,
http://scpd.stanford.edu/search/publicCourseSearchDetails.do?method=load&courseId=7932016• SocialNetworkAnalysis,UniversityofMichigan,Coursera,https://www.class-
central.com/mooc/338/coursera-social-network-analysis• NetworkAnalysisandModelling,SantaFeInstitute,
http://tuvalu.santafe.edu/~aaronc/courses/5352/
D2.6Learningresources3Page17of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
FurtherReading
• "Networks,Crowds,andMarkets:ReasoningAboutaHighlyConnectedWorld"byDavidEasleyandJonKleinberg
• "Networks:AnIntroduction"byMarkNewman• “FoundationsofDataScience”byJ.HopcroftandR.Kannan
4.1.3 PartIII(ScalableMachineLearningandDeepLearning)SyllabusDescriptionMainTopics:
• MachineLearning(ML)Principles
• UsingScalableDataAnalyticsFrameworkstoparallelizemachinelearningalgorithms
• DistributedLinearRegression
• DistributedLogisticRegression
• DistributedPrincipalComponentAnalysis
• LinearAlgebra,ProbabilityTheoryandNumericalComputation
• FeedforwardDeepNetworks
• RegularizationinDeepLearning
• OptimizationforTrainingDeepModels
• ConvolutionalNetworks
• SequenceModelling:RecurrentandRecursiveNets
DetailedContent
Syllabus Description
Introduction Abriefhistoryandapplicationexamplesofdeeplearning,Large-ScaleMachineLearning:atGoogleandinindustry,MLbackground,briefoverviewofDeepLearning,understandingDeepLearningSystems,LinearAlgebrareview,ProbabilityTheoryreview.
DistributedMLandLinearRegression
SupervisedandUnsupervisedlearning,MLpipeline,Classificationpipeline,Linearregression,DistributedML,ComputationalComplexity
GradientDescentandSparkML
Optimizationtheoryreview,gradientdescentforLeastSquaresRegression,TheGradient,Large-ScaleMLPipelines,FeatureExtraction,FeatureHashing,ApacheSparkandSparkML.
Page18of36EDSAGrantAgreementno.643937
LogisticRegressionandClassification
ProbabilisticInterpretation,MultinomialLogisticClassification,ClassificationExampleinTensorflow,QuickLookinTensorflow
FeedforwardNeuralNetsandBackprop
NumericalStability,NeuralNetworks,FeedforwardNeuralNetworks,FeedforwardPhase,Backpropagation
RegularizationandDebugging
AFlowofDeepLearning,TechniquesforTrainingDeepLearningNets,Regularization,WhydoesDeepLearningwork?
ConvolutionalNeuralNetworks
HowConvolutionalNeuralNetsWork,ConvNetswithDepth,MemoryComplexity,CaseStudies,ConvNetsforEverything
RecurrentNeuralnetworks,Sequence-to-SequenceLearningandAutoencoders,LongShort-termMemory(LSTM),Attention,Autoencoders:UnsupervisedFeatureLearning,
DeepReinforcementLearning
MarkovdecisionProcesses,OverviewofReinforcementlearning,SupervisedLearningvsReinforcementLearning,DeepRL,Q-learning,DeepPolicyNetworks,DistributedRLArchitecture,AsynchronousRL.
CaseStudy AlphaGo,WhenwillDeepLearningbecomeIntelligent?
ExistingMaterial
• ScalablemachinelearningandDeepLearning,atRoyalInstituteoftechnology,KTH
• ScalableMachineLearning,edX,https://courses.edx.org/courses/BerkeleyX/CS190.1x/1T2015/info
• DistributedMachineLearningwithApacheSparkhttps://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x
• DeepLearningSystems,UniversityofWashington,http://dlsys.cs.washington.edu/
• ScalableMachineLearning,UniversityofBerkeley,
https://bcourses.berkeley.edu/courses/1413454/
D2.6Learningresources3Page19of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
FurtherReading
• IanGoodfellowandYoshuaBengioandAaronCourville.DeepLearning,MITPress• SparkMLPipelines,http://spark.apache.org/docs/latest/ml-pipeline.html• SparkMLOverview,https://www.infoq.com/articles/apache-sparkml-data-pipelines• SparkMLundertheHood,http://spark.tc/machine-learning-in-apache-spark-2-0-under-the-hood-
and-over-the-horizon-2/?cm_mc_uid=02948943737214702940997&cm_mc_sid_50200000=1472964414
• Tensorflow,https://www.tensorflow.org/versions/r0.11/tutorials/
4.2 Learningmaterials&deliverymethodsThecourseisofferedbyKTHandeachofitspartsisdeliveredasfollows:
PartI:Data-IntensiveComputing
The course consists of thirteen face-to-face lectures in the form of presentations. Slides and videolecturesareprovidedtostudentsthroughaninternaluniversityportal.Studentshavetostudyacoupleof research papers before each lecture. Four Exercises are provided as lab assignments,which arediscussedwithstudentswith teacherassistants.Tworeadingassignmentsare toreviewpapersandencouragethestudentstoanalyzepapersinacriticalway.
Figure3:ScreenshotfromtheIntroductionpresentationofPartI(Data-IntensiveComputing).
Page20of36EDSAGrantAgreementno.643937
Figure4:ScreenshotfromtheBigDataStoragepresentationofPartI(Data-Intensive
Computing).
PartII:AdvancedTopicsinDistributedSystems
Thecourseconsistsofeightface-to-facelecturesintheformofpresentations.Slidesandvideolecturesareprovidedtostudents.Studentshavetostudyacoupleofresearchpapersbeforeeachlecture.Tworeadingassignmentsaretoreviewpapersandencouragethestudentstoanalyzepapersinacriticalway.Studentsarerequiredtogiveanoralpresentationrelatedtothepaperstheyreview.Belowaresomescreenshotsfromthecourselearningmaterials.
Figure5:ScreenshotfromthelearningmaterialsofPartII(AdvancedTopicsinDistributed
Systems).
D2.6Learningresources3Page21of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
Figure6:ScreenshotfromthelearningmaterialsofPartII(AdvancedTopicsinDistributed
Systems).
PartIII:ScalableMachineLearningandDeepLearning
Thecourseconsistsofelevenface-to-facelecturesintheformofpresentations.Studentsarerequiredto work on programming assignments using Scala/python programming languages. They are alsorequiredtosucceedinaprojectwhichideatheychoosefromthecontextofthecourseandusingdeeplearning.Belowaresomescreenshotsfromthecourselearningmaterials.
Figure7:ScreenshotfromthelearningmaterialsofPartIII(ScalableMachineLearningand
DeepLearning).
Page22of36EDSAGrantAgreementno.643937
Figure8:ScreenshotfromthelearningmaterialsofPartIII(ScalableMachineLearningand
DeepLearning).
4.3 RelevancetocurriculumThecoursecorrespondstothe“Data-IntensiveComputing”moduleoftheEDSAcurriculum,asthiswaspresentedinD2.3.
4.4 RelevancetodemandanalysisThecoursesprovideasolidfoundationforstudentsinmachinelearninganddataminingcombinedwithdistributedcomputing.Twoofthesecoursesaregiveninablendedformat,andthethirdoneisaface-to-facecourse.Thecoursesarerichintopsoftskillsrequiredcurrentlyinthedatasciencefield.Studentsareencouragedtodopeer-reviewsdiscussions,oralpresentationsandpaperreviewsinthesecourses,which improve their communication skills. All courses in this module use open-source tools andprogramming libraries. There are plans to make some of these courses reachable across differentorganizations.
4.5 FurtherdevelopmentplansThecontentsofallcoursesofthismoduleareupdatedaccordingtorecentpapersandimprovementsinthefield.Labassignmentsarealsoenhancedeachtimethecourseisgiven.Studentsprovideevaluationat the end of course about the programming languages, and platforms that are used. Courses areadjustedaccordingtoreviewsthatarerelevant.ThereareplanstoprovidesomeofthesecoursesasMOOCcourses.
D2.6Learningresources3Page23of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
5. SocialMediaAnalytics
5.1 ModuleoverviewTheparticipantsgetasoundoverviewofsocialmediaanalysisproblemsandanalysistechniques.Targetistheextractionandsummarizationofuserstatementsfromsocialmediasources.Therelevanttextdata mining concepts and deep learning approaches will be introduced. The underlying modelassumptionsarediscussedasfarastheyareimportantforthepracticalapplicationandinterpretationofresults.Specialemphasisisputontheperformanceevaluationofprocedures.Tobeabletoprocesscomprehensive collections specific Big Data implementationswill be utilized. Most approaches aredemonstrated by stepping through small Python scripts to show the necessary computations. TheparticipantswillbeabletorealisticallyassesstheapplicationofsocialmediaanalysistechnologiesfordifferentusagescenariosandcanstartwiththeirownexperimentsusingdifferentanalysispackageswithPython.
Syllabus Conceptsandmethods
Introduction
•SocialMedia
•MediaAnalytics
•Recentsuccessstories
Typesofsocialmedia
Userbaseofsocialmedia
Relevanceofsocialmediaanalysis
Analysisbymachinelearning
Mainmachinelearningparadigms
Differenttypesofanalysisquestions
Importantapplications
Downloadandpreparation
•Documentformats
•Corpora
•Google,Twitter,Facebook
•Preparationofdata
Searchenginesfordocumentcollection
crawlingpackages
parsingofdifferentdocumentformats
languagedetection
sentencesplittingandtokenization
partofspeechtagging
Examplescripts
Classifysocialmediaposts
•Commonapproaches
•Performanceevaluation
•Application
Definition
Differenttypes:logisticregression,
supportvectormachine
Overfitting
testandperformancemeasures
featureselection
BigDataapproaches:Spark,TensorFlow
ExampleScripts
Page24of36EDSAGrantAgreementno.643937
Wordsimilarityinposts
•Groupwordsinposts
•Detectingtopicsinposts
Unsupervisedlearningofwordproperties
topicmodels:Assumingunderlyingtopics
word2vec:predictingwordsintheneighborhood
stochasticoptimization
negativesampling
BigDataimplementations:Spark,Tensorflow
Examplescripts
Clusteringsocialmediaposts
•Similarityofpost
•visualization
Definingsimilaritymeasuresusingtopicmodelsandwordembeddings
Clusteringapproaches
Visualizationandspecialization
Interpretationofresults
Examplescripts
Detectingnames,products
•Detectingentities
•Longrangedependencies
Predictpropertiesofwords
Probabilistic:ConditionalRandomField
recurrentneuralnetwork
automaticfeatureselection
wordmodels
bidirectionalrecurrentneuralnetwork
testandperformancemeasures
practicaltrainingandevaluation
Examplescripts
Opinionsinsocialmedia
•Predictusersentiment
•opinionsandaspects
•phrasesandtheirrelations
Attachinglabelstophrasesandsentences
Typesofopinionmining
Aspectbasedopinions
Evaluation
Examplescripts
Practicalsocialmediaanalytics
•Lessonslearnt
•selectingalgorithms
•combiningapproaches
Descriptionofapracticalevaluation
Combiningdifferentapproachesandalgorithms
Evaluationandperformanceassessment
visualization
Existingtraining:
● Face-to-facetrainingatFraunhofer:http://www.iais.fraunhofer.de/socialmediaanalytics.html
● https://www.diygenius.com/10-free-online-courses-in-social-media-and-inbound-marketing/
D2.6Learningresources3Page25of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
● https://apps.ep.jhu.edu/course-homepages/3523-605.433-social-media-analytics-piorkowski-mcculloh
● https://www.coursera.org/specializations/social-media-marketingExercises:Startingwiththeexamplescriptstheparticipantswillbegivennewdatasets.Theyhavetoselect an analysis technique, perform the analysis and report and interpret the evaluation andvisualizations.
Sourcesforfurtherreading:
• G.F.KhanSevenLayersofSocialMediaAnalytics(2015):MiningBusinessInsightsfromSocialMediaText,Actions,Networks,Hyperlinks,Apps,SearchEngine,andLocationData.Kindle
• T.Schreck,D.Keim(2013),VisualAnalysisofSocialMediaData,• C.C.Aggarwal,C.X.Zhai(2012)MiningTextData.Springer• CDManning,PRaghavan,HSchütze(2008):IntroductiontoInformationRetrieval• C.Biemann,A.Mehler(2015):TextMining:FromOntologyLearningtoAutomatedTextProcessing
Applications• http://www.iais.fraunhofer.de/data-scientist.html• https://www.tensorflow.org/• http://scikit-learn.org/stable/
Descriptionoftopics:
1. Introduction:Differenttypesofsocialmedia(Facebook,Twitter,Instagram,etc.)areconsideredand the utilization by users is discussed. Social media analysis is relevant as many actualdecisionofusers,e.g.buyinggoodsorvotinginelectionsarebasedonsocialmedia.Thiscourseaimsatextractingandsummarizingthestatementsofusersfromsocialmediatogivearealisticimpressionofthepublicviewonrelevantquestions.Differenttypesofanalysisquestionswillbediscussed.Therelevantmachinelearningconceptswillbeintroduced.Finallyanumberofapplicationhighlightswillbepresented.
2. DownloadandPreparation:Inthislecturetechniquesfordownloading,crawlingandmonitoringsocialmediawillbepresented.Asasecondstepthedifferentformat–pdf,text,html,ePUB–havetobeparsedtoextracttheunderlyingtext.Thistextusuallyispre-processedtoextractwords and sentences. Often only specific parts of the documents are required, e.g. htmlnavigation or advertisements have to be excluded. Part-of-speech tagging may be used toexclude specific types ofwords – e.g. functionwords. The approaches are demonstrated byPythonsamplescripts.
3. Classify Social Media Posts: Often we want to categorize the documents according to theircontents, e.g. soccer, politics, recipes, etc. This can be done with high reliability by textclassification procedures. These approaches require a training set with manually classifiedexamples. Logistic regression and support vector machines are introduced as relevanttechniques.Wediscussthetrainingprocessandthephenomenonofoverfitting,whichmaybecontrolledbyregularization.Correspondingperformancemeasuresbasedonanindependenttest set are required. We use up to date libraries like scikit-learn as well as Spark andTensorFlow,whicharereadyforBigDataapplications.TheapproachesaredemonstratedbyPythonsamplescripts.
4. ClusteringWordsandSocialMediaPosts.Mostinterestinginsocialmediaanalyticsistoshowthespectrumofissueswhichareaddressedinposts.Onerelevantanalysisapproachinvestigatesthesimilarityofwordsbyputtingwordsintoahigh-dimensionalsemanticspace.Topicmodelsassumethatthereareanumberofunderlyingthemesinasocialmediasource.Thesetopicsareextracted without any user input (unsupervised) and each word in a document can be
Page26of36EDSAGrantAgreementno.643937
representedasamixtureoftopics.Thedifferentwordembeddingapproach(word2vec)directlyrepresentseachwordasavector inahigh-dimensionalsemanticspace.Both techniquesareintroduced, and their training in the Big Data context is discussed. Both techniquesmay beextendedtodefinethecontentofadocumentaswellasthesimilarityofcompletedocuments.TheapproachesaredemonstratedbyPythonsamplescripts.
5. DetectingNamesandProducts.Aspecificrequirementinsocialmediaanalyticsisthedetectionofentities,i.e.phrasesofaspecifictype.Thesehavehighlyvaryingformandcanonlydetectedfromtheircontext.Hence,weintroducetwotypesofwordsequenceanalysisproceduresusingpre-labelled training data: Conditional Random Fields and Recurrent Neural Networks. Wediscusstheunderlyingassumptionsandtrainingproceduresandassesstheirperformancebyappropriateevaluationmeasures.TheapproachesaredemonstratedbyPythonsamplescripts.
6. OpinionMininginSocialMedia.Opinionmining(orsentimentanalysis)hastheaimtoextractsubjective user statements on specific issues from a social media post. There are differentapproachesrangingfromassigninganopinionscoretoawholedocumenttodetectingopinionsspecifiedforaspecificaspect.Wediscussarangeofanalysistechniquesincludingclassificationandwordsequenceanalysis.TheapproachesaredemonstratedbyPythonsamplescripts.
7. PracticalSocialMediaAnalysis.Socialmediacontentanalysisrequirestheknowledgeofalargetoolboxoftechniqueswithspecificunderlyingmodelsandlimitations.Fromourownprojectexperience,wegivepractical hints on the selectionof techniques, the successionof analysissteps, and their evaluation. Most important is the visualization of results, especially forclusteringapproachesliketopicmodellingorwordembedding.Wedemonstratethisusingt-SNEandGephi.Finally,usersmaydiscusstheirownquestionswiththeinstructors.
5.2 Learningmaterials&deliverymethodsThecourse isofferedbyFraunhoferandconsistsof two face-to-facedays, fullofpresentationswithcompactinformation.Scriptsareshownanddemonstrated,butthereisnotimeforexercises.Belowaresomescreenshotsfromthefirstpresentation.
Figure9:ScreenshotfromthelearningmaterialsoftheSocialMediaAnalyticsmodule.
D2.6Learningresources3Page27of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
Figure10:ScreenshotfromthelearningmaterialsoftheSocialMediaAnalyticsmodule.
Figure11:ScreenshotfromthelearningmaterialsoftheSocialMediaAnalyticsmodule.
5.3 RelevancetocurriculumThecoursecorrespondstomodule13ofthefinalEDSAcurriculum,asthiswaspresentedinD2.3.Itaddressestheanalyticstageatanadvancedlevel.Itconcentratesontextualdataspecificallyfromsocialmedia.Thus,itisanattempttotargetaspecificsector.
5.4 RelevancetodemandanalysisThecourseconcentratesonopensourcetoolsandlibraries.Butitisratherspecialanddoesnottrytoreachparticipantsfromdifferentunits.Also,itdoesnotincludeblendinglearningformats,butthisisgoingtochange.
Page28of36EDSAGrantAgreementno.643937
5.5 FurtherdevelopmentplansFraunhoferhasathree-levelcertificationprogramfordatascientists.In2018,anewcertificate“DataScientistSpecialized inMachineLearning”will beadded. It includes an examinationwhich tests fortheoreticalknowledgeandpracticalskills.Thisexaminationispreparedbyagroupofblendedcourses.Allcoursesstartface-to-faceandendwithapracticalphase,wheretheparticipantsdoexerciseswithbigdatasetsinadistributedmachinelearningplatform.Themandatorycourseisaboutdeeplearningandothercurrentmethodsofmachinelearning.Amongtheoptional,morespecialcoursesisacourseondeeptextanalytics2,firsttobedeliveredinMay2018.Thisblendedcoursewillreplacetheformerpurelyface-to-facecourseon“SocialMediaAnalytics”.
2https://www.bigdata.fraunhofer.de/de/datascientist/seminare/machine_learning_zertifizierung/textverstehen.html
D2.6Learningresources3Page29of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
6. DataExploitationincludingdatamarketsandlicensing
6.1 ModuleoverviewDatascienceisfundamentallyaboutgeneratingimpactfromdata.Impactisachievedwhendataactsasacatalystforchangeeitherforsocial,environmentaloreconomicgain.
Attheheartofdrivingchangeisfindingandtellingstoriesindata,ascoveredinthe“FindingStoriesinData”3course.Thecombinationofthisalongwithstrongcommunicationandleadershipskillscanhelplead tochange inbusinesspractice ineitherprofitability,marketpositioningand/orefficiency.Datascience can inform and bring evidence for new and changing business models and effective dataexploitationcanalsounlocknewandunexpectedmarkets.
This set of 9 online lessons, offered by ODI, looks at how exploiting a world of data can lead tounexpected benefits for businesses and citizens alike. The 9 lessons are divided into 3 sequentialmodulesthatwillguidelearnersthroughthreekeytopicareasindataexploitation,datamarketsandlicensing.
• Course1:Exploitingdatasciencetocreatevalue• Aim:Enableyoutorecognisetheopportunitiestoexploitdatascienceandanalysetherisksof
doingso.o Lesson1:Exploitingdatascienceo Lesson2:Managingriskandrewardo Lesson3:Openinnovationanddatascience
• Course2:Datasciencemeansbusiness• Aim:Enableyoutoanalyseanumberofdifferentbusinessmodelstoexploitdatascience.
o Lesson1:Businessmodelsindatascienceo Lesson2:Pitchingfordatascienceo Lesson3:Datamarketsandlicensing
• Course3:Datascienceecosystems• Aim:Enableyoutoexploitdatascienceindifferentecosystems.
o Lesson1:Embeddingdatasciencewithinabusinesso Lesson2:Exploringthestartupecosystemo Lesson3:Exploitingdatasciencewithinasector
Curated content in the syllabuswill beaccompaniedwithnarration tohelp guide a learner in theirinterpretation of the resources. Additionally, supporting materials and guidance from those in thecommunitywillhelpexplainhowdatasciencehasopenedupnewmarkets.
6.1.1 Course1:ExploitingdatasciencetocreatevalueThissetoflessonslookbroadlyatthewaysthatdatasciencecancreatevalue,howtoidentifytherightopportunitiesandhowtomanagerisk.Bytheendofthismodulelearnersshouldbeabletomaketherightgo/no-godecisionsforexploitingdatascienceintheirownbusinesses.
Exploitingdatascience
Acrosstheworld,peopleareexploitingdatasciencetoinformdecisions,buildnewbusinessesandformnewcollaborations.Theaimofthislessonistohelplearnersunderstandhowdatasciencecreatesvalueandintroducethemtoanumberofcasestudies.Forexample,welookathowTransportforLondonusedWiFiaccesspointstotrackmovementofpassengersthroughthenetworkintheirfirstevershortstudyabletocollection400,000pointsofdataforanalysis.
3http://courses.edsa-project.eu/course/view.php?id=52
Page30of36EDSAGrantAgreementno.643937
Managingriskandreward
Inordertoexploitdatasciencesuccessfully,itisimportanttobeabletoidentifythepotentialrisksandconsidersolutionstomanageproblemsthatmayarise.Oneofthemainconcernsisthatdatascienceisblurring“disciplinaryboundariesand fieldwithlonghistoriesofethical reviewarebeing inundatedwith work from fields with no history of review and indeed active movements against suchrequirements.”4Aswellasethics,thislessonlooksatopportunityandlegalriskandintroducescasestudiesthatfoundthemselvesinhotwater.Themoduleconcludeswithsometoolsandguidesthatwillhelpmitigateriskswhenexploitingdatascience.
Openinnovationanddatascience
Datascienceblursdisciplinaryboundariesandinvolvesexpertsfrommanyfields.Oftenonecompanywillnothavealltherequiredskillsinhousetorapidlyexploitdatascience.Openinnovationisakeyenablertofastinnovation.Openinnovationisacollaborativeapproachtocreatinginnovativeproductsandservices.Beingcollaborativewiththecommunitycanalsohelpimproveproductreputationorhelpidentifyrisksearlier.Thelessonlooksathowopeninnovationplaysakeyroleindataexploitationalongwithkeycasestudies.
6.1.2 Course2:DatasciencemeansbusinessBuildingfromthefirstcourse,thiscourselooksatwaystoembedinnovationfromdatasciencewithinanexisting(ornew)businessmodel.AkeypartofBusiness Intelligence ishowtousedata todriveforwardabusinessmodel. This set of lessons looks atwheredata science fits indifferentbusinessmodelsandhowtopitchfordatascienceagainstabusinessmodel.Furtherthiscourselooksathowlicensing is a core part of the data market and how different models can be exploited both forcommercialgainandforinnovation.
Businessmodelsindatascience
Manybusinessesarenowusingdatascienceaspartoftheirbusinessmodel.Inordertosuccessfullyexploitdatascience,organisationsneedtounderstandhowdataiscreatingvaluefortheirbusiness.Thislessons looks at how to create a value proposition and the link between this and various businessmodels.Theaimofthislessonistoenablelearnerstorealisehowtoexploitdatasciencedeeplywithinabusinessmodel.
Pitchingfordatascience
Onceapotentialneworinnovativevaluepropositionandbusinessmodelhavebeenbuilt,thenextkeystageistocommunicatethiseffectivelyeitherwithinanorganisationortoexternalfunders.Havingagoodpitchisessentialforgaininginterestfromorganisationleaders,investorsandpotentialpartners.Thishelpsestablishandgrowdatasciencebusinesses.Thislessonlooksathowtocreateaneffectivepitchfordatascienceledinnovation.
Datamarketsandlicensing
Dataisanewrawmaterial.Akeyroleofadatascientististorefinethismaterialtocreatesomethingmorevaluable.Atthesametimerawmaterialsatvariousstagesofrefinementare tradedorsold toothers.Thislessonlooksatsomeofthemarketssurroundingdataandhowtheserelatedtoaspectrumofopportunitiesandbusinessmodels.
6.1.3 Course3:DatascienceecosystemsTheaimofthiscourseistoenableparticipantstoexploitdatascienceindifferentecosystems.Withinthe field of data science, there aremanydifferent levels of ecosystem. These range froma internal
4https://www.forbes.com/sites/kalevleetaru/2017/10/16/is-it-too-late-for-big-data-ethics/#aa20de23a6d1
D2.6Learningresources3Page31of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
businessecosystemthat ismadeupofmanydifferentdepartmentsand teams, toall theactorsandstakeholderswithinasectororindustry.
Embeddingdatasciencewithinabusiness
Moderndatascientistsoftenfacethechallengeofhowtoembeddatasciencewithinabusiness.Forabusinesstofullyembracedatascience,itmustbeattheheartofallbusinessactivities.Inthismodule,weexploretheimportanceofadatasciencestrategyandhowtodevelopadatasciencestrategy,aswellashowtoreduce frictionand increasebuy-in.Thiswillensure thatdatasciencesitsat theheartofbusinessactivities.
Exploringthestartupecosystem
Adatascientistmightworkwithinanorganisationortheymighthaveanewandnovelideathattheywishtobringtomarket.Fordatascientistsenteringthestartupecosystem,itcanbeaverydifferentexperiencefromtheecosystemofanindividualbusiness.Inthismoduleweexplorethedifferentstagesofstartupsthatexistwithin theecosystem, therolesofdifferent fundingtypesandsources,andtheadditionalsupportthatisavailabletothosenavigatingthedatasciencestartupecosystem.
Exploitingdatascienceinasector
Amoderndatascientistisexpectedtocombinedomainexpertisewithmoretraditionaldatascienceskills. Knowing how to exploit data science within a sector is an essential skill for a modern datascientist.Inthislesson,weexploredatascienceecosystemswithinsectors,drawingonexamplesfromtheentertainment,transportandphysicalactivitysectors,aswellasthedifferencebetweendatascienceinaclosed,sharedoropensectorecosystem.
6.2 Learningmaterials&deliverymethodsTheentire3coursecurriculumencompassing9lessonsisofferedbyODIandisavailableentirelyonlineforanyonetoengagewithfreelyatanytimewithouttheneedforenrolment.Althoughthelessonsarenumberedinorder,learnerscanchoosetheirownpathiftheychoose.Eachlessonispresentedintheformof an interactivewebpage following the latest instructionaldesign theoryusing an eLearningplatformvoted themost innovative4 year in a rowat the learnXawards5.The eLearningplatformprovidesacompletelycrossplatformresponsivewayofdeliveringinteractivelearning.Eachlessonisbackedwithlearningoutcomeswhichareassessedthroughtheuseofinteractivequestioningattheendofeach.
Eachlessonfeatures:
● Practicalstepsforaccomplishingkeytasks● Case-studies● Knowledge-checksandquizzes
6.3 RelevancetocurriculumSofar,theEDSAcurriculumhasprovidededucationalresourcesofskillsandmethodologieswhichcovermanyskillsareasindepth.Forexample,thereareatleast5coursesintheareasof“Bigdatatechnologiesandsystems”and“Computingmethodologies”.However,currentlytherearenocourseson“Businessanalysisorganisationandmanagement”andonlyoneintheareasof“Businessanalysisandenterpriseorganisation”and“Businessprocessmanagement”.ByfocussingspecificallyonBusinessIntelligenceandexploitingdatascienceinbusinessesthiscoursehasbeenspecificallydesignedtocoversuchareas.
5https://www.adaptlearning.org/index.php/about/
Page32of36EDSAGrantAgreementno.643937
6.4 RelevancetodemandanalysisThekeyoutcomesofthedemandanalysis(D1.2)providedaclearindicationthattherewasaneedforbroadintroductoryleveltrainingindatascience.8keycurriculumareaswereidentifiedofwhich6focuson“hardskills”:
1. BigData2. Machinelearningandprediction3. Datacollectionandanalysis4. Mathsandstatistics5. Interpretationandvisualisation6. Advancedcomputingandprogramming
Thisleavestwoareasofthecurriculumwhicharemore“softskills”based:
1. Businessintelligenceanddomainexpertise2. Opensourcetoolsandconcepts
This course has been designed focus on how data science can unlock value for new and existingbusinesses.SincethestartoftheEuropeanDataScienceAcademyprojectin2015datasciencehasbeengrowingininterest,whilerelatedareassuchasBigData,AIandMachinelearningcontinuetoalsoshowaconstantfocus.
Figure12:GoogletrendanalysisforBigdataandDataScience.
Importantly this shows thatdata science, as adiscipline is gaining in importance.Data science is adisciplinethatbringsimportantscientificandbusinessapproachestotoolsandtechniqueslikemachinelearninganditisimportanttothusalsofocusonthesoftskillsonwhatapplyingdatasciencemeanstobusinessesandhowmanagerscanmakethemostofdatascienceinexistingbusinesses.
Big dataData science
D2.6Learningresources3Page33of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
Figure13:TheEDSAdashboardskillschart.
ThecurrentEDSAdashboardalsoshowsthegrowinginterestinexploitingopendatainbusinesses.The4thmostrequiredskillinjobsiscurrentlybusinessintelligence.
Businessintelligenceisfundamentallyaboutthestrategiesandtechnologiesthatallowbusinesstobuildinsightfromdata.InordertoexploitBIrequiresknowledgeofboththehardskills(technologies)andthesoftskills (strategies) tounderstandthe implicationsofsuchdataanalysis.Thus,datascientistsrequire some level of strategic knowledge while existing managers need to understand theopportunitiesandrisksofapplyingdatascienceandotherBItechniques.
Thisrequirementwasalsoemphasised inthedemandanalysis(D1.2).D1.2pointedoutareportbyMcKinseyonBigDataandthelackofmanagerswillingittakerisksonemployingdatascientists6.
D1.2recommendedthatcurricularneeds toaddress thechallengesofdatasciencebeingembeddedwithinbusinessatalllevelssuchthatpotentialismaximised.Thismodulehasbeenspecificallydesignedtohelpbridgethegapbetweentheharddatascienceskillsandtheirimplicationsonbusinesses.Bytheend of themodule learnerswill have a better business intelligence of the opportunity and risks ofexploitingdatascience,bothwithinexistingorganisationsandtobuildnewones.
Thedemandanalysisstudyevaluationreport(D1.4)makesanumberofrecommendationsforguidingthe development of the EDSA curricular. Table 1 from this study makes seven recommendations,summarisedas:
1. Holistictrainingapproach2. Opensourcebasedtraining3. Softskillstraining4. Basicdataliteracyskills5. Blendedtraining6. Datascienceskillsframework7. Navigationandguidance
Thedataexploitationincludingdatamarketsandlicensingcoursefocusesonmanyoftheseareas,inparticular.
1.Holistictrainingapproach
Most professional courseproviders offer very focused courses that go intodepth about howparticulartools,suchasHadoop,R,PythonorSPSS,canbeusedbybusinesses.
6Bigdata:Thenextfrontierforcompetition-http://www.mckinsey.com/features/big_data
Page34of36EDSAGrantAgreementno.643937
- EDSAstudyevaluationreport(D1.4,4.2.1,p65)Thedataexploitationincludingdatamarketsandlicensingcoursefocusesonsoftskillsthatareeasilytransferableintodifferentdomains.Thecourseintroducesavarietyofmodellingtoolsthatcanbeusedinmanydomainsandsectors;forexample,thelesson“Businessmodelsindatascience”introducesthebusinessmodelcanvaswhichisawidelyusedtoolforidentifyinggreatbusinessopportunities.
Thisholisticapproachwillhelpequipamoderndatascientistwiththetheoreticalknowledgeandsoftskillsrequiredtohelpintegratedatascienceintomanydifferentmarkets,sectorsandbusiness.
2.Opensourcebasedtraining
Themajorityofskillsdevelopmentindatascienceisachievedthroughthesemeans[opensourcetoolsandresources].
- EDSAstudyevaluationreport(D1.4)Thedataexploitationincludingdatamarketsandlicensingcoursewillbeofferedasafreeinteractiveonlinecourse,builtitselffromopensourcesoftware.Additionally,allpracticalexercisesavailablewithinthecourse(aswellasthemajorityofthoselinkedtoexternally)willbebaseduponfreelyavailableopensourcetoolsandmodels.Thisgiveslearnersthemaximumpotentialforself-studyandlearningwithoutanyfinancialorotherbarrierstousingtools.
3.Softskillstraining
Data scientists are often hired with high expectations regarding their abilities to transformbusinesstacticsandstrategies;thussoft-skillssuchastheseareseenasdesirableandneedgreaterfocusindatasciencetraining.
- EDSAstudyevaluationreport(D1.4) ThedataExploitationincludingdatamarketsandlicensingcourseisentirelyfocussedontheaspectsofhowtotransformtheirapplicationsintobusinesstacticsandstrategies.Thethreecoursesalltakeakeybusinessfocusandlooknotjustathowdatasciencefitswithindifferentbusinessmodelsbutalsohowthiscanhelpchangeasectorandwhatstrategiestotake.
Thishoweverrequiresstrongpresentationandcommunicationskillsinordertoinfluenceseniormanagementandotherfunctionaldepartmentstomaketherightdecisionsbasedondata.
- EDSAstudyevaluationreport(D1.4)Inthesecondpartofthecourseon“Datasciencemeansbusiness”therearelessonsspecificallyonhowtocreateandpitchavaluepropositiontobusinessleadersinordertobringaboutchange.Theseskillsareessentialforleadingdatascienceinnovators.
4.Basicdataliteracyskills
Onekeyaspectofdataliteracyisunderstandingwhendatashouldnotbeusedtoinformadecision.Specifically,inthefirstpartoftheDataExploitationincludingdatamarketsandlicensingcoursethereisalessonentitled“Managingriskandreward”.Thislooksspecificallyabouttherisksofdatascienceblurringboundariesbetweendisciplinesthatresultsinalossofexpertiseandrigorofresultsthatcancreateanegativeimpact.
5.Navigationandguidance
EachlessoninthedataExploitationincludingdatamarketsandlicensingcoursehasbeendesignedtostandalone.Theintroductorycontentofeachlessonoutlinestheskillsthatwillbeacquiredandhowtheyarerelevanttoamoderndatascientist.Linkswillbemadetootherlessonswherenecessary.Themainnavigationpageforthelessonswillallowuserstochoosetheirpathwaythroughthelessons.Whileitwillberecommendedthatusersfollowthelessonsinorder,interlinkingthelessonswillmeanthisisnotstrictlynecessarytoaccessrelevantcontent.
Attheendofeachlesson,learnerswillbeguidednotonlytothenextlessonbutalsotoanyexternalexercises,contentandcoursesthroughwhichtheycanexpandtheirknowledge.
D2.6Learningresources3Page35of36
2018©Copyrightlieswiththerespectiveauthorsandtheirinstitutions.
6.5 FurtherdevelopmentplansThiscoursewillbemaintainedandfurtherdevelopedintwokeyways:
1. Increase the number of external links to beginner level and related data science courses: Asdiscovered by the demand analysis, well-structured learning for data science beginners islacking.TheODIwillcontinuetomaintainanyrelevantlinksinthelearningresourcestoensurerelevancetolearners.
2. Repackageasacommercialproduct:ThecoursewillbeofferedasacommercialproductbyODI,asdescribedintheODIexploitationplanpresentedintheProjectExploitationReport(D5.4).
Page36of36EDSAGrantAgreementno.643937
7. ConclusionThe EDSA courses portfolio has incorporated awide range of learning resources, either offered byprojectpartnersorbythirdparties.ThisdeliverablehaspresentedthemodulesthathavebeenaddedtotheEDSAcoursesportfoliointhethirdyearoftheproject,inordertocoverthefollowing4topicsofthefinalEDSAcurriculum.:
● Programming/ComputationalThinking(RandPython)● DataIntensiveComputing● SocialMediaAnalytics● DataExploitationincludingdatamarketsandlicensing
In particular, we have presented an overview of the objectives and learning outcomes of the newmodules,wehaveidentifiedthetypesoflearningmaterialsusedandhowthesearedelivered,andwehavediscussedhowthenewmodulesarerelatedtotheEDSAcurriculumandthedemandanalysis,aswellastheplansinplacefortheirfurtherdevelopment.