Privacy Overview and Data Mining
CSC 301 Spring 2018
Howard Rosenthal
Course Notes:
• Much of the material in the slides comes from the books below and their associated support materials, as well as many of the references at the class website
  • Baase, Sara and Henry, Timothy, A Gift of Fire: Social, Legal, and Ethical Issues for Computing Technology (5th Edition), Pearson, March 9, 2017, ISBN-13: 978-0134615271
  • Quinn, Michael, Ethics for the Information Age (7th Edition), Pearson, Feb. 21, 2016, ISBN-13: 978-0134296548
Lesson Goals
• Basic principles in privacy
• Defining privacy
• Threats to privacy
• Impacts of technology on privacy
• Securing personal privacy
• Technology excursion: Data Mining
There Are Many Aspects To Security and Privacy
What Is Privacy?
• Is privacy a natural right?
• Is privacy a type of property?
• Invading a person’s privacy can be a major coercive force
• Privacy used to be fairly simple
  • Your home could not be invaded, nor your property seized, without due process
• Today your private information is everywhere
  • On the net
  • On your phone
  • On your computer
  • In the cloud
  • In your employer’s databases
  • With the government
• Even if the people you give information to do not misuse it, the information is more susceptible to theft via hacking or other mischief than ever before
  • In 2015 the federal government’s Office of Personnel Management was hacked, and detailed information on everyone with a security clearance was stolen
  • The government accepted very little responsibility for this theft
There Are Three Key Aspects To Privacy
• Freedom from intrusion
• Control over information about oneself
• Freedom from surveillance (physical, electronic, etc.)
Our Privacy Is Always Being Threatened
• There are many threats to our privacy
  • Intentional use or misuse of information by businesses or government
  • Unauthorized release by insiders who maintain the information
  • Theft of information by criminals or hostile governments
  • Inadvertent leakage through negligence or carelessness
  • Our own actions, such as posting too much data on the Internet, which others then use for benign (B) or malicious (M) purposes
    • Give to one charity and ten others will come knocking (B)
    • A list of “off color” movies you may have watched (M), used to discredit you
    • Divorce proceedings (M), sometimes used by politicians
    • Stolen financial data (M), used to open loans, buy homes, etc., all in your name
New Technology Creates Many New Opportunities To Invade Our Privacy
• Some of these threats combine low-tech techniques, such as eavesdropping or looking over a shoulder, with high-tech techniques
  • Government and private databases
  • Sophisticated tools for surveillance and data analysis
  • Vulnerability of data
• Search engines collect many terabytes of data daily.
  • Data is analyzed to target advertising and develop new services.
  • Who gets to see this data? Why should we care?
  • This same data, when aggregated, creates a detailed biography of you
  • Data collected for one purpose will find other uses
  • Assume that everything in cyberspace is recorded and replicated
• You create new potential security leaks every day
  • Facebook
  • E-mails
  • Texts
  • Map instructions
  • Twitter
• If information is on a public Web site, it is available to everyone
  • If you post pictures of your vacation while you are on it, you may come home to an empty house
Re-identification
• Re-identification is the process of identifying individuals from anonymous data.
• Re-identification has become much easier due to the quantity of information and the power of data search and analysis tools
  • A collection of small items can be aggregated to provide a detailed picture
  • Your search history could identify who you are.
  • Working backwards from the metadata is often possible with enough computing power and data.
  • Reporters often use anonymous data as they work towards identifying individuals.
• If information is on a public Web site, it is often available to everyone
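A minimal sketch of how re-identification by aggregation can work, assuming two made-up data sets: an "anonymized" release with names removed and a public list that still carries names. All records, field names, and the choice of ZIP code, birth date, and sex as linking fields are hypothetical and only illustrate the idea of joining small items into a detailed picture.

```python
# Hypothetical "anonymized" release: names removed, other fields kept.
anonymized_records = [
    {"zip": "90210", "birth_date": "1975-03-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "90210", "birth_date": "1982-11-02", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "10001", "birth_date": "1975-03-14", "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public list that still includes names.
public_list = [
    {"name": "Alice Example", "zip": "90210", "birth_date": "1975-03-14", "sex": "F"},
    {"name": "Bob Example", "zip": "90210", "birth_date": "1982-11-02", "sex": "M"},
    {"name": "Carol Example", "zip": "60601", "birth_date": "1990-07-21", "sex": "F"},
]

# Fields that are harmless alone but identifying in combination (an assumption).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def reidentify(anonymous, public):
    """Link anonymous rows to names when the combined fields match exactly one person."""
    index = {}
    for person in public:
        key = tuple(person[field] for field in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for row in anonymous:
        key = tuple(row[field] for field in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = reidentify(anonymized_records, public_list)
```

In this toy data two of the three "anonymous" rows link to exactly one named person, which is the sense in which removing names alone does not make data anonymous.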
Personal Security and Privacy Are Often Threatened By Our Own Actions
Everything You Access May Be Monitored
• Search engines
  • May record all your searches
  • If you search for a book on Amazon you’ll get e-mails about that book or others every few days
  • Some of your searches you may want to keep private
    • Looking for a new job
    • Searching for certain specific products
    • Medical searches
• Smartphones
  • Are often transmitting location data
    • Great if a phone is lost or stolen
    • Horrible if a house thief gets the data
  • Passwords and codes for key accounts are often stored without your knowledge and then uploaded to the cloud with other data
    • If the cloud is hacked your information may be on the market without your knowledge
  • Contact lists can be compromised
  • Photos may be gathered and subjected to various forms of analysis
• Software
  • Many pieces of software record all types of data
  • This data may ultimately be collected and analyzed
  • Sometimes it simply sits forgotten until someone decides to see what’s there
Managing Personal Data – Terminology and Principles (1)
• Personal information is any information relating to an individual person
• Invisible information gathering
  • Data collected without your knowledge
  • Always read the fine print
    • How often do you click "agree" when downloading?
  • This is an ethical issue
• Cookies
  • Files a Web site stores on a visitor’s computer
• Secondary use
  • Use of personal information for a purpose other than the purpose for which it was provided
    • Sale of consumer information to marketers or other businesses
    • Use of information in various databases to deny someone a job
    • Use of vehicle registrations by the IRS to find persons with high incomes
    • Use of text messages to prosecute for a crime
    • Using your information in an illegal manner after stealing or gleaning it from legitimate sources
Managing Personal Data – Terminology and Principles (2)
• Data mining
  • Searching and analyzing masses of data to find patterns and develop new information or knowledge
• Computer matching
  • Combining and comparing information from different databases (using social security number, for example) to match records.
• Computer profiling
  • Analyzing data to determine characteristics of people most likely to engage in a certain behavior
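Computer matching and computer profiling can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two record sets, the field names, the SSN values, and the profiling rule (an expensive vehicle combined with a low reported income) are all invented for this sketch.

```python
# Hypothetical records from two unrelated databases that share an SSN field.
vehicle_records = [
    {"ssn": "000-00-0001", "name": "A. Driver", "vehicle": "luxury sedan"},
    {"ssn": "000-00-0002", "name": "B. Driver", "vehicle": "compact"},
]
tax_records = [
    {"ssn": "000-00-0001", "reported_income": 30000},
    {"ssn": "000-00-0003", "reported_income": 85000},
]


def match_records(left, right, key="ssn"):
    """Computer matching: combine rows from two databases that share the same key."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        other = right_by_key.get(row[key])
        if other is not None:
            merged.append({**row, **other})
    return merged


def flag_profile(rows, income_limit=50000):
    """Computer profiling: flag people who fit a (hypothetical) pattern of interest."""
    return [
        row["name"]
        for row in rows
        if row["vehicle"] == "luxury sedan" and row["reported_income"] < income_limit
    ]


matched = match_records(vehicle_records, tax_records)
flagged = flag_profile(matched)
```

Neither database reveals much alone; the match on a shared identifier is what makes the profile possible, which is why a common key such as a social security number matters so much for privacy.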
Informed Consent Provides An Ethical Framework For Information Collection
• Informed consent
  • You must agree before your information can be collected or used
  • Consent can be used to pressure you if you are denied a service unless you agree to share this data
  • LoJack collects information about your car’s location continuously. Was this informed consent?
  • The AAA tried collecting information by asking you if you’d like to hook data collectors into your car, and then reported that data to the insurance side of the house
• Two common forms for providing informed consent are opt out and opt in:
  • In opt out, a person must request (usually by checking a box) that an organization not use information.
  • In opt in, the collector of the information may use information only if the person explicitly permits use (usually by checking a box).
• Discussion question:
  • Have you seen opt-in and opt-out choices? Where? How were they worded? Were any of them deceptive?
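The practical difference between opt in and opt out is what happens when a person does nothing. The sketch below encodes the two definitions above; the function name and the policy strings are assumptions made for illustration.

```python
def may_use_data(policy, box_checked):
    """Return True if the organization may use the person's information.

    policy "opt_in":  use is allowed only if the person checked the box.
    policy "opt_out": use is allowed unless the person checked the box.
    """
    if policy == "opt_in":
        return box_checked
    if policy == "opt_out":
        return not box_checked
    raise ValueError(f"unknown policy: {policy}")


# A person who never touches the form (box left unchecked):
default_under_opt_in = may_use_data("opt_in", box_checked=False)
default_under_opt_out = may_use_data("opt_out", box_checked=False)
```

Under opt in, inaction protects the person; under opt out, inaction permits use.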
Fair Information Principles
• A basic set of principles for businesses to handle data in an ethical way
  • Inform people when you collect data
  • Collect only the data that is needed
  • Make opt in your default
  • Offer opt-out methods that can be used at any time
    • It is hard to ensure that all data is deleted if you opt in and then opt out
  • Keep data only as long as it is needed
  • Maintain accuracy of data
  • Protect the data. Use all reasonable security methods to do so.
  • Develop policies for responding to law enforcement requests
• Many government organizations are developing guidelines
  • FTCFairInformationPracticePrinciples.pdf
Data Mining
http://www.tutorialspoint.com/data_mining/dm_quick_guide.htm
What Is Data Mining?
• Data mining is defined as extracting information from huge sets of data.
  • In other words, data mining is the procedure of mining knowledge from data.
  • Data mining can integrate many different data sets
• The information or knowledge extracted by data mining can be used for any of the following applications
  • Profiling (this is where privacy really gets involved)
  • Customer Retention
  • Pattern Analysis
  • Market Analysis
  • Fraud Detection
  • Production Control
  • Science Exploration
Data Mining Tasks
• Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in data mining:
  • The Descriptive Function deals with the general properties of data in the database.
    • Class/Concept Description
    • Mining of Frequent Patterns
    • Mining of Associations
    • Mining of Correlations
    • Mining of Clusters
  • Classification and Prediction is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. This derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms:
    • Classification (IF-THEN) Rules
    • Decision Trees
    • Mathematical Formulae
    • Neural Networks
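As a minimal sketch of a derived model expressed as an IF-THEN rule, the example below learns a single threshold from labeled training data and then uses it to classify an object whose label is unknown. The training rows, the labels "big" and "budget", and the one-feature threshold search are assumptions chosen to keep the illustration small; real classifiers are far richer.

```python
# Hypothetical training data: (annual_spend, known_class_label)
training_data = [
    (120, "budget"), (300, "budget"), (450, "budget"),
    (900, "big"), (1500, "big"), (2200, "big"),
]


def learn_threshold(rows):
    """Pick the spend threshold that misclassifies the fewest training rows."""
    values = sorted(value for value, _ in rows)
    # Candidate thresholds are the midpoints between neighboring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(
            ("big" if value > threshold else "budget") != label
            for value, label in rows
        )

    return min(candidates, key=errors)


threshold = learn_threshold(training_data)


def classify(annual_spend):
    # The derived model as a rule:
    # IF annual_spend > threshold THEN "big" ELSE "budget"
    return "big" if annual_spend > threshold else "budget"
```

For example, `classify(1000)` applies the learned rule to a customer who was not in the training data.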
Descriptive Tasks In Data Mining (1)
• The Class/Concept Description refers to the data to be associated with the classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways:
  • Data Characterization refers to summarizing the data of the class under study. This class under study is called the Target Class.
  • Data Discrimination refers to the mapping or classification of a class with some predefined group or class.
• Mining of Frequent Patterns looks for patterns that occur frequently in transactional data. The kinds of frequent patterns include:
  • A Frequent Item Set is a set of items that frequently appear together, for example, milk and bread.
  • A Frequent Subsequence is a sequence of patterns that occurs frequently, such as purchasing a camera followed by a memory card.
  • A Frequent Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.
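A frequent item set can be found by simple counting. The sketch below counts every pair of items that appears together in a basket and keeps the pairs whose support (the fraction of baskets containing the pair) meets a minimum. The transactions and the 50% minimum support are made up for illustration, and this brute-force pair count is a simplification rather than a full frequent-pattern algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data: one set of items per purchase.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
    {"eggs", "tea"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs whose share of baskets is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair one canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {
        pair: count
        for pair, count in counts.items()
        if count / len(baskets) >= min_support
    }


frequent = frequent_pairs(transactions, min_support=0.5)
```

With this data only the milk and bread pair clears the threshold, matching the example given above.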
Descriptive Tasks In Data Mining (2)
• Mining of Associations
  • This refers to the process of uncovering relationships among data and determining association rules.
  • Associations are used in retail sales to identify items that are frequently purchased together, helping to identify potential buyers
    • For example, a retailer generates an association rule showing that milk is sold with bread 70% of the time, while biscuits are sold with bread only 30% of the time.
• Mining of Correlations
  • Additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to determine whether they have a positive, negative, or no effect on each other.
  • We also want to understand whether there is actual causation
• Mining of Clusters
  • A cluster is a group of similar objects.
  • Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
  • Can group by gender, age, home location, language, ….
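The association rule in the bread example can be reproduced with two standard measures: confidence (how often the second item appears in baskets that contain the first) and lift (confidence divided by how common the second item is overall, which indicates positive or negative correlation). The baskets below are invented so that the 70% and 30% figures come out; the function names are illustrative.

```python
# Hypothetical baskets built so that 10 contain bread: 7 with milk, 3 with biscuits.
transactions = (
    [{"bread", "milk"}] * 6
    + [{"bread", "milk", "biscuits"}] * 1
    + [{"bread", "biscuits"}] * 2
    + [{"bread"}] * 1
    + [{"milk"}] * 2  # baskets without bread
)


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets with the antecedent, the fraction that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Above 1 suggests positive correlation, below 1 negative, near 1 none."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


bread_to_milk = confidence({"bread"}, {"milk"}, transactions)
bread_to_biscuits = confidence({"bread"}, {"biscuits"}, transactions)
bread_milk_lift = lift({"bread"}, {"milk"}, transactions)
```

In this toy data the rule bread → milk has high confidence but a lift slightly below 1, because milk is common in all baskets. That gap is one reason correlation analysis is run after association mining, and neither measure by itself shows causation.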
Classification and Prediction Functions
• Classification: predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are known.
• Prediction: used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.
• Outlier Analysis: outliers are data objects that do not comply with the general behavior or model of the available data.
• Evolution Analysis: describes and models regularities or trends for objects whose behavior changes over time.
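Two of these functions are easy to sketch with the standard library alone: prediction by simple linear regression, and outlier analysis by distance from the mean. The sample values, the least-squares line fit, and the cutoff of two standard deviations are assumptions chosen for illustration; they are not the only ways to do either task.

```python
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def find_outliers(values, z_limit=2.0):
    """Values more than z_limit population standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > z_limit]


# Prediction: estimate an unavailable numeric value (month 6) from known months.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]
slope, intercept = fit_line(months, sales)
predicted_month_6 = slope * 6 + intercept

# Outlier analysis: one reading does not fit the general behavior of the rest.
readings = [10, 11, 9, 10, 12, 10, 50]
outliers = find_outliers(readings)
```

The regression predicts a number rather than a class label, which is the distinction drawn above between prediction and classification.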
Data Warehousing
• Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making.
• Data warehousing involves data cleaning, data integration, and data consolidation. To integrate heterogeneous databases, we have the following two approaches:
  • Query Driven Approach
  • Update Driven Approach
Query Driven Approach
• This is the traditional approach to integrating heterogeneous databases.
  • Builds wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.
• The process of the Query Driven Approach
  • When a query is issued on the client side, a metadata dictionary translates the query into one or more queries appropriate for the individual heterogeneous sites involved.
  • These queries are then mapped and sent to the local query processors.
  • The results from the heterogeneous sites are integrated into a global answer set.
• Advantages
  • The government doesn’t get to keep a large database of information on permanent file
  • Don’t need to maintain a large IT infrastructure
• Disadvantages
  • The Query Driven Approach needs complex integration and filtering processes.
  • It is very inefficient and very expensive for frequent queries.
  • This approach is expensive for queries that require aggregations (constant regrouping) of data
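The query-driven process can be sketched as a mediator that translates one global query for each source at query time and merges the results. Everything here is hypothetical: the two in-memory "sources", their differing field names, and the metadata dictionary are stand-ins for real heterogeneous databases and wrappers.

```python
# Two hypothetical heterogeneous sources with different local field names.
SOURCES = {
    "A": [{"cust_name": "Ann", "city": "Boston"}, {"cust_name": "Raj", "city": "Austin"}],
    "B": [{"fullName": "Mei", "location": "Boston"}],
}

# Metadata dictionary: maps global field names to each source's local names.
METADATA = {
    "A": {"name": "cust_name", "city": "city"},
    "B": {"name": "fullName", "city": "location"},
}


def local_query(source_id, field, value):
    """Wrapper: run the translated query on one source, return rows in the global schema."""
    mapping = METADATA[source_id]
    local_field = mapping[field]  # translate the global field name
    rows = [row for row in SOURCES[source_id] if row[local_field] == value]
    return [
        {global_name: row[local_name] for global_name, local_name in mapping.items()}
        for row in rows
    ]


def mediator_query(field, value):
    """Mediator: send the query to every source and merge a global answer set."""
    answer = []
    for source_id in SOURCES:
        answer.extend(local_query(source_id, field, value))
    return answer


boston_customers = mediator_query("city", "Boston")
```

Nothing is stored centrally, which is the privacy advantage noted above, but every query pays the full cost of translation and merging, which is the efficiency disadvantage.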
Update Driven Approach
• Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier.
• In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse.
  • This includes data scrubbing, the process of validating data for correctness in advance
  • This information is available for direct querying and analysis.
• Advantages
  • This approach provides high performance.
  • The data can be copied, processed, integrated, annotated, summarized and restructured in the semantic data store in advance.
    • In other words, we store data in the way(s) we want to look at it
  • Query processing does not require an interface with the processing at the local original data sources.
    • Much less intrusive and resource intensive to pull the data once, rather than whenever you want to query
• Disadvantages
  • Must maintain a large infrastructure to import, store and maintain data
  • Privacy concerns, since the government now has access to so much data
    • The whole debate on the Patriot Act centered around whether or not the government could continuously collect and store metadata from the ISPs and cell/land-line phone providers
    • A political/privacy argument conflicted with a technical argument
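By contrast, the update-driven approach loads and scrubs the data once, in advance, and answers later queries from the warehouse alone. The sketch below uses an in-memory SQLite database as a stand-in warehouse; the source rows, the `sales` table, and the scrubbing rules are assumptions made for illustration.

```python
import sqlite3

# Hypothetical raw rows from two sources: (name, city, amount as text).
source_a = [("Ann", "Boston", "120.50"), ("Raj", "Austin", "n/a")]
source_b = [("Mei", "boston ", "80"), ("Ann", "Boston", "19.50")]


def scrub(row):
    """Data scrubbing: validate one raw row; return a cleaned tuple or None."""
    name, city, amount = row
    try:
        value = float(amount)
    except ValueError:
        return None  # unusable amount, so the row is rejected
    return (name.strip(), city.strip().title(), value)


# Integrate in advance: clean every source row and store it in the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (name TEXT, city TEXT, amount REAL)")
cleaned = [row for row in map(scrub, source_a + source_b) if row is not None]
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
warehouse.commit()

# Later queries hit only the warehouse, never the original sources.
totals = dict(warehouse.execute("SELECT city, SUM(amount) FROM sales GROUP BY city"))
```

Aggregations are cheap here because the integration work was done up front. The trade-off is the one listed above: a large, permanent copy of the data now exists in one place.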
Data Warehousing and Data Mining
• Online Analytical Mining integrates with Online Analytical Processing to discover knowledge across multidimensional databases.
On-line Analytical Mining
• On-line Analytical Mining (OLAM) has the following important attributes
  • High quality of data in data warehouses
    • Data mining tools need to work on integrated, consistent, and cleaned data, which requires costly preprocessing.
    • The data warehouses constructed by such preprocessing are valuable sources of high-quality data for OLAP and data mining as well.
  • A complex information processing infrastructure surrounds each data warehouse
    • Information processing infrastructure refers to accessing, integration, consolidation, and transformation of multiple heterogeneous databases, web-accessing and service facilities, and reporting and OLAP analysis tools.
  • On-line Analytical Processing (OLAP)-based exploratory data analysis
    • Exploratory data analysis is required for effective data mining.
    • OLAP provides facilities for data mining on various subsets of data and at different levels of abstraction.
  • Online selection of data mining functions
    • Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.
Steps In Data Mining
• Data Cleaning
  • Noise and inconsistent data are removed.
• Data Integration
  • Multiple data sources are combined.
• Data Selection
  • Data relevant to the analysis task are retrieved from the database.
• Data Transformation
  • Data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
• Data Mining
  • Intelligent methods are applied in order to extract data patterns.
• Pattern Evaluation
  • Data patterns are evaluated.
• Knowledge Presentation
  • Knowledge is represented, often graphically
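The seven steps above can be lined up as a small pipeline, one function per step. This is only a toy sketch: the two logs, the item names, and the very simple "mining" step (counting identical baskets) are invented to show how each stage hands its output to the next.

```python
from collections import Counter

# Hypothetical (customer, item) logs from two sources; None marks a bad record.
store_log = [("ann", "milk"), ("ann", "bread"), ("raj", None), ("raj", "bread")]
web_log = [("mei", "milk"), ("mei", "bread"), ("ann", "tea"), ("raj", "milk")]


def clean(rows):  # 1. Data cleaning: drop rows with missing items
    return [row for row in rows if row[1] is not None]


def integrate(*sources):  # 2. Data integration: combine the sources
    return [row for source in sources for row in source]


def select(rows, items):  # 3. Data selection: keep items relevant to the task
    return [row for row in rows if row[1] in items]


def transform(rows):  # 4. Data transformation: consolidate to one basket per customer
    baskets = {}
    for customer, item in rows:
        baskets.setdefault(customer, set()).add(item)
    return baskets


def mine(baskets):  # 5. Data mining: count how often each basket occurs
    return Counter(frozenset(basket) for basket in baskets.values())


def evaluate(patterns, min_count):  # 6. Pattern evaluation: keep frequent patterns
    return {p: n for p, n in patterns.items() if n >= min_count}


def present(patterns):  # 7. Knowledge presentation: a plain-text summary
    return [f"{' + '.join(sorted(p))}: {n} customers" for p, n in patterns.items()]


rows = select(integrate(clean(store_log), clean(web_log)), {"milk", "bread"})
report = present(evaluate(mine(transform(rows)), min_count=2))
```

The actual mining is one step of seven, and most of the code is preparation.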
The Process of Knowledge Discovery
Multi-Dimensional Databases
• Multidimensional structures use a variation of the relational model to organize data and express the relationships between data.
  • More complex than the typical row/column relational database. Each cell within a multidimensional structure contains aggregated data related to elements along each of its dimensions
• Time is an additional dimension used in the analysis of data
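A multidimensional structure can be sketched as a mapping from a tuple of dimension values to an aggregated cell. The example below assumes three dimensions (product, region, and quarter as the time dimension) and invented sales figures; `roll_up` is an illustrative helper that aggregates away the dimensions you do not want to look at.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, units_sold)
facts = [
    ("printer", "east", "Q1", 10),
    ("printer", "east", "Q2", 15),
    ("printer", "west", "Q1", 7),
    ("computer", "east", "Q1", 20),
    ("computer", "west", "Q2", 5),
]


def build_cube(rows):
    """Each cell, keyed by (product, region, quarter), holds aggregated units."""
    cube = defaultdict(int)
    for product, region, quarter, units in rows:
        cube[(product, region, quarter)] += units
    return cube


def roll_up(cube, keep):
    """Aggregate over dropped dimensions; keep lists positions to retain
    (0 = product, 1 = region, 2 = quarter)."""
    result = defaultdict(int)
    for key, units in cube.items():
        result[tuple(key[i] for i in keep)] += units
    return dict(result)


cube = build_cube(facts)
by_product = roll_up(cube, keep=[0])
by_region_quarter = roll_up(cube, keep=[1, 2])
```

Viewing the same cells by product, or by region and quarter, is what it means to analyze data along different dimensions and at different levels of abstraction.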
Example Of A Multi-Dimensional Database Structure