UNCLASSIFIED PUBLIC
VALCRI WHITE PAPER SERIES
VALCRI-WP-2017-011
1 February 2017
Edited by B.L. William Wong

Applying Visual Interactive Dimensionality Reduction to Criminal Intelligence Analysis

Dominik Sacha (1), Wolfgang Jentner (1), Leishi Zhang (2), Florian Stoffel (1), Geoffrey Ellis (1) and Daniel Keim (1)

(1) University of Konstanz, 78457 Konstanz, GERMANY
(2) Middlesex University London, The Burroughs, Hendon, London NW4 4BT, UNITED KINGDOM

Project Coordinator
Middlesex University London
The Burroughs, Hendon
London NW4 4BT
United Kingdom

Professor B.L. William Wong
Head, Interaction Design Centre
Faculty of Science and Technology
Email: [email protected]

Copyright © 2016 The Authors and Project VALCRI. All rights reserved.
ABSTRACT

VALCRI provides a challenging and overwhelming high-dimensional dataset that comprises hundreds of extracted semantic features in addition to the usual spatiotemporal information or metadata. To overcome the curse of dimensionality and to generate low-dimensional representations of these semantic features, we apply interactive high-dimensional data analysis techniques with the goal of obtaining clusters of similar crime reports. However, it is still a challenge for crime analysts to make sense of the results and to provide useful interactive feedback to the system. Therefore, we provide several tightly integrated interactive visualizations that allow the analysts to identify clusters of similar crimes from different perspectives and to interactively focus their analysis on features or crime records of particular interest.
Keywords

Criminal Intelligence, High-Dimensional Data Analysis, Feature Extraction, Dimensionality Reduction, Visual Analytics
INTRODUCTION

A major task for crime analysts is to identify and group crime reports according to their similarity. However, the “similarity” of crimes, which is computed from semantic features extracted from the textual crime report description, may be defined in different ways and may be specific to a variety of analysis tasks, crime types, and geographic and temporal characteristics. This flexibility requires an exploratory analysis of crime similarities with overviews that can be iteratively refined. In addition, the automatic processing of text documents in VALCRI provides the analyst with a large number of extracted features for each crime report, resulting in a challenging dataset that may contain millions of data records and hundreds of extracted features.

Our approach to tackling this high-dimensional dataset is to apply a dimensionality reduction (DR) pipeline that can be steered by interactive visualizations integrated into the VALCRI framework. On the one hand, our solution lets the analyst apply external (e.g., geographic or temporal) filters to the dataset, allowing them to investigate and identify feature characteristics of the remaining crimes. On the other hand, our approach provides the analyst with low-dimensional representations of crimes that enable the analyst to investigate and identify clusters/groups of similar crimes from several perspectives. The domain expert can provide feedback to the system at every step of the pipeline.

This paper describes our work in progress. A major challenge of effective user involvement is translating the analyst's feedback in the different components into appropriate DR pipeline adaptations and recomputations. This is a research challenge that has been recognized in the machine learning (ML) and visual analytics (VA) communities (Sacha et al., 2016). To this end, we provide the analyst with intuitive visualizations and provide guidance based on automated measures. Our approach hides computational complexity but allows the analyst to identify interesting patterns, such as groups of similar crimes or outliers.

The major steps of our solution are 1) feature extraction from textual crime reports and a weighted similarity model as a pre-processing step, 2) calculation and visualization of feature characteristics, 3) use of different algorithms to produce low-dimensional embeddings of the data, 4) visual interactive representations of similar crime clusters, and 5) adaptive computations of results. The remainder of this paper provides details of the analyzed dataset and high-dimensional data analysis, describes the DR pipeline with its interactive visualizations, and explains how it integrates with the table components.
DATA PROCESSING AND FUNDAMENTALS ON HIGH-DIMENSIONAL DATA ANALYSIS
Semantic Crime Case Analysis

The VALCRI text processing pipeline extracts a large number of features from crime reports based on semantic word lists or ontologies about known and domain-specific concepts. For example, all terms that describe materials (copper, steel, wood, etc.) can be identified and extracted as features that belong to a common “materials” concept category. Similarly, physical parts that belong to a house (door, floor, kitchen, etc.) can be identified and extracted as features that belong to a common “building parts” category. Hence, an analysis question could be to distinguish burglaries that mention doors according to the associated materials (e.g., “wooden door” vs. “steel door”) or according to whether the doors were “locked” or “unlocked”. The text processing pipeline provides numerous semantic features of this kind, which can be transferred to a binary feature vector for every crime record (a vector that describes whether a specific concept was identified or not).

Similarity Mapping and Feature Selection

For comparative case analysis, analysts want to find out the similarities between crime cases to support reasoning and sensemaking. Given a collection of crimes, it is possible to apply a metric (for example, Euclidean distance, which takes into consideration the number of common concepts/features in two crime cases) to measure the distance/dissimilarity between each pair of crime cases, resulting in a so-called distance matrix (see Table 1 for an example using US cities). In VALCRI we make use of a weighted similarity metric that is computed based on feature importance. Feature selection can be defined interactively based on semantic relations between features (e.g., select all the building parts and materials), but also semi-automatically using correlation analysis or subspace clustering.

Table 1. Distance matrix between seven US cities

           BOSTON     NY     DC  CHICAGO  SEATTLE     SF     LA
BOSTON          0    206    429      963     2976   3095   2979
NY            206      0    223      802     2815   2934   2786
DC            429    223      0      671     2684   2799   2631
CHICAGO       963    802    671        0     2013   2142   2954
SEATTLE      2976   2815   2684     2013        0    808   1131
SF           3095   2934   2799     2142      808      0    379
LA           2979   2786   2631     2954     1131    379      0
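To make the two steps above concrete, the following is a minimal sketch of how crime reports could be turned into binary feature vectors and then into a pairwise distance matrix. The word lists, the sample reports, the simple tokenizer, and all function names are hypothetical; the actual VALCRI ontologies and text processing are not described at this level of detail, and the sketch assumes one feature per concept-term pair.

```python
from math import sqrt

# Hypothetical concept word lists; the real VALCRI ontologies are not shown in the paper.
CONCEPTS = {
    "materials": {"copper", "steel", "wood", "wooden"},
    "building parts": {"door", "floor", "kitchen", "window"},
}


def feature_names(concepts):
    """Return one (concept, term) feature per word-list entry, in a stable order."""
    return [(c, t) for c in sorted(concepts) for t in sorted(concepts[c])]


def to_binary_vector(report, features):
    """Return 1 for every feature whose term occurs in the report text, else 0."""
    cleaned = report.lower().replace(",", " ").replace(".", " ")
    tokens = set(cleaned.split())
    return [1 if term in tokens else 0 for _, term in features]


def euclidean(a, b):
    """Plain (unweighted) Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances, analogous to Table 1."""
    return [[euclidean(a, b) for b in vectors] for a in vectors]


# Invented example reports, for illustration only.
reports = [
    "Offender forced the wooden door.",
    "Steel door left unlocked, copper stolen.",
    "Entry via kitchen window.",
]
features = feature_names(CONCEPTS)
vectors = [to_binary_vector(r, features) for r in reports]
matrix = distance_matrix(vectors)
```

For binary vectors, the squared Euclidean distance equals the number of features on which two reports differ, so reports that share more concepts end up closer together.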
Visual Embedding Techniques (Dimensionality Reduction)

A common approach to analyzing similarity is to embed the crimes into a visual display as a scatter plot, where each point represents a crime and the distance between two points indicates their dissimilarity, i.e., similar crimes are placed close to each other, and dissimilar ones are far apart. This allows the analyst to see the patterns in the data. Figure 1 shows some example visual embeddings.
Figure 1. Visual embeddings that show similarity and patterns in the data. Colors indicate different class labels.
Various dimensionality reduction algorithms exist for computing the visual embedding of a dataset (Maaten et al. 2009). The idea is to approximate, as closely as possible, the dissimilarity between objects in the data space by the Euclidean distance between points in the visual display, so that groups of similar objects and outliers can be easily identified. However, it is impossible for any algorithm to guarantee a 100% accurate mapping, and different algorithms have different strengths in showing patterns in data. For VALCRI we chose three algorithms: PCA, MDS, and t-SNE.

PCA is the ancestor of all embedding algorithms and is still one of the most widely used methods across different disciplines and applications. The method has the strength of highlighting linear patterns in the data (see the left embedding in Figure 1 as an example). The visualization is based on the top two principal components, each of which is determined by a set of features with different levels of influence. These components are linear combinations of the original features that capture the largest variation. Although classic MDS (Kruskal and Wish, 1978) is similar to PCA, non-metric MDS allows the analysts to find the nonlinear patterns in the data (see the right embedding in Figure 1 as an example). Distance-based techniques, such as MDS, aim to preserve the distances (e.g., Figure 1) in the embedding. The principle of non-metric MDS is a nonparametric and monotonic distance transformation that is identified with either isotonic optimization or semi-definite programming. t-SNE is one of the state-of-the-art embedding algorithms; it has the strength of highlighting the nonlinear patterns in the data and is an example of a neighborhood-preserving technique. The algorithm outperforms many other methods in visual quality in terms of clearer separation between clusters (Maaten and Hinton 2008). In VALCRI we implemented all these different methods to allow the analyst to visualize the data from different perspectives (linear, distance-based, and neighborhood-preserving) in order to validate and verify clusters and outliers across techniques in an exploratory manner.
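As an illustration of the distance-preserving idea, the sketch below implements classic (metric) MDS in plain Python: the squared distances are double-centered into a Gram matrix, and its two leading eigenvectors, found here by power iteration with deflation, give the 2D coordinates. This is not the VALCRI implementation, and it is neither non-metric MDS nor t-SNE; the function names and the four-point example are assumptions made for illustration, and a numerical library would normally replace the hand-written eigen-solver.

```python
from math import sqrt


def double_center(dist):
    """Gram matrix B = -1/2 * J * D^2 * J implied by a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpair(mat, iterations=500):
    """Dominant eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    n = len(mat)
    vec = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-symmetric start vector
    norm = sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n))
    return value, vec


def classical_mds(dist, dims=2):
    """Embed items so that Euclidean distances approximate the given distances."""
    gram = double_center(dist)
    n = len(gram)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        value, vec = top_eigenpair(gram)
        scale = sqrt(max(value, 0.0))  # guard against small negative eigenvalues
        for i in range(n):
            coords[i][d] = scale * vec[i]
        # Deflate: remove the component just found before extracting the next one.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords


# Four points on a 3 x 4 rectangle; their distances are exactly Euclidean.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
dist = [[sqrt((ax - bx) ** 2 + (ay - by) ** 2) for bx, by in points]
        for ax, ay in points]
coords = classical_mds(dist)
```

Because the example distances come from points in a plane, the embedding reproduces them up to rotation and reflection. For distances derived from hundreds of features, a 2D embedding can only approximate them, which is why the paper compares several techniques.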
S3 - SIMILARITY SPACE SELECTOR

We implemented the described techniques in an interactive DR pipeline and give the analyst ways to interact through several visualizations. We first provide an overview of the pipeline steps and subsequently describe the different visualizations in more detail.

Pipeline Overview

Our visual analytics pipeline, shown in Figure 2, illustrates the computational steps (top), accompanying visualizations (middle) and interactive feedback of the analyst (bottom). In the very first step (Feature Extraction) the features are extracted from crime report text documents and transferred to binary feature vectors.
We implemented a tabular and matrix-based visualization that shows the semantic features of selected crime reports (see the “Crime Table” section for more details). In the second step (Pre-Processing) the feature weight (or importance) model is created and visualized as bar charts. Based on these feature weights, the pairwise distances are prepared for the subsequent DR computations. The feature selection in this stage is supported by Pearson correlation analysis or semantic concept relationships. In the DR step, three different visual embeddings can be computed (PCA, MDS, t-SNE) and a k-means clustering is computed in the visual space (the obtained x and y coordinates). The identified clusters are broadcast to the other components in the VALCRI framework (e.g., crime tables, maps). The analyst can select and highlight clusters of interest and adapt the similarity model. Further external changes to the analyzed data (setting geographic, temporal or concept-based filters) will cause an immediate recomputation of the pipeline.

Once the results are computed, we update all the visualizations accordingly and animate between the previous and the new states. This allows the analyst to track changes, investigate the similarities of the crimes, and fine-tune the DR pipeline by feature and data selections as well as parameter tuning.

Data Preparation

The data to be analyzed can be changed by external filters using the search functionality, temporal selections, or geographic filters. Any changes to the input data will cause a recalculation of the pipeline.

Weighted Similarity Model & Feature Selection

A foundation for adapting the feature importance of the pipeline is a feature weight model in the pre-processing step. The model holds an importance (or weight) value between 0 and 100 for each feature (with a default value of 50). This weight vector can be changed based on user feedback or automated feature selection algorithms and forms the basis for calculating the pairwise distances. These distances are calculated using a weighted Euclidean distance measure and serve as input for the subsequent DR and clustering step. The feature weights are visualized as a bar chart (Figure 3) below each scatter plot (Figure 5). Hovering over a bar reveals the feature name in a tooltip, whereas dragging a bar changes the weight (between 0 and 100) of that feature and, when finished, causes a recomputation of the DR. The feature weights can also be changed by other (external) components, for example by clicking on a concept in the table (see the “Crime Table” section for more details).
Figure 3. Bar chart: The feature weights are shown as bars. Hovering will reveal information in a tooltip (as illustrated), whereas dragging up and down will change a feature weight.
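The weight model and the weighted Euclidean distance can be sketched as follows. The paper states only that weights lie between 0 and 100 with a default of 50; how a weight enters the distance is not specified, so scaling it to the range 0 to 1 and multiplying it with the squared feature difference is an assumption, as are all names in the snippet.

```python
from math import sqrt

DEFAULT_WEIGHT = 50  # default feature weight stated in the paper


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance; weights in [0, 100] are scaled to [0, 1] (assumed)."""
    return sqrt(sum((w / 100.0) * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def weighted_distance_matrix(vectors, weights):
    """Pairwise distances that serve as input to the DR and clustering step."""
    return [[weighted_euclidean(a, b, weights) for b in vectors] for a in vectors]


def set_weight(weights, index, value):
    """Return a copy of the weight vector with one weight changed, clamped to [0, 100]."""
    updated = list(weights)
    updated[index] = max(0, min(100, value))
    return updated


# Three invented crime records described by three binary features.
vectors = [[1, 0, 1], [1, 1, 0], [0, 1, 0]]
weights = [DEFAULT_WEIGHT] * 3

before = weighted_distance_matrix(vectors, weights)
# Dragging the third bar down to zero removes that feature from the similarity.
after = weighted_distance_matrix(vectors, set_weight(weights, 2, 0))
```

With this formulation, a weight of zero makes a feature irrelevant to all distances, which matches the described behavior of excluding a feature, while raising a weight stretches the space along that feature.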
Feature selection is further supported by computing feature correlations in the analyzed dataset. To this end, we compute Pearson correlation coefficients and visualize them in a matrix visualization (Figure 4). Positive correlations color the cells blue, whereas negative correlations color them red. This allows the analysts to understand feature relationships and to spot features that co-occur (e.g., “smash” is likely to occur with “window”) as well as those that do not co-occur (e.g., either “door” or “window” occurs, but not both). Hovering will reveal the feature names and the correlation value. Clicking on a cell will set the respective feature weight to zero and cause a recomputation of the pipeline. In the future, we plan to apply reorganization algorithms to reveal patterns in the matrix.
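A minimal sketch of the correlation step is shown below: Pearson coefficients are computed between every pair of feature columns of the binary vectors. The sample data and function names are invented for illustration, and reporting an undefined correlation (a constant feature) as 0 is a choice made for this sketch only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: correlation is undefined, reported as 0 here
    return cov / sqrt(var_x * var_y)


def correlation_matrix(vectors):
    """Feature-by-feature correlation matrix for a list of feature vectors."""
    columns = list(zip(*vectors))
    return [[pearson(a, b) for b in columns] for a in columns]


# Invented records; columns are the features "smash", "window", "door".
vectors = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
corr = correlation_matrix(vectors)
```

In this toy data, “smash” and “window” always occur together (coefficient 1, a blue cell), whereas “window” and “door” never do (coefficient -1, a red cell).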
Figure 2. DR pipeline overview: Computation steps are shown on top, with their respective visualizations shown below. The analyst is able to adapt the pipeline during the analysis, as shown at the bottom.
Figure 4. Correlation matrix view: Hovering over a cell will reveal the feature names and correlation strength. Clicking on a cell will exclude the respective feature from the computations.
Dimensionality Reduction & Visual Clustering

The final step in the pipeline is the computation of the visual embedding and an additional clustering within the resulting two-dimensional space. The analyst can choose between three different algorithms for different purposes (PCA - feature-based linear embedding, MDS - (large) distance preserving, t-SNE - neighborhood preserving). For clustering, we implemented the k-means algorithm, which lets the analyst set the number of desired clusters. The obtained clusters are visualized as colored dots within the convex hull of all the cluster members in a scatter plot (Figure 5). Hovering over the cluster area will highlight the cluster; clicking will select it. Other components are notified of any selections in order, for instance, to reveal the respective crimes (linking & brushing). Similarly, external components can send highlight selections to S3 to reveal the desired items in the scatter plot. This allows the analyst to see the selected clusters in the other views, which can help them understand characteristics of the cluster (e.g., find out which concepts are important and distinguish between different clusters). Interactions, such as changing the feature weights or choosing the DR algorithm, will cause a recomputation of the pipeline. Once the computation is finished, we animate the crimes moving to their new positions in the scatter plot. This allows the analyst to keep track of changes and to perform “what-if” testing. For example, an analyst may remove a feature and observe how the visualization (the distances and clusters) changes. Clicking on the “flip” area on top of the visualization (scatter plot or correlation matrix) allows the analyst to switch between the views.
Figure 5. Scatter plot and bar chart: The bar chart shows the feature importance model for the respective embedding above. The scatter plot visualizes the crimes in a two-dimensional embedding together with the clustering result (color and cluster outline).
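The clustering in the visual space can be sketched with a small k-means implementation that operates on the 2D coordinates of the embedding. The paper does not describe initialization or stopping criteria, so using the first k points as starting centroids and stopping when assignments no longer change are assumptions of this sketch, as are the function names and sample points.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100):
    """Lloyd's k-means on 2D points; returns (labels, centroids)."""
    # Assumption: the first k points serve as a deterministic starting guess.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, centroids


# Invented x/y coordinates, standing in for the output of PCA, MDS, or t-SNE.
embedding = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = k_means(embedding, k=2)
```

The resulting labels are what would be mapped to colors and convex hulls in the scatter plot and broadcast to the other components. Since the clustering runs on the embedded coordinates rather than the original features, its result depends on which DR algorithm produced the embedding.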
CRIME TABLE

The design of the crime table component is inspired by one of the tools available to crime investigators. Its main purpose is to provide a quick overview of concepts and their underlying terms that are associated with each crime. Currently, this task is mainly solved with large tabular representations using standard software tools, such as Microsoft Excel. Our component offers two different modes that can be viewed at any time. The Concept Classification Table (CCT) provides a detailed view of terms organized by their semantic concept affiliation. The second mode, called Crime Concept Matrix (CCM), features a matrix-based overview visualization of concept-term combinations.

Concept Classification Table (CCT)

The default setting of the CCT view is shown in Figure 6. Each concept (or feature) that occurs at least once in the selected set of crimes is represented as a row in the table. The crimes are listed column-wise and denoted with their crime number on top. When a concept term occurs in a crime, it is shown in the respective cell. For example, the terms “left” and “walked” belonging to the concept “Moving Actions” are shown in the first colored cell of the table. The colors are provided by the S3 component and represent clusters of similar crimes.
Crime Concept Matrix (CCM)

The CCT can be morphed into the CCM, where each concept-term combination is represented as its own row (see Figure 7). In contrast to the previous example in the CCT section, the terms “left” and “walked” are now represented as separate rows and are prefixed with their affiliated concept. As in the CCT, a colored cell indicates that the respective concept-term combination occurs in this crime. The color represents the cluster membership as defined by S3.
Figure 7. CCM: Each concept-term combination is displayed row-wise.
Both the table and the matrix can be zoomed. Zooming out provides an overview of a large number of crimes or concepts, which can help detect larger patterns. The terms are faded out in CCT mode when the zoom level decreases to a certain degree in order to provide the same look and feel as the CCM. However, in contrast to the CCM, only the concepts are displayed in combination with the colored cells.

DR Pipeline Integration

The S3 and the crime table components are tightly integrated. Besides the brushing of the clusters represented by their colors, the components are also interactively linked. Hovering over a cluster or single point highlights the crime(s) as bold text in the table (see Figure 6, red cluster). Clicking on a term will change the font size and will make the term in all cells appear larger and thus more important. This is shown in Figure 6 with the terms “stole” and “stolen”. Furthermore, an event is broadcast that updates the weights for the current dataset. The event is intercepted in S3, where the bar chart is updated and the currently selected projection method is used to recalculate the clusters with the new weight settings. Each recalculation may update the clusters and thus might change the colors in the table component. A second click on the term will decrease the font size to a minimum and strike through the term, implying that the weight of this term is set to zero. In Figure 6 this is true for the terms “removed”, “removing”, and “took”. A third click puts the term back to the default state (50%). The same interaction is also possible in CCM mode when the user clicks on a concept-term combination (see Figure 7). In CCT mode, the user may also interact with a concept. This will hide the whole row in the table and, therefore, set all weights of the terms belonging to this concept to zero (see Figure 6 with concept “Simple Colors”). A second click will make the row reappear, putting the weights back into the default state. As described in the previous section, the user may also define the weights using the bar chart, causing the same weight event to be broadcast from the S3 component. The weights are again mapped to the font size, and a term is shown as strike-through when its weight is set to zero.
Figure 6. Table Component (CCT mode): The crimes are listed in the columns. Each row represents a concept. When a concept term occurs in a specific crime, it is written into the cell.
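The three-click cycle and the weight broadcast described above can be sketched as a small state machine with subscribers. The paper gives the default (50) and the excluded (0) weight but not the value used after the first click, so the value 100 is an assumption; the class, the callback mechanism, and the rendering hints are likewise hypothetical stand-ins for the VALCRI event system.

```python
DEFAULT_WEIGHT = 50
EMPHASIZED_WEIGHT = 100  # assumption: the paper does not state the raised value
EXCLUDED_WEIGHT = 0

# Click cycle: default -> emphasized -> excluded -> default.
NEXT_WEIGHT = {
    DEFAULT_WEIGHT: EMPHASIZED_WEIGHT,
    EMPHASIZED_WEIGHT: EXCLUDED_WEIGHT,
    EXCLUDED_WEIGHT: DEFAULT_WEIGHT,
}


def term_style(weight):
    """Map a weight to a hypothetical rendering hint for a table cell."""
    if weight == EXCLUDED_WEIGHT:
        return {"font_scale": 0.5, "strike_through": True}
    return {"font_scale": 0.5 + weight / 100.0, "strike_through": False}


class WeightModel:
    """Holds term weights and notifies subscribers (e.g., S3) on every change."""

    def __init__(self, terms):
        self.weights = {term: DEFAULT_WEIGHT for term in terms}
        self.listeners = []

    def subscribe(self, callback):
        """Register a callback that receives a copy of all weights."""
        self.listeners.append(callback)

    def click_term(self, term):
        """Advance one term through the click cycle and broadcast the change."""
        current = self.weights[term]
        # Weights set elsewhere (e.g., via the bar chart) fall back to the default.
        self.weights[term] = NEXT_WEIGHT.get(current, DEFAULT_WEIGHT)
        self._broadcast()

    def _broadcast(self):
        snapshot = dict(self.weights)
        for callback in self.listeners:
            callback(snapshot)


model = WeightModel(["stole", "took"])
received = []                      # stands in for the S3 recomputation trigger
model.subscribe(received.append)
model.click_term("stole")          # first click: emphasized
model.click_term("took")           # first click on another term
model.click_term("took")           # second click: excluded, shown as strike-through
```

Each broadcast carries the full weight vector, so a subscriber can recompute the weighted distances and the embedding from a single event without querying the table component.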
CONCLUSIONS
This paper describes our work in progress towards applying high-dimensional data analysis techniques to crime case analysis with the goal of exploring semantic similarities. In the future, we will focus on supporting and guiding the analyst further in configuring the pipeline and the visualizations. To this end, we plan to investigate automatic feature selection algorithms (such as subspace clustering), matrix sorting algorithms, and visual quality metrics to reveal patterns in the visualizations and to provide the analyst with recommendations.
ACKNOWLEDGEMENTS

The research leading to the results reported here has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) through Project VALCRI, European Commission Grant Agreement N° FP7-IP-608142, awarded to Middlesex University and partners.
REFERENCES

J. B. Kruskal and M. Wish (1978). Multidimensional Scaling, vol. 11. Sage.

L. J. P. van der Maaten and G. E. Hinton (2008). Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, 9: 2579–2605.

L. van der Maaten, E. Postma, and H. van den Herik (2009). Dimensionality reduction: A comparative review. Technical report, Tilburg Centre for Creative Computing, Tilburg University.

D. Sacha, L. Zhang, M. Sedlmair, J. A. Lee, J. Peltonen, D. Weiskopf, S. C. North and D. A. Keim (2016). Visual Interaction with Dimensionality Reduction: A Structured Literature Analysis. IEEE Transactions on Visualization and Computer Graphics (TVCG), DOI: 10.1109/TVCG.2016.2598495.
VALCRI Partners (number; partner; contact; country)

 1; Middlesex University London; Professor B.L. William Wong, Project Coordinator, and Professor Ifan Shepherd, Deputy Project Coordinator; United Kingdom
 2; Space Applications Services NV; Mr Rani Pinchuck; Belgium
 3; Universitat Konstanz; Professor Daniel Keim; Germany
 4; Linkopings Universitet; Professor Henrik Eriksson; Sweden
 5; City University of London; Professor Jason Dykes; United Kingdom
 6; Katholieke Universiteit Leuven; Professor Frank Verbruggen; Belgium
 7; AE Solutions (BI) Limited; Dr Rick Adderley; United Kingdom
 8; Technische Universitaet Graz; Professor Dietrich Albert; Austria
 9; Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.; Mr. Patrick Aichroft; Germany
10; Technische Universitaet Wien; Assoc. Prof. Margit Pohl; Austria
11; ObjectSecurity Ltd; Mr Rudolf Schriener; United Kingdom
12; Unabhaengiges Landeszentrum fuer Datenschutz; Dr Marit Hansen; Germany
13; i-Intelligence; Mr Chris Pallaris; Switzerland
14; Exipple Studio SL; Mr German Leon; Spain
15; Lokale Politie Antwerpen; (no contact listed); Belgium
16; Belgian Federal Police; (no contact listed); Belgium
17; West Midlands Police; (no contact listed); United Kingdom