2014: Treparel Big Data Text Analytics & Visualization

DESCRIPTION

Text and content analytics have become a source of competitive advantage, enabling businesses, government agencies, and researchers to extract unprecedented value from unstructured data. Treparel (Delft, The Netherlands) is an independent provider of text analytics and visualization software. Organizations such as Philips, Bayer, Abbott, and NXP Semiconductors use KMX Text Analytics software to gain faster, more reliable, and more precise insights into large, complex unstructured data sets. The KMX API allows software and service companies to enhance their unstructured data analysis capabilities by embedding world-class, machine-learning-based clustering, categorization, and visualization. A recent review by IDC states: “KMX visualization capabilities around its auto-categorization and clustering offer immediate insight into unstructured data sets and appear to be adaptable and customizable to customer needs. Its approach to auto-categorization utilizes statistical principles and machine learning that require significantly less training and tuning on the part of customers than other approaches.”

Text of 2014: Treparel Big Data Text Analytics & Visualization

  • 1. Introducing Treparel: Big Data Text Analytics & Visualization applications
    Treparel, Delftechpark 26, 2628 XH Delft, The Netherlands, www.treparel.com
    Jeroen Kleinhoven, CEO, jeroen@treparel.com
    February 2014

2. Industry Thought Leaders about Treparel

“Treparel KMX's visualization capabilities around its auto-categorization and clustering offer immediate insight into unstructured data sets and appear to be adaptable and customizable to customer needs. Its approach to auto-categorization utilizes statistical principles and machine learning that require significantly less training and tuning on the part of customers than other approaches.” David Schubmehl, IDC

“As we acquire more and more information, we need tools that will guide us through the data maze. Analysts need tools to help them understand patterns and define clusters. Users need to explore data to uncover relationships from scattered sources. Treparel's KMX serves both these needs with its ability to cluster and categorize collections of data with a high degree of accuracy, and its interactive visualization tools that enable exploration of large data sets.” Sue Feldman, Synthexis.com (author: The Answer Machine)

Treparel KMX All Rights Reserved 2013 www.treparel.com

3. Some of our clients & partners

“KMX is an integral part of our IP analysis toolbox. It contributes to our capability of making added value IP analyses of technologies and competitors to support strategic decision making.”

“We've sped up our patent searches from 2 days to 2 hours using KMX technology”

www.fusepool.eu

4. Key Business Problems Treparel KMX solves

Application area: IP & Patent Search
    Business problem: How to improve the time-consuming and costly manual search process of patents.
    Value: Reduce research time, improve precision & recall of relevant documents. Improve legal position and drive more revenue from IP.

Application area: Competitive Analysis
    Business problem: How to increase knowledge on competitors by gaining clustered insights from (semi-)public sources.
    Value: Improve competitive advantage by determining international strategy, product roadmap, R&D planning, marketing campaigns and customer sentiment.

Application area: Healthcare
    Business problem: How to identify health risks and find correlations in diseases or medical defects.
    Value: Early identification of health risks by cross-discipline analyses on medical records, clinical observations and medical images.

Application area: Media & Publishing
    Business problem: How to improve search and content analytics on large volumes of publications.
    Value: Text analytics embedded in publishing improves relevance and accuracy of search and shows previously hidden documents.

5. Key Business Problems Treparel KMX solves - 2

Use case: Sentiment Analysis
    Business problem: How to manage current and future customers and their interactions.
    Value: Deriving sentiment from critical customer-based text sources can drive revenue, satisfaction and loyalty.

Use case: Voice of Customer
    Business problem: How to manage communications and interactions with employees, managers, subordinates and employment candidates.
    Value: Analyzing HR-related information (like CVs and projects) to match demand to supply.

Use case: eDiscovery
    Business problem: How to manage and mitigate general litigation risk and cost in large sets of text and emails.
    Value: Text analytics applied to legal trials or in laws and jurisprudence improves accuracy in legal cases and lowers costs.

Use case: Predictive Analysis
    Business problem: How to identify early signs of required maintenance that affect customer satisfaction and operational costs.
    Value: Use customer satisfaction surveys on food quality to identify airplane ovens requiring maintenance tune-ups.

6. Part 1: KMX: Ready to Use Text Analytics

Intuitive Content Clustering, Classification & Visualization

7. KMX Text Analytics Application overview

Diagram labels: Query & Search Tools; Acquire documents; Text Preprocessing and Indexing; Clustering; Classification; Visualization; Semantic Analysis; Taxonomies, Ontologies; Present Results

KMX unique functions:
    • Extract concepts in context using clustering and classification of documents
    • Use classification to create ranked lists and to tag subsets
    • Support of binary and multi-class classification
    • Enterprise edition (server/cloud) & Professional edition (desktop)
    • Integration with other applications through the KMX API

8. Clustering: User Unsupervised Analytics

Benefits: Get quick insights through automated visual clusters with annotations to enhance the discovery process
    1. Analyze the clusters and the relationships in the data
    2. Explore outliers in the data
    3. Find documents of interest

What it does: A visualization of clusters where the documents are displayed as points and the distance between them shows their similarity.

What KMX delivers: Use KMX to do:
    1. Perform text preprocessing (stemming, tokenization, etc.)
    2. Calculate a similarity measure between all documents
    3. Calculate the visualization (landscape) with automatic annotation
    4. Create the visualization
        • As a static image
        • Or provide interaction where the user can zoom in/out with support for adaptive annotation

9. Classification: User Supervised Analytics

Benefits: Finding fast, accurate and precise small result sets and enabling trend reporting and alerting by reusing predefined categorization models.
    1. Obtain a ranked list of the most relevant documents
    2. Separate the important documents from the irrelevant documents (noise)

How it works: A list of the relevant documents defined from a user's perspective.

What KMX delivers: Use KMX to do:
    1. Tag (label) a small number of relevant and irrelevant documents
        • Use search to identify documents that need to be tagged
        • Perform manual tagging
        • Select documents interactively from the visualization (brushing)
    2. Create a Classifier (categorizer) using the tagged documents
    3. Automatically perform the classification on all documents
    4. Obtain the important documents, which are ranked high, and the irrelevant documents, which are ranked low

10. Visualization: Discovering Unexpected Insights

Benefits: KMX visualizations support the process of constructing a visual image in the mind to understand the data better.

How it works: KMX offers a visualization framework with various methods for seeing the unseen. It enriches the process of discovery and fosters profound and unexpected insights.

What KMX delivers: Different visualizations or visual pipelines to:
    • Comprehend large data sets, that is, data sets that are too large to grasp by mental imagination
    • Discover previously unknown properties of the data set that may not have been anticipated
    • Reveal inherent problems of the data, for instance errors and artefacts
    • Examine large-scale features of the data set as well as local features, or let the user see local features in a larger-scale reference
    • Let users form hypotheses based on the (newly) observed phenomena or developed insights

11. Add-on servers: Auto Reporting & Batch Classification

Auto Reporting Server
    • Supports automated analysis for aggregated results for multiple users
    • Pie & bar charts
    • Landscape visualizations for overview of subjects
    • Enables rich interaction via web interface

Classification Batch Server
    • High-performance stand-alone text-classification server
    • Enables large-scale parallel processing

12. Business Value from Content with KMX

    • Text Analytics for Anyone and Everyone: Intuitive to use and learn. Designed for every user: business (info consumers) and scientific (info creators).
    • Instant Business Insights: Explore all of your unstructured data (text, blogs, email, patents) without limits.
    • Rapid Time to Value: Adaptable and customizable to users' needs. No implementation or extensive and expensive modelling or development. Significantly less training and tuning.
    • Any size deployment: Meets every business need from a single user to large multilevel type user groups.
    • Language independent: Search and analyze most of the world's languages using machine translation.
    • Any kind of deployment: Use it from your desktop or in a (private) cloud. Buy the software-as-a-service or get the output-as-a-service.
    • Enterprise-proven, IP & IT friendly: Successfully delivering value to IP, business and markets in multinational companies.
    • Integration: Use the KMX API to increase the value of unstructured data in your IP discovery infrastructure

13. Part 2: KMX software: User Interface, key functions & value

14. KMX: Model, Analyse, Discover and Visualize in one view and deploy it to large scale

Screenshot labels: Search and highlighting; Brushing; Filtering; Document text; Landscape visualization; Coloring of classification score

KMX Example: Ebola, SARS, Bird flu: How do they relate?

15. KMX: Optimize Output using Classification Performance Tuning

Screenshot labels: Precision and Recall; Document classification for three classes; Distribution of classification scores

16. Use Case 1: Performing small to large scale SWOT analysis (on AstraZeneca patents)

SWOT analysis example: Start with removing irrelevant patents using Classification and Filtering to determine:
    • Who are the important players (assignees, inventors)?
    • Where are the important patents filed (countries)?
    • What is the trend over time (growth of patents over the years)?

NB: we used a (very) simple query to find 986 patents filed under AstraZeneca.

Diagram labels: Patent Database; Queries; 10,000+ patents; Ranking, Filtering; 986 patents; Ranking, Filtering; 29 patents; Ranking, Filtering; Business User; Output

17. Landscaping and Ranking: From 986 to the most relevant patents

Fig: Using visual selection (brushing) to build a classification model (Classifier) to be able to rank the full data set and to extract the most relevant.

18. Landscaping and Ranking: What are most relevant Respiratory & Inflammation patents?

Yellow = most important patents (+80% score)
Blue = least relevant patents (for this analysis)
NB: crosshair points to 1 specific patent (full text in left pane)

Fig: Ranked patents using a Classifier for Respiratory & Inflammation patents (in yellow, the selection of 29 absolutely relevant patents to be further analyzed). We used "respiratory" to demonstrate highlighting capabilities.

19. How Reliable & Accurate are the results? Review your results with advanced performance tools

The quality of the automatic classification (categorization) is shown in the histogram, where a small number of documents with a high classification score are separated from the large number of documents.

Fig: Classification performance, 1280 patents on biomass (labels: Non relevant documents; Relevant documents)

KMX calculates the Precision and Recall of the results using cross validation.
Precision is essential for: First analysis & Alerting services
Recall is crucial for: Freedom to Operate search, Validity search Pa