2014: Treparel Big Data Text Analytics & Visualization

  • Published on
    28-Jan-2015

DESCRIPTION

Text and content analytics have become a source of competitive advantage, enabling businesses, government agencies, and researchers to extract unprecedented value from unstructured data. Treparel (Delft, The Netherlands) is an independent provider of text analytics and visualization software. Organizations such as Philips, Bayer, Abbott, and NXP Semiconductors use KMX Text Analytics software to gain faster, more reliable, and more precise insights into large, complex unstructured data sets. The KMX API allows software and service companies to enhance their unstructured data analysis capabilities by embedding world-class, machine-learning-based clustering, categorization, and visualization. A recent review by IDC states: "KMX visualization capabilities around its auto-categorization and clustering offer immediate insight into unstructured data sets and appear to be adaptable and customizable to customer needs. Its approach to auto-categorization utilizes statistical principles and machine learning that require significantly less training and tuning on the part of customers than other approaches."

Transcript

<ul><li> 1. Introducing Treparel: Big Data Text Analytics &amp; Visualization applications. Treparel, Delftechpark 26, 2628 XH Delft, The Netherlands, www.treparel.com. Jeroen Kleinhoven, CEO, jeroen@treparel.com. February 2014</li></ul> <p> 2. Industry Thought Leaders about Treparel. "Treparel KMX's visualization capabilities around its auto-categorization and clustering offer immediate insight into unstructured data sets and appear to be adaptable and customizable to customer needs. Its approach to auto-categorization utilizes statistical principles and machine learning that require significantly less training and tuning on the part of customers than other approaches." David Schubmehl, IDC. "As we acquire more and more information, we need tools that will guide us through the data maze. Analysts need tools to help them understand patterns and define clusters. Users need to explore data to uncover relationships from scattered sources. Treparel's KMX serves both these needs with its ability to cluster and categorize collections of data with a high degree of accuracy, and its interactive visualization tools that enable exploration of large data sets." Sue Feldman, Synthexis.com (author: The Answer Machine). Treparel KMX All Rights Reserved 2013 www.treparel.com
3. Some of our clients &amp; partners. "KMX is an integral part of our IP analysis toolbox. It contributes to our capability of making added value IP analyses of technologies and competitors to support strategic decision making." "We've sped up our patent searches from 2 days to 2 hours using KMX technology." www.fusepool.eu
4. 
Key Business Problems Treparel KMX solves. Columns: Application Area; Business problem; Value.
IP &amp; Patent Search. Business problem: How to improve the time-consuming and costly manual search process of patents. Value: Reduce research time, improve precision &amp; recall of relevant documents. Improve legal position and drive more revenue from IP.
Competitive Analysis. Business problem: How to increase knowledge on competitors by gaining clustered insights from (semi-)public sources. Value: Improve competitive advantage by determining international strategy, product roadmap, R&amp;D planning, marketing campaigns and customer sentiment.
Healthcare. Business problem: How to identify health risks and find correlations in diseases or medical defects. Value: Early identification of health risks by cross-discipline analyses on medical records, clinical observations and medical images.
Media &amp; Publishing. Business problem: How to improve search and content analytics on large volumes of publications. Value: Text analytics embedded in publishing improves relevance and accuracy of search and shows previously hidden documents.
5. Key Business Problems Treparel KMX solves - 2. Columns: Use Cases; Business problem; Value.
Sentiment Analysis. Business problem: How to manage current and future customers and their interactions. Value: Deriving sentiment from critical customer-based text sources can drive revenue, satisfaction and loyalty.
Voice of Customer. Business problem: How to manage communications and interactions with employees, managers, subordinates and employment candidates. Value: Analyzing HR-related information (like CVs and projects) to match demand to supply.
eDiscovery. Business problem: How to manage and mitigate general litigation risk and cost in large sets of text and emails. Value: Text analytics applied to legal trials or in laws and jurisprudence improves accuracy in legal cases and lowers costs.
Predictive Analysis. Business problem: How to identify early signs of required maintenance that affect customer satisfaction and operational costs. Value: Use customer satisfaction surveys on food quality to identify airplane ovens requiring maintenance tune-ups.
6. Part 1: KMX: Ready to Use Text Analytics. Intuitive Content Clustering, Classification &amp; Visualization
7. 
KMX Text Analytics Application overview. Query &amp; Search Tools; Acquire documents; Text Preprocessing and Indexing; Clustering; Classification; Visualization; Semantic Analysis; Taxonomies, Ontologies; Present Results. KMX unique functions: Extract concepts in context using clustering and classification of documents; Use classification to create ranked lists and to tag subsets; Support of binary and multi-class classification; Enterprise edition (server/cloud) &amp; Professional edition (desktop); Integration with other applications through the KMX API.
8. Clustering: User Unsupervised Analytics. Benefits: Get quick insights through automated visual clusters with annotations to enhance the discovery process: 1. Analyze the clusters and the relationships in the data 2. Explore outliers in the data 3. Find documents of interest. What it does: A visualization of clusters where the documents are displayed as points and the distance between them shows their similarity. What KMX delivers: Use KMX to do: 1. Perform text preprocessing (stemming, tokenization, etc.) 2. Calculate a similarity measure between all documents 3. Calculate the visualization (landscape) with automatic annotation 4. Create the visualization, either as a static image or with interaction where the user can zoom in/out with support for adaptive annotation.
9. Classification: User Supervised Analytics. Benefits: Finding fast, accurate and precise small result sets and enabling trend reporting and alerting by reusing predefined categorization models: 1. Obtain a ranked list of the most relevant documents 2. Separate the important documents from the irrelevant documents (noise). How it works: A list of the relevant documents defined from a user's perspective. What KMX delivers: Use KMX to do: 1. Tag (label) a small number of relevant and irrelevant documents: use search to identify documents that need to be tagged; perform manual tagging; select documents interactively from the visualization (brushing) 2. Create a Classifier (categorizer) using the tagged documents 3. Automatically perform the classification on all documents 4. 
Obtain the important documents, which are ranked high, and the irrelevant documents, which are ranked low.
10. Visualization: Discovering Unexpected Insights. Benefits: KMX visualizations support the process of constructing a visual image in the mind to understand the data better. How it works: KMX offers a visualization framework with various methods for seeing the unseen. It enriches the process of discovery and fosters profound and unexpected insights. What KMX delivers: Different visualizations or visual pipelines to: comprehend large data sets that are too large to grasp by mental imagination; discover previously unknown properties of the data set that may not have been anticipated; reveal inherent problems of the data, for instance errors and artefacts; examine large-scale features of the data set as well as local features, or allow the user to see local features in a larger-scale reference; let users form hypotheses based on the (newly) observed phenomena or developed insights.
11. Add-on servers: Auto Reporting &amp; Batch Classification. Auto Reporting Server: Supports automated analysis for aggregated results for multiple users; Pie &amp; bar charts; Landscape visualizations for overview of subjects; Enables rich interaction via web interface. Classification Batch Server: High-performance stand-alone text-classification server; Enables large-scale parallel processing.
12. Business Value from Content with KMX. Text Analytics for Anyone and Everyone: Intuitive to use and learn. Designed for every user: business (info consumers) and scientific (info creators). Instant Business Insights: Explore all of your unstructured data (text, blogs, email, patents) without limits. Rapid Time to Value: Adaptable and customizable to users' needs. No implementation or extensive and expensive modelling or development. Significantly less training and tuning. Any size deployment: Meets every business need from a single user to large multilevel type user groups. Language independent: Search and analyze most of the world's languages using machine translation. 
Any kind of deployment: Use it from your desktop or in a private cloud. Buy the software-as-a-service or get the output-as-a-service. Enterprise-proven, IP &amp; IT friendly: Successfully delivering value to IP, business and markets in multinational companies. Integration: Use the KMX API to increase the value of unstructured data in your IP discovery infrastructure.
13. Part 2: KMX software: User Interface, key functions &amp; value
14. KMX: Model, Analyse, Discover and Visualize in one view and deploy it to large scale. Search and highlighting; Brushing; Filtering; Document text; Landscape visualization; Coloring of classification score. KMX Example: Ebola, SARS, Bird flu: How do they relate?
15. KMX: Optimize Output using Classification Performance Tuning. Precision and Recall; Document classification for three classes; Distribution of classification scores.
16. Use Case 1: Performing small to large scale SWOT analysis (on AstraZeneca patents). SWOT analysis example: Start with removing irrelevant patents using Classification and Filtering to determine: Who are the important players (assignees, inventors)? Where are the important patents filed (countries)? What is the trend over time (growth of patents over the years)? NB: we used a (very) simple query to find 986 patents filed under AstraZeneca. Diagram labels: Patent Database; Queries; +10.000 patents; Ranking; Filtering; 986 patents; 29 patents; Business User; Output.
17. Landscaping and Ranking: From 986 to the most relevant patents. Fig: Using visual selection (brushing) to build a classification model (Classifier) to be able to rank the full data set and to extract the most relevant.
18. Landscaping and Ranking: What are the most relevant Respiratory &amp; Inflammation patents? 
Yellow = most important patents (+80% score). Blue = least relevant patents (for this analysis). NB: crosshair points to 1 specific patent (full text in left pane). Fig: Ranked patents using a Classifier for Respiratory &amp; Inflammation patents (in yellow the selection of 29 absolutely relevant patents to be further analyzed). We used "respiratory" to demonstrate highlighting capabilities.
19. How Reliable &amp; Accurate are the results? Review your results with advanced performance tools. The quality of the automatic classification (categorization) is shown in the histogram, where a small number of documents with a high classification score are separated from the large number of documents. Fig: Classification performance, 1280 patents on biomass (non-relevant documents; relevant documents). KMX calculates the Precision and Recall of the results using cross-validation. Precision is essential for: First analysis &amp; Alerting services. Recall is crucial for: Freedom to Operate search, Validity search, Patentability search. Both need to be high for: Patent portfolio landscape analysis, Technology Exploration, Risk Assessments.
20. Use Case 2: Concept detection using document classification. Extracting concepts in context from classification of documents: 1. Visualization: multiple topic clusters 2. Select cluster: select documents with similar topics 3. Select training documents within the sub-cluster 4. Build Classifier and classify 5. Rank documents: find set of documents with related concepts 6. Extract concepts. KMX Example: Ebola, SARS, Bird flu: How do they relate?
21. Part 3: NEW: Content Dashboard (InfoApp). Integrated SaaS-based search, reporting, visualization and analysis
22. 
Role of KMX in Integrated Information Applications. Diagram labels: Information Consumers (+100 users); Client/Server; Mobile; Web; Reporting; Dashboard; Search; Alerting; Visualization; Exploring; Domain or Market Specific InfoApps (by Partners); Management, Development and Integration; Creators / Data Scientists (1-5 users); Text Mining; Text PreP; Stem/Token; Indexing; Clustering; Classification; Visualize; Tweets; Documents; Patent Data; Research Literature; Enterprise Content; Email; Text; Websites.
23. Content Dashboard: Content Driven Analytical solution. Easy-to-use access to Search, Reporting &amp; Analysis of content like Patents, Emails, Legislation, Application Notes, websites.
24. Content Dashboard: Content analytics beyond keyword search. Interactive taxonomy with multiple coupled views and advanced search in large sets of documents.
25. Content Dashboard: Built-in analytics &amp; interactive visualizations. Ad-hoc or standard interactive visualizations leading directly to the underlying documents or notes.
26. Part 4: NEW: KMX API for OEM partners: Put best-in-class content analytics in your solutions
27. Solutions built on KMX. KMX Empowers InfoApps (solution partners/OEM/VAR). Partner solutions: IP &amp; Patent Analytics; Media &amp; Publishing; HR; eDiscovery (Law &amp; Legislation); Fraud Detection; National Security &amp; Police; Sentiment analytics; CRM/Voice of Customer; Government; SharePoint (Enrich &amp; Migrate); Content-based Dashboards. KMX platform: Big Data Text Analytics (cloud based platform / API). Fig 1. McKinsey diagram showing the three technology layers of the Big Data technology stack.
28. 
KMX API for OEM: Embed Advanced Text Analytics in your solution.
Clustering: Provides users unsupervised analytics and automatically identifies inherent themes or information clusters. Through a dynamic hierarchical topic view into search results it enables users to quickly focus on annotated subjects rather than scrolling through long results lists.
Classification: Supervised analytics to help users automatically categorize large sets of documents. The Classification process can use a small number of document sets for learn-by-example categorization.
By sorting the content of documents by topic, relevancy and keywords, users can apply their own models or rules for classification. Terms can be extracted to use in building thesauri or taxonomies.
Visualization: Advanced visual knowledge discovery for displaying, exporting and sharing data results, ranked document lists, labeled and enriched data or interactive visualizations.
KMX API: XML-RPC and REST (JSON); Python Pickle protocol.
Server: User / Tenant mgt; User objects mgt (datasets, work spaces, classifiers, stop lists, ...); Databases: Oracle, PostgreSQL.
Client Application: Native Windows (for creating Analysis pipelines); using Qt for GUI; using OpenGL for visualizations.
Example Application Areas: Advanced Visualizations, Interactive Analytics, Text Disambiguation, Data Enrichment, Clickthrough Optimization, Concept Extraction, Automated Tagging, Semantic Discovery, Named Entity Recognition, Document Overlap Display, SWOT analysis, Sentiment Analysis, Predictive Analytics.
29. KMX enables information and knowledge professionals to gain faster, more reliable, more precise insights into large complex unstructured data sets, allowing them to make better informed decisions. Treparel is a leading technology solution provider in Big Data Text Analytics &amp; Visualization </p>
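
The slides describe a supervised workflow (tag a few relevant and irrelevant documents, build a classifier, rank all documents) and a quality check based on precision and recall. The sketch below illustrates that general idea in plain Python. It is not Treparel's implementation and does not use the KMX API: the slides do not say which algorithm KMX uses, so this example swaps in simple term-frequency vectors, cosine similarity and a centroid-difference score. All function names, the sample documents and the score threshold are hypothetical, and precision and recall are computed against a small hand-labelled set rather than by cross-validation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (stand-in for real preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_vector(text):
    """Sparse term-frequency vector for one document."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Summed term vector; cosine similarity ignores scale, so no averaging."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def rank(documents, relevant_ids, irrelevant_ids):
    """Score each document by similarity to the tagged relevant examples
    minus similarity to the tagged irrelevant ones; highest score first."""
    vecs = {doc_id: tf_vector(text) for doc_id, text in documents.items()}
    pos = centroid(vecs[i] for i in relevant_ids)
    neg = centroid(vecs[i] for i in irrelevant_ids)
    scores = {i: cosine(v, pos) - cosine(v, neg) for i, v in vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against a labelled set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical mini-corpus: three respiratory documents, two battery documents.
docs = {
    "d1": "inhaler for asthma and respiratory inflammation",
    "d2": "respiratory treatment reducing airway inflammation",
    "d3": "battery electrode for electric vehicles",
    "d4": "asthma inhaler dosing device",
    "d5": "vehicle battery charging circuit",
}

# Tag one relevant and one irrelevant example, then rank the whole set.
ranked = rank(docs, relevant_ids=["d1"], irrelevant_ids=["d3"])
top = [doc_id for doc_id, score in ranked if score > 0]  # threshold is arbitrary
precision, recall = precision_recall(top, {"d1", "d2", "d4"})

for doc_id, score in ranked:
    print(f"{doc_id}: {score:+.3f}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Moving the score threshold trades the two measures against each other, which matches the distinction the slides draw: a higher threshold favours precision (alerting, first analysis), a lower one favours recall (freedom-to-operate or validity searches).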