August 25, 2015
© Analytics Inside 2014-2015
Advanced.
Analytical.
Intelligence.
Big Data Text Analytics
Victoria Loewengart and Michael Covert
Agenda
• Introductions
• What is Text Analytics / Natural Language Processing
• Why Text Analytics is a Big Data problem
• Need for TA/NLP
• Basic TA/NLP Concepts
• TA big data implementation with traditional TA technologies
• Advanced TA/NLP Concepts
– Semantic relationships and ontologies
– Sentiment
– Clustering and topic extraction
• Big data topic extraction algorithms bake-off
• Summary and Conclusion
• Text analytics is another “old but now new again” trend
– Reading and understanding text
– Heavily reliant on machine learning
– Areas of focus:
• Sentiment analysis
• Extraction of “named entities”
– Connecting named entities through references, actions, etc.
• Grouping documents with similar characteristics
• Assigning documents to “topics”
• Clustering (similarity / trending)
Text Analytics and Natural Language Processing
Definitions
• Natural Language Processing (NLP) is understanding, analysis, manipulation, and/or generation of natural (spoken) languages.
• Computational Linguistics is the study of the applications of computers in processing and analyzing language, as in automatic machine translation and text analysis.
• Text Analytics is the process of deriving high-quality information from text.
• Text Mining is the process of discovering previously unknown information through text analytics.
• What does Big Data have to do with Text Analytics and Natural Language Processing?
– There are now a million words in the English language and about 3.4 billion combinations (N-grams)!
• This is clearly a Big Data problem
– Refining and improving language recognition benefits from massive amounts of data
• Original datasets were relatively small – 42,000 sentences
• Newer datasets are huge – 1,000,000 sentences and more
• Language processing is a classic “long tail” distribution
• Google “billion word” project
Big Data and TA/NLP
• More and more need exists – free text data is exploding due to social media, voice recognition, and other “automated” systems
• The belief is therefore that Big Data will provide better capabilities for understanding language.
• Machine learning has become key. Rule-based NLP is still used, but most new science is statistical.
– 14.7 words per day are added! Rules cannot be updated fast enough.
Need for TA/NLP
Using Big Data Technologies
• Extracting, ingesting, digitizing, and preparing the text for mining
– Connectivity to a broad spectrum of data sources
– Text ingestion and conversion
– Text preprocessing and preparation
• Mapping your use cases to linguistic, statistical, trained, and unsupervised techniques
– Text processing using linguistic rules
– Statistical text analysis
– Supervised and unsupervised techniques
• Enriching the data and analyzing the findings
– Post-processing and data enrichment with domain knowledge
– A UI for browsing, refining, and analysis
Basic Concepts
• Information Retrieval (IR) refers to the human-computer interaction (HCI) that happens when we use a machine to search a body of information for information objects (content) that match our search query. Depending on the sophistication of the algorithm, a person's query is matched against a set of documents to find a subset of 'relevant' documents.
• Information Extraction (IE) is extraction of specific information such as Named Entities, Events, and Facts.
• Metrics are Precision, Recall, and F-Measure
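These metrics can be computed directly from a gold-standard set and an extracted set. A minimal sketch (the entity names are made up for illustration):

```python
# Hypothetical gold-standard annotations vs. entities our extractor found.
gold = {"Acme Corp", "2015-08-25", "Columbus"}
found = {"Acme Corp", "Columbus", "Ohio"}

tp = len(gold & found)                    # true positives: correct extractions
precision = tp / len(found)               # how many extractions are correct
recall = tp / len(gold)                   # how many gold entities were found
f_measure = 2 * precision * recall / (precision + recall)
```

Precision and recall typically trade off against each other, which is why the F-measure (their harmonic mean) is reported as a single score.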
• Named Entities applicable to most domains:
– People names
– Organization names
– Dates
– Locations (countries, cities, continents / geographic terms)
– Currency
• Domain-specific named entities:
– Diseases, diagnoses, procedures, body parts
– Drugs, dosages, and usage
– Identifiers – SSN, driver’s license, claim number, domain name, URL
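Pattern-based extraction works well for the identifier class above. A minimal sketch (the regexes and sample text are illustrative only; entities such as people or diseases require trained models or gazetteers):

```python
import re

# Illustrative patterns; real identifier extraction needs more robust rules.
PATTERNS = {
    "SSN":  re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "URL":  re.compile(r"\bhttps?://\S+"),
}

def extract_entities(text):
    """Return (label, matched_text) pairs for every pattern hit."""
    return [(label, m.group()) for label, rx in PATTERNS.items()
            for m in rx.finditer(text)]

hits = extract_entities("Claim filed 8/25/2015, SSN 123-45-6789, see http://example.com/claim")
```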
Named Entity Extraction
The National Information Exchange Model (NIEM)
(Diagram: NIEM entity types, grouped)
– Person, Name, Address, Phone, Identification, License, Company, Vehicle, …
– Patient, Medical Provider, Hospital or Facility, Pharmaceutical, Diagnosis / Injury, Procedure, Pharmacy, Medical Report, Biometrics, …
– Police Report, Coroner Report, Arrest Record, Charge, Conviction, Enforcement Agency, Alias, Observation, Weapon, Criminal Method, …
– User ID, IP Address, Network Origination, Online Postings, Social Media Pages, Email, Text Messages, …
– Info Bearing Entity, Document, URL, Term, Concept, Sentiment, …
– Security Log, Web Log, Asset, Asset Class, HR Report, Encryption Method, …
– Financial Instrument, Event, Task, Language, Prediction, Inference, …
– Account, Credit Card, Policy, Claim, Lien, Title, …
Simple TA example using MapReduce
Parallelize NLP Operations
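On a single machine, the map/reduce pattern for an NLP operation (here, tokenization and word counting) can be sketched as follows; Hadoop applies the same split, the map step per document and a merge in the reduce step, across a whole cluster:

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_tokens(doc):
    """Map step: tokenize one document into lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", doc.lower()))

def reduce_counts(counters):
    """Reduce step: merge the per-document counts into one total."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

docs = ["Text analytics at scale", "Scale NLP with MapReduce"]
with ThreadPoolExecutor() as pool:      # each document is mapped concurrently
    counts = reduce_counts(pool.map(map_tokens, docs))
```

The key property is that the map step is independent per document, so it parallelizes trivially; only the reduce step needs to see the combined results.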
Relationships
• Relationships may occur through communication, friendship, advice, influence, or exchange. The two basic elements of a relationship network are links and nodes.
• Relationship analysis is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities.
• Semantic relations among words can be extracted from their textual context in natural languages.
• Graphs allow us to store the relationships between entities, and algorithms allow us to interrogate these connections.
Simple Relationships - Techniques
• Simple relationships are identified through co-location.
• Co-location is the instance of occurrence within a unit of text:
– Sentence
– Paragraph
– Document
• Metadata is relevant too – coauthors
• Topics are words that are assigned to a document that relate “concepts.”
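Co-location counting at the sentence level can be sketched as below; the per-sentence entity lists are hypothetical stand-ins for NER output:

```python
from itertools import combinations

# Entities already extracted per sentence (toy data standing in for NER output).
sentences = [
    ["Alice", "Acme"],
    ["Alice", "Bob", "Acme"],
    ["Bob", "Initech"],
]

# Count how often each entity pair co-occurs within one sentence; the counts
# become weighted edges in a relationship network.
pair_counts = {}
for entities in sentences:
    for a, b in combinations(sorted(set(entities)), 2):
        pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
```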
• Nouns are parsed into sentence structures
– Yields <subject> <verb> <object> relationships
– Can usually detect compound subjects and various verb inflective forms
– Captures modifiers (adjectives and adverbs) that can be used in sentiment or inversion
• Graph analysis and graph theory now come into play
– When documents and document sets are processed, a very large graph is typically created
Semantic Relationships
(Diagram: clusters of terms, graph structures, central terms)
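A minimal sketch of turning extracted relationships into a graph and picking a “central term” by degree (the edge list is hypothetical; real pipelines use a graph database or library):

```python
# Edges from entity co-occurrence (toy data); the graph is stored as
# adjacency sets, the in-memory analogue of an adjacency matrix.
edges = [("homeland", "threat"), ("homeland", "committee"),
         ("homeland", "qaeda"), ("threat", "qaeda")]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# Degree centrality: the most-connected node is a candidate central term.
central = max(graph, key=lambda node: len(graph[node]))
```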
Relationships – Example
• An Ontology is “a description of things that exist and how they relate to each other” (Chris Welty).
• An Ontology Model is:
– the classification of entities, and
– the modeling of the relationships between those entities.
Ontologies
Sentiment
• An opinion is a binary expression that consists of two key components:
– A target (which we shall call “topic”, as referred to by most social analytics tools);
– A sentiment on the target/topic, often accompanied by a probability.
• Sentiment analysis on content means discerning the opinions in content and picking the mood (attitude) within those opinions.
• A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral.
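A crude lexicon-based polarity classifier with negation inversion illustrates the basic task (the tiny lexicon is hypothetical; production systems use large resources or trained models):

```python
# Toy polarity lexicon and negator list, for illustration only.
LEXICON = {"good": 1, "great": 1, "helpful": 1, "bad": -1, "slow": -1, "awful": -1}
NEGATORS = {"not", "never", "no"}

def polarity(sentence):
    """Sum word polarities, flipping the sign after a negator (inversion)."""
    score, flip = 0, 1
    for word in sentence.lower().split():
        if word in NEGATORS:
            flip = -1                     # next opinion word is inverted
        elif word in LEXICON:
            score += flip * LEXICON[word]
            flip = 1
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

The negation handling shows why the modifiers captured during parsing matter: “helpful” and “not helpful” carry opposite polarity.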
Classification and Clustering
• Classification / Categorization
– The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically.
• Clustering
– Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering.
• Machine learning is used
– Supervised learning trains on “known results” (labeled examples)
– Unsupervised learning finds structure in unlabeled data
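A toy unsupervised pass illustrates clustering without labels: greedy single-pass clustering over term-frequency vectors using cosine similarity (the documents and the 0.3 threshold are illustrative, not a production algorithm):

```python
import math
from collections import Counter

def vectorize(doc):
    """Term-frequency vector for one document."""
    return Counter(doc.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

docs = ["patient chest pain", "patient disease history",
        "homeland security threat", "qaeda threat committee"]
vecs = [vectorize(d) for d in docs]

# Greedy pass: join a document to the first cluster whose seed document is
# similar enough, otherwise start a new cluster.
clusters = []
for i, v in enumerate(vecs):
    for cluster in clusters:
        if cosine(vecs[cluster[0]], v) > 0.3:
            cluster.append(i)
            break
    else:
        clusters.append([i])
```

Note that no labels were supplied: the two themes emerge purely from shared vocabulary, which is the essence of unsupervised document organization.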
Extracting topics
(Diagram: documents flow into clustering via LDA and CVB; each document is assigned cluster IDs with probabilities, and each cluster carries topics made up of terms, each with a probability.)
Extracting topics
Document Term Matrix
Space reduction, Latent Semantic Indexing, and eigenvectors
Reveals the most important terms in a set of documents
Note that this looks just like a graph adjacency matrix!
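Building a document-term matrix and reducing it with a truncated SVD, the core operation of Latent Semantic Indexing, can be sketched with NumPy (the documents are illustrative):

```python
import numpy as np

docs = ["patient pain disease", "patient disease chest",
        "homeland threat qaeda", "threat qaeda committee"]
vocab = sorted({w for d in docs for w in d.split()})

# Document-term matrix: rows are documents, columns are vocabulary terms.
dtm = np.array([[d.split().count(t) for t in vocab] for d in docs], dtype=float)

# Truncated SVD = LSI: keep the top singular vectors (eigen-directions of the
# term space). Rows of Vt are latent "topics"; U*S gives document coordinates.
U, S, Vt = np.linalg.svd(dtm, full_matrices=False)
doc_coords = U[:, :2] * S[:2]            # documents in a 2-D semantic space
```

Keeping only the leading singular vectors is the space reduction: it surfaces the dominant term patterns while discarding noise.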
• OpenNLP
– The Apache OpenNLP library is a machine learning based Java toolkit for the processing of natural language text.
• NLTK
– A Python library. Provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
• Stanford NLP
– Java libraries for statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems.
Open Source NLP Libraries
• All provide trained machine learning models for NLP processing
• Low level building blocks that can be wrapped in “Big Data Technologies”
• Run NLP operations in parallel
Open Source NLP Libraries
Topic Extraction Algorithm Bake-Off
• We ran the Mahout CVB and MLLib LDA topic extraction algorithms against the same set of 11 documents describing (1) terrorism and (2) healthcare
– Mahout runs on Hadoop MapReduce
– MLLib runs on Spark
• Documents are copied into HDFS
– Stop list is employed
Mahout – An example
Running a CVB example
# Create sequence files from the text files
mahout seqdirectory -i docs -o sequencefiles/ -c UTF-8 -chunk 5

# Generate vectors from sequence files and calculate the weights of the terms
mahout seq2sparse -i sequencefiles/ -o vectors/ -ow -wt tfidf -x 4800 -nv

# Create the matrix
mahout rowid -i vectors/tfidf-vectors -o matrix

# Run CVB
mahout cvb -i matrix/matrix -o lda_output -mt lda_output/models -dt lda_output/docTopics -k 2 -nt --maxIter 10 --num_terms 10000

# Dump the results
mahout vectordump -i lda_output/final -d vectors/dictionary.file-0 -dt sequencefile --vectorSize 10 -sort TRUE
MLLib – An example
• Running the MLLib LDA example
– ./bin/run-example mllib.LDAExample --stopwordFile stoplist/stopwords.txt docs --k 2
• Other options include:
– --maxIterations <value> – number of iterations of learning. default: 10
– --docConcentration <value> – amount of topic smoothing to use (> 1.0) (-1=auto). default: -1.0
– --topicConcentration <value> – amount of term (word) smoothing to use (> 1.0) (-1=auto). default: -1.0
– --vocabSize <value> – number of distinct word types to use, chosen by frequency. (-1=all) default: 10000
– --checkpointDir <value> – directory for checkpointing intermediate results. Checkpointing helps with recovery and eliminates temporary shuffle files on disk. default: None
– --checkpointInterval <value> – iterations between each checkpoint. Only used if checkpointDir is set. default: 10
Comparison

MLLib (59.962 seconds)
Topic 0: homeland, committee, threat, somalia, minneapolis, qaeda, american, leaders, radicalization, security
Topic 1: patient, pain, history, disease, chest, upper, skin, sickle, normal, years

Mahout (32 minutes)
Topic 0: al, shabaab, muslim, homeland, qaeda, committee, american, u.s., our, threat
Topic 1: patient, pain, she, disease, chest, upper, her, normal, skin, pulmonary
Summary
– Text Analytics is a “new again” science that is important for understanding the meaning of unstructured text
– Text Analytics is a Big Data problem
– Traditional TA techniques can be used with Big Data technologies
– Machine learning is at the core of Text Analytics
– Major Big Data technologies (Spark, Hadoop) support ML libraries for clustering and topic extraction
Questions and Answers
Victoria.Loewengart@AnalyticsInside.us
Michael.Covert@AnalyticsInside.us
http://www.AnalyticsInside.us