August 25, 2015
© Analytics Inside 2014-2015
Advanced.
Analytical.
Intelligence.
Big Data Text Analytics
Victoria Loewengart and Michael Covert
Agenda
• Introductions
• What is Text Analytics / Natural Language Processing
• Why Text Analytics is a Big Data problem
• Need for TA/NLP
• Basic TA/NLP Concepts
• TA big data implementation with traditional TA technologies
• Advanced TA/NLP Concepts
– Semantic relationships and ontologies
– Sentiment
– Clustering and topic extraction
• Big data topic extraction algorithms bake-off
• Summary and Conclusion
• Text analytics is another “old but now new again” trend
– Reading and understanding text
– Heavily reliant on machine learning
– Areas of focus:
• Sentiment analysis
• Extraction of “named entities”
– Connecting named entities through references, actions, etc.
• Grouping documents with similar characteristics
• Assigning documents to “topics”
• Clustering (similarity / trending)
Text Analytics and Natural Language Processing
Definitions
• Natural Language Processing (NLP) is understanding, analysis, manipulation, and/or generation of natural (spoken) languages.
• Computational Linguistics is the study of the applications of computers in processing and analyzing language, as in automatic machine translation and text analysis.
• Text Analytics is the process of deriving high-quality information from text.
• Text Mining is the process of discovering previously unknown information through text analytics.
• What does Big Data have to do with Text Analytics and Natural Language Processing?
– There are now a million words in the English language and about 3.4 billion combinations (N-grams)!
• This is clearly a Big Data problem
– Refining and improving language recognition benefits from massive amounts of data
• Original datasets were relatively small – 42,000 sentences
• Newer datasets are huge – 1,000,000 sentences and more
• Language processing is a classic “long tail” distribution
• Google “billion word” project
Big Data and TA/NLP
• More and more need exists – free text data is exploding due to social media, voice recognition, and other “automated” systems
• The belief is therefore that Big Data will provide better capabilities for understanding language.
• Machine learning has become key. Rule-based NLP is still used, but most new science is statistical.
– 14.7 words per day are added! Rules cannot be updated fast enough.
Need for TA/NLP
Using Big Data Technologies
• Extracting, ingesting, digitizing, and preparing the text for mining
– Connectivity to a broad spectrum of data sources
– Text ingestion and conversion
– Text preprocessing and preparation
• Mapping your use cases to linguistic, statistical, trained, and unsupervised techniques
– Text processing using linguistic rules
– Statistical text analysis
– Supervised and unsupervised techniques
• Enriching the data and analyzing the findings
– Post-processing and data enrichment with domain knowledge
– A UI for browsing, refining, and analysis
Basic Concepts
• Information Retrieval (IR) refers to the human-computer interaction (HCI) that happens when we use a machine to search a body of information for information objects (content) that match our search query. Depending on the sophistication of the algorithm, a person's query is matched against a set of documents to find a subset of 'relevant' documents.
• Information Extraction (IE) is extraction of specific information such as Named Entities, Events, and Facts.
• Metrics are Precision, Recall, and F-Measure
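These metrics can be computed directly from a gold-standard set and an extracted set. A minimal sketch (the entity names are made up for illustration):

```python
# Hypothetical gold-standard annotations vs. entities our extractor found.
gold = {"Acme Corp", "2015-08-25", "Columbus"}
found = {"Acme Corp", "Columbus", "Ohio"}

tp = len(gold & found)                    # true positives: correct extractions
precision = tp / len(found)               # how many extractions are correct
recall = tp / len(gold)                   # how many gold entities were found
f_measure = 2 * precision * recall / (precision + recall)
```

Precision and recall typically trade off against each other, which is why the F-measure (their harmonic mean) is reported as a single score.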
• Named Entities applicable to most domains:
– People names
– Organization names
– Dates
– Locations (countries, cities, continents / geographic terms)
– Currency
• Domain-specific named entities:
– Diseases, diagnoses, procedures, body parts
– Drugs, dosages, and usage
– Identifiers – SSN, driver’s license, claim number, domain name, URL
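Pattern-based extraction works well for the identifier class above. A minimal sketch (the regexes and sample text are illustrative only; entities such as people or diseases require trained models or gazetteers):

```python
import re

# Illustrative patterns; real identifier extraction needs more robust rules.
PATTERNS = {
    "SSN":  re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "URL":  re.compile(r"\bhttps?://\S+"),
}

def extract_entities(text):
    """Return (label, matched_text) pairs for every pattern hit."""
    return [(label, m.group()) for label, rx in PATTERNS.items()
            for m in rx.finditer(text)]

hits = extract_entities("Claim filed 8/25/2015, SSN 123-45-6789, see http://example.com/claim")
```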
Named Entity Extraction
The National Information Exchange Model (NIEM)
(Diagram: NIEM entity types, grouped)
– Person, Name, Address, Phone, Identification, License, Company, Vehicle, …
– Patient, Medical Provider, Hospital or Facility, Pharmaceutical, Diagnosis / Injury, Procedure, Pharmacy, Medical Report, Biometrics, …
– Police Report, Coroner Report, Arrest Record, Charge, Conviction, Enforcement Agency, Alias, Observation, Weapon, Criminal Method, …
– User ID, IP Address, Network Origination, Online Postings, Social Media Pages, Email, Text Messages, …
– Info Bearing Entity, Document, URL, Term, Concept, Sentiment, …
– Security Log, Web Log, Asset, Asset Class, HR Report, Encryption Method, …
– Financial Instrument, Event, Task, Language, Prediction, Inference, …
– Account, Credit Card, Policy, Claim, Lien, Title, …
Simple TA example using MapReduce
Parallelize NLP Operations
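On a single machine, the map/reduce pattern for an NLP operation (here, tokenization and word counting) can be sketched as follows; Hadoop applies the same split, the map step per document and a merge in the reduce step, across a whole cluster:

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_tokens(doc):
    """Map step: tokenize one document into lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", doc.lower()))

def reduce_counts(counters):
    """Reduce step: merge the per-document counts into one total."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

docs = ["Text analytics at scale", "Scale NLP with MapReduce"]
with ThreadPoolExecutor() as pool:      # each document is mapped concurrently
    counts = reduce_counts(pool.map(map_tokens, docs))
```

The key property is that the map step is independent per document, so it parallelizes trivially; only the reduce step needs to see the combined results.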
Relationships
• Relationships may occur through communication, friendship, advice, influence, or exchange. The two basic elements of a relationship network are links and nodes.
• Relationship analysis is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities.
• Semantic relations among words can be extracted from their textual context in natural languages.
• Graphs allow us to store the relationships between entities, and algorithms allow us to interrogate these connections.
Simple Relationships - Techniques
• Simple relationships are identified through co-location.
• Co-location is the instance of occurrence within a unit of text:
– Sentence
– Paragraph
– Document
• Metadata is relevant too – coauthors
• Topics are words that are assigned to a document that relate “concepts.”
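Co-location counting at the sentence level can be sketched as below; the per-sentence entity lists are hypothetical stand-ins for NER output:

```python
from itertools import combinations

# Entities already extracted per sentence (toy data standing in for NER output).
sentences = [
    ["Alice", "Acme"],
    ["Alice", "Bob", "Acme"],
    ["Bob", "Initech"],
]

# Count how often each entity pair co-occurs within one sentence; the counts
# become weighted edges in a relationship network.
pair_counts = {}
for entities in sentences:
    for a, b in combinations(sorted(set(entities)), 2):
        pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
```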
• Nouns are parsed into sentence structures
– Yields <subject> <verb> <object> relationships
– Can usually detect compound subjects and various verb inflective forms
– Captures modifiers (adjectives and adverbs) that can be used in sentiment or inversion
• Graph analysis and graph theory now come into play
– When documents and document sets are processed, a very large graph is typically created
Semantic Relationships
(Diagram: clusters of terms, graph structures, central terms)
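A minimal sketch of turning extracted relationships into a graph and picking a “central term” by degree (the edge list is hypothetical; real pipelines use a graph database or library):

```python
# Edges from entity co-occurrence (toy data); the graph is stored as
# adjacency sets, the in-memory analogue of an adjacency matrix.
edges = [("homeland", "threat"), ("homeland", "committee"),
         ("homeland", "qaeda"), ("threat", "qaeda")]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# Degree centrality: the most-connected node is a candidate central term.
central = max(graph, key=lambda node: len(graph[node]))
```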
Relationships – Example
• An Ontology is “a description of things that exist and how they relate to each other” (Chris Welty).
• An Ontology Model is:
– the classification of entities, and
– the modeling of the relationships between those entities.
Ontologies
Sentiment
• An opinion is a binary expression that consists of two key components:
– A target (which we shall call “topic”, as referred to by most social analytics tools);
– A sentiment on the target/topic, often accompanied by a probability.
• Sentiment analysis on content means discerning the opinions in content and picking the mood (attitude) within those opinions.
• A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral.
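A crude lexicon-based polarity classifier with negation inversion illustrates the basic task (the tiny lexicon is hypothetical; production systems use large resources or trained models):

```python
# Toy polarity lexicon and negator list, for illustration only.
LEXICON = {"good": 1, "great": 1, "helpful": 1, "bad": -1, "slow": -1, "awful": -1}
NEGATORS = {"not", "never", "no"}

def polarity(sentence):
    """Sum word polarities, flipping the sign after a negator (inversion)."""
    score, flip = 0, 1
    for word in sentence.lower().split():
        if word in NEGATORS:
            flip = -1                     # next opinion word is inverted
        elif word in LEXICON:
            score += flip * LEXICON[word]
            flip = 1
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

The negation handling shows why the modifiers captured during parsing matter: “helpful” and “not helpful” carry opposite polarity.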
Classification and Clustering
• Classification / Categorization
– The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically.
• Clustering
– Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering.
• Machine learning is used
– Supervised learning trains on “known results” (labeled examples)
– Unsupervised learning finds structure in unlabeled data
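A toy unsupervised pass illustrates clustering without labels: greedy single-pass clustering over term-frequency vectors using cosine similarity (the documents and the 0.3 threshold are illustrative, not a production algorithm):

```python
import math
from collections import Counter

def vectorize(doc):
    """Term-frequency vector for one document."""
    return Counter(doc.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

docs = ["patient chest pain", "patient disease history",
        "homeland security threat", "qaeda threat committee"]
vecs = [vectorize(d) for d in docs]

# Greedy pass: join a document to the first cluster whose seed document is
# similar enough, otherwise start a new cluster.
clusters = []
for i, v in enumerate(vecs):
    for cluster in clusters:
        if cosine(vecs[cluster[0]], v) > 0.3:
            cluster.append(i)
            break
    else:
        clusters.append([i])
```

Note that no labels were supplied: the two themes emerge purely from shared vocabulary, which is the essence of unsupervised document organization.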
Extracting topics
(Diagram: documents flow into clustering via LDA and CVB; each document is assigned cluster IDs with probabilities, and each cluster carries topics made up of terms, each with a probability.)
Extracting topics
Document Term Matrix
Space reduction, Latent Semantic Indexing, and eigenvectors
Reveals the most important terms in a set of documents
Note that this looks just like a graph adjacency matrix!
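Building a document-term matrix and reducing it with a truncated SVD, the core operation of Latent Semantic Indexing, can be sketched with NumPy (the documents are illustrative):

```python
import numpy as np

docs = ["patient pain disease", "patient disease chest",
        "homeland threat qaeda", "threat qaeda committee"]
vocab = sorted({w for d in docs for w in d.split()})

# Document-term matrix: rows are documents, columns are vocabulary terms.
dtm = np.array([[d.split().count(t) for t in vocab] for d in docs], dtype=float)

# Truncated SVD = LSI: keep the top singular vectors (eigen-directions of the
# term space). Rows of Vt are latent "topics"; U*S gives document coordinates.
U, S, Vt = np.linalg.svd(dtm, full_matrices=False)
doc_coords = U[:, :2] * S[:2]            # documents in a 2-D semantic space
```

Keeping only the leading singular vectors is the space reduction: it surfaces the dominant term patterns while discarding noise.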
• OpenNLP
– The Apache OpenNLP library is a machine learning based Java toolkit for the processing of natural language text.
• NLTK
– A Python library. Provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
• Stanford NLP
– Java libraries for statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems.
Open Source NLP Libraries
• All provide trained machine learning models for NLP processing
• Low level building blocks that can be wrapped in “Big Data Technologies”
• Run NLP operations in parallel
Open Source NLP Libraries
Topic Extraction Algorithm Bake-Off
• We ran the Mahout CVB and MLLib LDA topic extraction algorithms against the same set of 11 documents describing (1) terrorism and (2) healthcare
– Mahout runs on Hadoop MapReduce
– MLLib runs on Spark
• Documents are copied into HDFS
– Stop list is employed
Mahout – An example
Running a CVB example
# Create sequence files from the text files
mahout seqdirectory -i docs -o sequencefiles/ -c UTF-8 -chunk 5

# Generate vectors from sequence files and calculate the weights of the terms
mahout seq2sparse -i sequencefiles/ -o vectors/ -ow -wt tfidf -x 4800 -nv

# Create the matrix
mahout rowid -i vectors/tfidf-vectors -o matrix

# Run CVB
mahout cvb -i matrix/matrix -o lda_output -mt lda_output/models -dt lda_output/docTopics -k 2 -nt --maxIter 10 --num_terms 10000

# Dump the results
mahout vectordump -i lda_output/final -d vectors/dictionary.file-0 -dt sequencefile --vectorSize 10 -sort TRUE
MLLib – An example
• Running the MLLib LDA example
– ./bin/run-example mllib.LDAExample --stopwordFile stoplist/stopwords.txt docs --k 2
• Other options include:
– --maxIterations <value> – number of iterations of learning. default: 10
– --docConcentration <value> – amount of topic smoothing to use (> 1.0) (-1=auto). default: -1.0
– --topicConcentration <value> – amount of term (word) smoothing to use (> 1.0) (-1=auto). default: -1.0
– --vocabSize <value> – number of distinct word types to use, chosen by frequency. (-1=all) default: 10000
– --checkpointDir <value> – directory for checkpointing intermediate results. Checkpointing helps with recovery and eliminates temporary shuffle files on disk. default: None
– --checkpointInterval <value> – iterations between each checkpoint. Only used if checkpointDir is set. default: 10
Comparison

MLLib (59.962 seconds)
Topic 0: homeland, committee, threat, somalia, minneapolis, qaeda, american, leaders, radicalization, security
Topic 1: patient, pain, history, disease, chest, upper, skin, sickle, normal, years

Mahout (32 minutes)
Topic 0: al, shabaab, muslim, homeland, qaeda, committee, american, u.s., our, threat
Topic 1: patient, pain, she, disease, chest, upper, her, normal, skin, pulmonary
Summary
– Text Analytics is a “new again” science that is important for understanding the meaning of unstructured text
– Text Analytics is a Big Data problem
– Traditional TA techniques can be used with Big Data technologies
– Machine learning is at the core of Text Analytics
– Major Big Data technologies (Spark, Hadoop) support ML libraries for clustering and topic extraction
Questions and Answers
Victoria.Loewengart@AnalyticsInside.us
Michael.Covert@AnalyticsInside.us
http://www.AnalyticsInside.us