Introducing natural language processing (NLP) with R
- 1. Introducing NLP with R 10/6/14, 19:37 Introducing NLP with R
Charlie Redmon | SupStat Analytics Copyright Supstat Inc. All
Rights Reserved http://docs.supstat.com/NLPwithR/#1 Page 1 of
26
- 2. Outline
  - Introduction to NLP
  - Foundational Frameworks
  - Working with text in R
  - Regular Expressions
    - As pattern matching device
    - Theoretical connection with finite state automaton
    - Application in morphological analysis
  - N-gram models
    - Recognizing language
    - Generating language
  - Further reading
- 3. What is NLP?
  Natural Language Processing
  - Briefly: building models to facilitate human-computer interaction through language
  - We say natural language here to distinguish languages like English, Hungarian, and Bengali from computer languages and other invented communication systems (e.g. Morse code)
  Major sub-disciplines:
  - Speech Recognition/Synthesis
  - Computational Morphology (word structure)
  - Lexical Semantics (word meaning)
  - Computational Syntax (phrase/sentence structure)
  - Compositional Semantics (phrase/sentence meaning)
  - Information Retrieval
- 4. Why R?
  - R has powerful text processing capabilities
  - Many useful NLP-related packages
  - Many of the more sophisticated procedures in NLP generalize to statistical models, which is where R really excels
- 5. Foundational NLP Frameworks
  - Turing: Turing Machine
    - Finite State Automaton, Finite State Transducer
  - Kleene: Regular Expressions
  - Chomsky: Regular Languages and their relation to natural languages
  - Markov
    - N-gram models
    - HMMs
  - Shannon: Information Theory
    - Noisy Channel, Entropy models
- 6. The Workflow
  1. Import and manipulate text in R
  2. Create data structures facilitating NLP operations
  3. Model implementation:
     - Morphological parsing
     - N-gram parsing
     - N-gram language generation
     - ...
- 7. Importing text into R
  Primary importing functions: scan(), readLines()

      monty_text = scan('data/grail.txt', what="character", sep="", quote="")
      monty_text[1:6]
      [1] "SCENE"  "1:"     "[wind]" "[clop"  "clop"   "clop]"

      malayalam_text = scan('data/mathrubhumi_2014-10_full.txt', what="character", sep="", quote="")
      malayalam_text[15:20]
      [1] "#Date:" "01-10-2014"
      [3] "#----------------------------------------" "kt"
      [5] "++n" "+n"

  Why might this data structure be a problem for many natural language structures?
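The slide names readLines() but only demonstrates scan(). As a minimal sketch of the difference, the example below writes two made-up lines to a temporary file (the sample text and variable names are assumptions for illustration, not the data used in the talk) and reads them back both ways.

```r
# Hypothetical sample text written to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("KING ARTHUR: Whoa there!", "SOLDIER: Halt! Who goes there?"), tmp)

# scan() with sep="" splits on any whitespace, giving one element per token.
tokens <- scan(tmp, what = "character", sep = "", quote = "", quiet = TRUE)

# readLines() keeps each line of the file as a single element.
text_lines <- readLines(tmp)

length(tokens)      # 9 whitespace-delimited tokens
length(text_lines)  # 2 lines
tokens[1:3]         # "KING" "ARTHUR:" "Whoa"
```

Which function is preferable depends on the unit the later analysis needs: scan() discards line structure, while readLines() preserves it.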
- 8. Condensing to single text stream

      monty_text = paste(monty_text, collapse=" ")
      malayalam_text = paste(malayalam_text, collapse=" ")
      length(monty_text); length(malayalam_text)
      [1] 1
      [1] 1
      substr(monty_text, 1, 70)
      [1] "SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop clop c"
      substr(malayalam_text, 304, 400)
      [1] "4 cc dt cn .... +n .. D. D"
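One reason to collapse the token vector, sketched here under the assumption that later steps search for multi-word patterns: a pattern that spans a space can never match inside a single whitespace-delimited token, but it can match in the collapsed stream. The token vector below is copied from the first six elements shown earlier; the variable names are illustrative.

```r
# First six tokens from the scan() output shown in the slides.
tokens <- c("SCENE", "1:", "[wind]", "[clop", "clop", "clop]")

# A two-word pattern cannot match within any individual token.
match_in_tokens <- any(grepl("clop clop", tokens, fixed = TRUE))  # FALSE

# After collapsing, the same pattern is found in the single string.
stream <- paste(tokens, collapse = " ")
match_in_stream <- grepl("clop clop", stream, fixed = TRUE)       # TRUE

length(stream)  # 1: the whole text is now one element
```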
- 9. Regular Expressions

      SYMBOL   MEANING                               EXAMPLE
      []       Disjunction (set)                     /[Gg]oogle/ = Google, google
      ?        0 or 1 of the preceding character     /savou?r/ = savor, savour
      *        0 or more of the preceding character  /hey!*/ = hey, hey!, hey!!, ...
      \        Escape character                      /hey\?/ = hey?
      +        1 or more of the preceding character  /a+h/ = ah, aah, aaah, ...
      {n,m}    n to m repetitions                    /a{1,4}h{1,3}/ = aahh, ahhh, ...
      .        Wildcard (any character)              /#.*/ = #rstats, #uofl, ...
      ()       Conjunction (grouping)                /(ha)+/ = ha, haha, hahaha, ...
      [^ ]     NOT (negates bracketed chars)         /[^#.*]/ = any one character except #, ., or *
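The patterns in this table can be tried directly in R with grepl(). The sketch below uses a made-up test vector (an assumption for illustration) and adds the anchors ^ and $ so that each pattern must match the whole string. Note that a backslash has to be doubled inside an R string literal.

```r
# Hypothetical test strings covering the examples in the table.
x <- c("Google", "google", "savor", "savour", "hey", "hey!!",
       "hey?", "aaah", "hahaha", "#rstats")

x[grepl("[Gg]oogle", x)]        # "Google" "google"
x[grepl("^savou?r$", x)]        # "savor" "savour"
x[grepl("^hey!*$", x)]          # "hey" "hey!!"
x[grepl("hey\\?", x)]           # "hey?"  (regex hey\? written as "hey\\?" in R)
x[grepl("^a{1,4}h{1,3}$", x)]   # "aaah"
x[grepl("^(ha)+$", x)]          # "hahaha"
x[grepl("^#.*", x)]             # "#rstats"
```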
- 10. Regular Expressions

      SYMBOL   MEANING                            EXAMPLE
      [x-y]    Match characters from 'x' to 'y'   /[A-Z][1-9]/ = A1, Q8, X5, ...
      \w       Word character (alphanumeric)      /\w's/ = that's, Jerry's, ...
      \W       Non-word character
      \d       Digit character (0-9)              /\d{3}/ = 137, 254, ...
      \D       Non-digit character
      \s       Whitespace                         /\w+\s+\w+/ = I am, I  am, ...
      \S       Non-whitespace
      \b       Word boundary                      /\bthe\b/ = the, not then
      \B       Non-word boundary
      ^        Beginning of line                  /^[a-z]/ = non-capitalized beg.
      $        End of line                        /#.*$/ = hashtags at end of line
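A brief sketch of how the character classes and anchors behave in R. The sample sentence and the helper name all_matches are assumptions introduced for illustration. The example passes perl = TRUE so that \w, \d, and \b follow Perl-compatible semantics, and it uses \w+'s rather than \w's so the whole possessive is returned.

```r
# Hypothetical sample sentence.
s <- "Jerry's cat paid 137 dollars, then the dog paid 254 #rstats"

# Hypothetical helper: return every match of a pattern within one string.
all_matches <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

all_matches("\\d{3}", s)      # "137" "254"
all_matches("\\w+'s", s)      # "Jerry's"
all_matches("\\bthe\\b", s)   # "the"        (one match; "then" is skipped)
all_matches("the", s)         # "the" "the"  (also matches inside "then")
all_matches("#.*$", s)        # "#rstats"
grepl("^[a-z]", s)            # FALSE: the line starts with a capital letter
```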
- 11. Manual segmentation
  The advantage of having all the text in a single element is that we can now split the text into different-sized segments for different kinds of natural language tasks.

      # sentence level
      pattern = "(?