Science: Text and Language Dr Andy Evans

Science: Text and Language

Dr Andy Evans

Text analysis

Processing of text.

Natural language processing and statistics.

Processing text: Regex

Java regular expressions: java.util.regex

Regular expressions: powerful search, compare (and replace) tools.

(Other types of regex include direct replace options; in Java regex these are separate methods.)

Regex

Standard Java:

if ((email.indexOf("@") > 0) && (email.endsWith(".org"))) {
    return true;
}

Regex version:

if (email.matches("[A-Za-z]+@[A-Za-z]+\\.org")) return true;
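The two checks are not quite equivalent, which a small side-by-side run makes visible. The following is a minimal sketch, not part of the original slides: the class name, method names and sample addresses are all made up for illustration.

```java
class EmailCheck {
    // String-method version: '@' must appear after the first character,
    // and the address must end in ".org".
    static boolean isOrgEmailPlain(String email) {
        return email.indexOf("@") > 0 && email.endsWith(".org");
    }

    // Regex version: one or more letters, '@', one or more letters, then a literal ".org".
    static boolean isOrgEmailRegex(String email) {
        return email.matches("[A-Za-z]+@[A-Za-z]+\\.org");
    }

    public static void main(String[] args) {
        // Hypothetical sample addresses.
        String[] samples = {"andy@example.org", "a@@b.org", "andy@example.com"};
        for (String s : samples) {
            System.out.println(s + " plain=" + isOrgEmailPlain(s)
                    + " regex=" + isOrgEmailRegex(s));
        }
    }
}
```

The string-method version accepts "a@@b.org" because it only looks for one '@' and the ending, whereas the regex describes the whole shape of the string and so rejects it.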

Example components

[abc]          a, b, or c (simple class)
[^abc]         Any character except a, b, or c (negation)
[a-zA-Z]       a through z, or A through Z, inclusive (range)
[a-d[m-p]]     a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]   d, e, or f (intersection)
[a-z&&[^bc]]   a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]  a through z, and not m through p: [a-lq-z] (subtraction)
.              Any character (may or may not match line terminators)
\d             A digit: [0-9]
\D             A non-digit: [^0-9]
\s             A whitespace character: [ \t\n\x0B\f\r]
\S             A non-whitespace character: [^\s]
\w             A word character: [a-zA-Z_0-9]
\W             A non-word character: [^\w]
?              Once or not at all
*              Zero or more times
+              One or more times
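A few of these components can be tried out directly with String.matches(), which tests whether the whole string fits the pattern. This is a small sketch only; the class name, helper name and test strings are invented for illustration.

```java
class RegexComponents {
    // True if the entire input matches the regex.
    static boolean fits(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fits("[a-z&&[^bc]]", "a"));   // subtraction: 'a' is allowed
        System.out.println(fits("[a-z&&[^bc]]", "b"));   // subtraction: 'b' is excluded
        System.out.println(fits("[a-d[m-p]]", "n"));     // union: 'n' is in m-p
        System.out.println(fits("\\d+", "2024"));        // one or more digits
        System.out.println(fits("colou?r", "color"));    // 'u' once or not at all
        System.out.println(fits("\\w*", ""));            // zero or more word characters
    }
}
```

Note that inside a Java string literal each backslash has to be doubled, so the regex \d is written "\\d".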

Matching

Find all words that start with a number.

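import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \b anchors at a word boundary, \d requires a leading digit,
// and \w* takes the rest of the word.
Pattern p = Pattern.compile("\\b\\d\\w*");
Matcher m = p.matcher(stringToSearch);
while (m.find()) {
    String temp = m.group();
    System.out.println(temp);
}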
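For experimenting, the find() loop can be wrapped in a self-contained class. The sketch below assumes that a "word" means a run of \w characters and that "number" means a single leading digit; the class name and the sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitWords {
    // Word boundary, one digit, then any remaining word characters.
    private static final Pattern DIGIT_WORD = Pattern.compile("\\b\\d\\w*");

    // Collects every match that find() locates, scanning left to right.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DIGIT_WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "b12" is skipped: there is no word boundary between 'b' and '1'.
        System.out.println(find("Meet 3pm at 42nd street, room b12")); // prints [3pm, 42nd]
    }
}
```

Unlike String.matches(), which tests the whole string, Matcher.find() looks for the pattern anywhere in the text and can be called repeatedly to step through successive matches.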

Replacing

replaceFirst(String regex, String replacement)

replaceAll(String regex, String replacement)
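Both are methods on java.lang.String and return a new string, leaving the original unchanged. The following is a brief sketch of how they behave; the class name, method names and inputs are made up for the example.

```java
class ReplaceDemo {
    // Replaces every run of whitespace with a single space.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ");
    }

    // Replaces only the first run of digits with '#'.
    static String maskFirstNumber(String text) {
        return text.replaceFirst("\\d+", "#");
    }

    // $1 and $2 in the replacement refer to the first and second capturing groups.
    static String swapYearMonth(String text) {
        return text.replaceAll("(\\d+)-(\\d+)", "$2/$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("a  b\t\nc"));       // prints a b c
        System.out.println(maskFirstNumber("room 12, floor 3"));   // prints room #, floor 3
        System.out.println(swapYearMonth("2013-05"));              // prints 05/2013
    }
}
```

Because '$' and '\' have special meaning in the replacement string, a literal replacement containing them can be passed through Matcher.quoteReplacement() first.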

Regex

A good start is the tutorial at: http://docs.oracle.com/javase/tutorial/essential/regex/

Also Mehran Habibi’s Java Regular Expressions.

Natural Language Processing

A large part is Part of Speech (POS) tagging: marking up text into nouns, verbs, etc., usually based on the location in the text and other context rules.

These rules are often formulated using machine learning (of various kinds), training the program on corpora of marked-up text.

Used for:
Text understanding.
Knowledge capture and use.
Text forensics.
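To make the idea of context rules concrete, here is a deliberately tiny hand-written tagger. It is a toy sketch only: the lexicon, the rules, and the tag names (DET, PRON, NOUN, VERB, ADV, NUM) are all invented for illustration and are not a standard tagset or how any real library works. Real taggers learn such rules from marked-up corpora.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ToyTagger {
    // Hypothetical mini-lexicon of words whose tag is fixed.
    private static final Map<String, String> LEXICON = Map.of(
            "the", "DET", "a", "DET", "an", "DET",
            "she", "PRON", "he", "PRON", "it", "PRON",
            "is", "VERB", "was", "VERB");

    static List<String> tag(String sentence) {
        List<String> tags = new ArrayList<>();
        String previous = "";
        for (String word : sentence.trim().toLowerCase().split("\\s+")) {
            String tag;
            if (LEXICON.containsKey(word)) {
                tag = LEXICON.get(word);           // known word
            } else if (word.matches("\\d+")) {
                tag = "NUM";                       // all digits
            } else if (previous.equals("DET")) {
                tag = "NOUN";                      // context rule: word after a determiner
            } else if (word.matches("\\w+(ing|ed)")) {
                tag = "VERB";                      // suffix rule
            } else if (word.endsWith("ly")) {
                tag = "ADV";                       // suffix rule
            } else {
                tag = "NOUN";                      // fallback guess
            }
            tags.add(tag);
            previous = tag;
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tag("the dog quickly chased a cat")); // [DET, NOUN, ADV, VERB, DET, NOUN]
        System.out.println(tag("she is painting"));              // [PRON, VERB, VERB]
        System.out.println(tag("the painting"));                 // [DET, NOUN]
    }
}
```

The last two calls show why location matters: "painting" is tagged as a verb after "is" but as a noun after "the".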

NLP Libraries

Popular are:

Natural Language Toolkit (NLTK; Python): http://www.nltk.org/

OpenNLP (Java): http://opennlp.apache.org/index.html

OpenNLP

Sentence recognition and tokenising.
Name extraction (including placenames).
POS Tagging.
Text classification.

For clear examples, see the manual at: http://opennlp.apache.org/documentation.html
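To get a feel for what sentence recognition and tokenising involve, here is a naive regex-only sketch. It deliberately does not use the OpenNLP API (see the manual for that); the class name, method names and sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveSplitter {
    // A token is a run of word characters or a single punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Splits on whitespace that follows '.', '!' or '?'.
    static List<String> sentences(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
    }

    static List<String> tokens(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Regex is useful. Is NLP harder? Yes!"));
        System.out.println(tokens("Is NLP harder?")); // prints [Is, NLP, harder, ?]
        // The naive rule wrongly splits after the abbreviation "Dr.".
        System.out.println(sentences("Dr. Evans spoke."));
    }
}
```

The abbreviation case shows where fixed rules break down, which is one reason libraries such as OpenNLP rely on trained models for these tasks.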

Other info

Other than the Numerical Recipes books, the other classic texts are the volumes of Donald E. Knuth’s The Art of Computer Programming:
Fundamental Algorithms
Seminumerical Algorithms
Sorting and Searching
Combinatorial Algorithms

But at this stage, you’re better off getting…

Other info

Michael T. Goodrich and Roberto Tamassia’s Data Structures and Algorithms in Java.

Basic Java, arrays and lists.
Recursion in algorithms.
Key mathematical algorithms.
Algorithm analysis.
Data storage structures (stacks, queues, hashtables, binary trees, etc.)
Search and sort.
Text processing.
Graph/network analysis.
Memory management.