Going Digital Workshop:
Text Mining in Digital Collections
Introduction (Part 1)
Maha Althobaiti, Udo Kruschwitz, Deirdre Lungley, Massimo Poesio
School of Computer Science and Electronic Engineering, University of [email protected]
6 February 2013
Text Mining (Introduction) - February 2013 1
Brief Outline
Part 1: Introduction to natural language engineering (text mining)
- Motivation
- What is it?
- State of the art
Part 2: Some fundamentals
Part 3: Glimpse into sample applications:
- Web search
- Question answering
Example - Machine Translation (Original Page)
Example - Machine Translation
http://babelfish.yahoo.com/
Example II - Information Extraction
Example III - Searching a Web Site
Early NLP Systems
ELIZA (e.g. http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm)
- Weizenbaum 1966
- Pattern matching
PARRY
- Colby et al. 1971
- Pattern matching and planning
SHRDLU
- Winograd 1972
- Natural language understanding
- Comprehensive grammar of English
What is NLE?
Natural Language ...
- Natural Language Processing
- Computational Linguistics
- Information Retrieval
- Information Extraction
... Engineering
- Combination of techniques
- Real-world applications
- Using existing tools and resources
Why is it Interesting?
- Web search engines
- Information systems
- Wireless communication
- Text classification
- Text summarization
- Filtering
- Human-computer dialogues
- Computer-assisted language learning
- Generation (e.g. AI in games)
- Machine translation
- Speech processing
... and more ...
Why is it Difficult?
Ambiguity
- bank, book, ruler ...
- I saw the man with the telescope on the hill.
- Five monkeys ate three bananas.
- Time flies like an arrow.
- Lassen Sie uns noch einen Termin ausmachen. (German: "Let us arrange another appointment"; "ausmachen" is ambiguous between "arrange" and "turn off")
Language is changing
- I want to buy a mobile.
- Google it!
Ill-formed input
- accomodation office
Multilinguality
- Claudia Schiffer
Example: Ambiguity + Multilinguality
Example: Ambiguity + Multilinguality II
Various Perspectives
Linguist: How can natural language phenomena be described?
Cognitive scientist: How does the human mind work?
Computer scientist: How can a theory be efficiently implemented?
... we look at it from a computer scientist’s perspective
Two Paradigms
Symbolic (deep) processing
- hand-coded rules
- knowledge-rich
Statistical (shallow) processing
- build up a statistical model of language
- data-rich
... most powerful approaches nowadays are statistical!
NLE: Where we are
Mature, everyday technology
- E.g.: tokenization, normalization, regular expression search
Solid technology that can still be improved
- E.g.: lemmatization; spelling correctors; IR / Web search; speech synthesis
Used in real applications, but needs improvement
- E.g.: POS tagging; term extraction; summarization; speech recognition; text classification (e.g., for spam detection)
- Spoken dialogue systems for simple information seeking (railways, phone)
“Almost there” technology
- E.g.: information extraction, generation systems, simple speech translation systems
Pie in the sky
- E.g.: full machine translation, more advanced dialogue
Part I - Mature Technologies
Research in NLE has been going on for many years and in many forms - e.g., as part of compiler technology, information retrieval, etc.
The results of this work are a number of well-established technologies that are hardly considered “research” anymore.
But: these techniques are by no means trivial, as you will find out when you write your own little NLE tools ...
Regular Expressions for Search, Validation and Parsing
Basics (e.g. search in Google)
- cat OR dog
- "Regular * in Java"
More advanced (e.g., regular expressions in Perl, Java)
- /[Ww]ordnet/
- /colou?r/
- /[a-zA-Z]*/
- /[^A-Z]/
Note also: substitution s/colour/color
ELIZA: s/.* I am (sad|depressed).*/I AM SORRY TO HEAR YOU ARE \1/
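The patterns above can be tried directly with Python's re module. A quick sketch; the sample strings are made up for illustration:

```python
import re

# Character classes and optional characters, as on the slide
assert re.search(r"[Ww]ordnet", "the Wordnet lexicon")
assert re.fullmatch(r"colou?r", "color") and re.fullmatch(r"colou?r", "colour")

# Substitution: s/colour/color
print(re.sub(r"colour", r"color", "colour chart"))  # color chart

# An ELIZA-style rewrite rule using a capture group
reply = re.sub(r".* I am (sad|depressed).*",
               r"I AM SORRY TO HEAR YOU ARE \1",
               "Well, I am sad today")
print(reply)  # I AM SORRY TO HEAR YOU ARE sad
```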
Part II: Solid Technology
Technology that could still use improvements
Over the last ten to twenty years, new applications have appeared which are by now fairly well established, but whose results are still not 100% accurate (nor is it clear they ever will be!)
Stemming, Lemmatization and Morphological analysis
Stemming: FOXES → FOX
Lemmatization:
- ’passing’ → ’pass’ +ING
- ’were’ → ’be’ +PAST
- Used for: Information Retrieval (e.g. Google)
More general morphological analysis:
- Gebäudereinigungsfirmenangestellter → Gebäude + Reinigung + Firma + Angestellter (building + cleaning + company + employee)
- Helsingin sanomat → Helsinki +GEN + sanoma +PLURAL
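The idea of stemming can be sketched with a deliberately tiny suffix stripper. The suffix list below is invented for illustration; it is not the Porter algorithm or any production stemmer:

```python
def crude_stem(word):
    """Strip one of a few common English suffixes, keeping a stem of
    at least three letters -- a toy illustration of stemming only."""
    for suffix in ("ments", "ment", "ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("foxes"))         # fox
print(crude_stem("developing"))    # develop
print(crude_stem("developments"))  # develop
```

A real system would add rules for irregular forms ('were' → 'be' needs lemmatization, not suffix stripping).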
Part of Speech Tagging
Assign a part of speech to each word:
- ’dog’ → NOUN
- ’eat’ → VERB
- Book/VB that/DT flight/NN
Applications: all over the place!
- Information retrieval
- Information extraction
- Question answering
- Translation
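The simplest possible tagger is a lexicon lookup. The lexicon below is hand-written for this sketch; real taggers (e.g. the Brill tagger mentioned later) use context, since words like "book" are ambiguous between noun and verb:

```python
# A toy lookup tagger: one fixed tag per word, "NN" as the fallback.
# Real taggers resolve ambiguity from context; this one cannot.
LEXICON = {"book": "VB", "that": "DT", "flight": "NN",
           "dog": "NN", "eat": "VB"}

def tag(tokens):
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]

print(tag(["Book", "that", "flight"]))
# [('Book', 'VB'), ('that', 'DT'), ('flight', 'NN')]
```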
Text Classification
News agencies: classify incoming news stories
Search engines: classify queries, e.g. search Google for Obama or LOG313
Identifying spam emails
Routing emails or documents to appropriate people
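A bare-bones flavour of text classification can be sketched as a bag-of-words scorer. The training snippets and labels below are made up; a real system would use a proper model (e.g. Naive Bayes) over a large corpus:

```python
from collections import Counter

# Toy classifier: score a message by how often its words occur in each
# class's (tiny, invented) training vocabulary; pick the higher score.
train = {
    "spam": "win cash now free prize win".split(),
    "ham":  "meeting agenda attached see notes".split(),
}
counts = {label: Counter(words) for label, words in train.items()}

def classify(message):
    scores = {label: sum(c[w] for w in message.lower().split())
              for label, c in counts.items()}
    return max(scores, key=scores.get)

print(classify("Free cash prize"))  # spam
print(classify("See the agenda"))   # ham
```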
Terminology Extraction
It is becoming increasingly difficult for dictionary writers to keep track of all the new terminology introduced in scientific areas, e.g.:
- the C4 carbons of the nicotinamide rings
- X-ray therapy vs. X-ray therapies
- PAHO vs. Pan-American Health Organization
TERM EXTRACTION systems can process large numbers of scientific papers and publications and identify terminology, possibly comparing it with a previous list
Information Retrieval / Web Search / Question-Answering
Google, Yahoo!, Bing, Yandex, Baidu, ...
Wolfram Alpha question-answering (QA) system:
http://www.wolframalpha.com
evi (formerly TrueKnowledge):
http://www.evi.com
START QA system:
http://www.ai.mit.edu/projects/infolab/
This region is a hub of activity in search! E.g. have a look at Enterprise Search Meetup London and Enterprise Search Meetup Cambridge
Part III: More Advanced NL Technology
This goes from technologies which have been used for years but still have problems, to technologies for which only prototypes have been developed, e.g.:
- Generation
- Machine translation
- Speech recognition
- Spoken dialogue interfaces
Text Mining (Introduction II) 1
Text Mining in Digital Collections - Introduction (Part 2)
Linguistic Essentials; Common Processing Pipelines
Levels of Linguistic Analysis
Word level: parts of speech: DOG, EAT, RED
Sub-word level: phonetics, phonology, morphology
Phrase level (syntax): THE RED DOG, CUTTING CORNERS, BY
Semantics: e.g., lexical semantics: COURSE / MODULE, DOG / ANIMAL
Discourse: e.g., anaphora: MOST STUDENTS SETTING OFF ON GAP YEAR TRIPS WILL NEED THEIR MONEY AND POSSESSIONS TO FIT SECURELY AND COMPACTLY INTO A RUCKSACK AND MONEYBELT
Words: Parts of speech
Words belong to different classes:
- The sad / sheep / *run / *always / *most dog barked
Basic PARTS OF SPEECH:
- NOUN: dog, man, car, law
- ADJECTIVE: red, fat, brave
- VERB: run, barked
Best-known set of part-of-speech TAGS: Brown TAGS
- NN for nouns
- VB for verb base forms
- JJ for adjectives in positive form
Notice: many words belong to more than one class
Open and closed classes
Levels of linguistic processing: the basic pipeline of an NLP system (e.g., GATE)
PREPROCESSING → LEXICAL PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING
Example: processing a query to a web search engine
PREPROCESSING → LEXICAL PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → WEB ACCESS

Example query: List the estate agents in Stratford, London.
Steps involved: TOKENIZATION, POS TAGGING, TERM IDENTIFICATION, STOP WORDS, SYNONYMS
More advanced examples: Information Extraction Systems (e.g., LASIE)
PREPROCESSING → LEXICAL PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING
In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share
Late last night the company announced a growth of 20%.
Preprocessing, I: tokenizing
<P><W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W> <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> <W C='W'>$</W><W C='CD'>102</W> <W C='W'>million</W><W C='CM'>,</W> <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W> <W C='W'>share.</W></P>
<P><W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W> <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W> <W C='W'>of</W> <W C='CD'>20</W><W C='W'>%</W><W C='.'>.</W></P>
PARAGRAPH MARKUP; TOKENIZER
Preprocessing, II: sentence splitting
<P>
<S> <W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W> <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> <W C='W'>$</W><W C='CD'>102</W> <W C='W'>million</W><W C='CM'>,</W> <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W> <W C='W'>share</W> <W C='.'>.</W></S>
</P>
<P>
<S> <W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W> <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W> <W C='W'>of</W> <W C='CD'>20</W><W C='W'>%</W> <W C='.'>.</W> </S> </P>
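A rough tokenizer in the spirit of the pipeline above can be written with a regular expression plus an abbreviation list. This is a heuristic sketch, not the actual GATE/LASIE tokenizer, and the abbreviation set is invented:

```python
import re

# Split off punctuation and currency symbols, but keep known
# abbreviations such as "Corp." as single tokens.
ABBREV = {"Corp.", "Inc.", "Mr.", "Dr."}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in ABBREV:
            tokens.append(chunk)
        else:
            tokens.extend(re.findall(r"\$|\d+|\w+|[^\w\s]", chunk))
    return tokens

print(tokenize("In July 1995 CEG Corp. posted net of $102 million."))
# ['In', 'July', '1995', 'CEG', 'Corp.', 'posted', 'net', 'of',
#  '$', '102', 'million', '.']
```

Note how "Corp." survives intact while the sentence-final period becomes its own token, which is exactly what the sentence splitter needs.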
Lexical Processing, I: POS tagging
Lexical Processing, II: lemmatizing / stemming
An example of practical (partial) Parsing: Identifying numerical expressions
<NUMEX><W C='CD'>102</W></NUMEX> <W C='CM'>,</W>
An example of practical semantic processing: identifying semantic type
An example of discourse processing: resolving anaphoric references
In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share
Late last night the company announced a growth of 20%.
Basic Pre-processing; Regular Expressions
The basic tasks in text processing
TOKENIZATION: identify tokens in text
WORD COUNTING: count words and their frequencies
SEARCHING FOR WORDS
NORMALIZATION:
- UDO KRUSCHWITZ, udo kruschwitz, udo krushwitz → Udo Kruschwitz
- Feb 6, 6th of February, 06/02/2013
STEMMING
Regular Expressions: a formalism for expressing search patterns

Because matching is a very common problem, over the years computer scientists have identified a set of patterns that:
1. are very common
2. can be searched for efficiently
The language of REGULAR EXPRESSIONS has been developed to characterize these patterns.
Web search tools / software systems (awk, sed, emacs) allow users to use regular expressions to specify what they are searching for; these REs are then compiled into efficient code. You do not need to write the code yourselves!
Types of regular expressions
Disjunction:
- /centre|center/
- /accomodation|accommodation/
Also:
- /[Cc]entre/
- /accomm?odation/
Repetitions:
- +: any number greater than 0
  /YES+!/ matches YES!, YESS!, YESSS!
  E.g., any binary number: /[01]+/
- *: 0 or more
  /ab*/ matches a, ab, abb, abbb
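These repetition operators behave exactly as described when checked in Python (re.fullmatch requires the whole string to match):

```python
import re

# + means "one or more", * means "zero or more"
assert re.fullmatch(r"YES+!", "YES!") and re.fullmatch(r"YES+!", "YESSS!")
assert re.fullmatch(r"[01]+", "10110")                      # a binary number
assert re.fullmatch(r"ab*", "a") and re.fullmatch(r"ab*", "abbb")
assert not re.fullmatch(r"ab*", "ba")
print("all repetition patterns behave as on the slide")
```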
Applications of more complex REs
Web pages about Centres and Centers: /[Cc]entre|[Cc]enter/
Regular expression to validate phone numbers:
- (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
- But not: (44+)020-12341234, 12341234(+020)
- ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
Validating email addresses:
- [email protected], [email protected], [email protected]
- But not: asmith, @mactech.com, a@a
- ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
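The email pattern (with the quantifier braces restored) can be exercised in Python. The addresses below are made-up examples standing in for the redacted ones on the slide:

```python
import re

# The slide's email-validation pattern, reconstructed with {m,n} braces
EMAIL = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@"
    r"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))"
    r"([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$"
)

for addr in ["alice@example.com", "a.smith@sub.example.org",
             "asmith", "@mactech.com", "a@a"]:
    print(addr, "->", bool(EMAIL.match(addr)))
# the first two validate, the last three are rejected
```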
REs in action (linked to our own research)
Use of regular expressions as part of a processing pipeline to process query logs of a digital library (in this case: The European Library).
Why? We want to automatically learn what query modification suggestions to propose to (future) users of the library search engine, e.g.:
1889115 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("mozart") xxxx search_url xxxx xxxx 0 xxxx - xxxx xxxx xxxx 2008-06-24 22:02:52
1889118 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("mozart") xxxx view_full xxxx xxxx 1 xxxx xxxx xxxx xxxx 2008-06-24 22:03:03
1889120 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx klavierkonzerte xxxx search_res_rec_all xxxx xxxx 0 xxxx - xxxx xxxx xxxx 2008-06-24 22:03:55
1889121 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("klavierkonzerte") xxxx view_full xxxx xxxx 1 xxxx xxxx xxxx xxxx 2008-06-24 22:04:10
Here comes (part of) the processing pipeline:

more $CleanedFile | gawk 'BEGIN {FS = " xxxx "} {if (($5 == "en") && ($4 != "null") && ($6 !~ /^[ "(]*(test|toto|a)[ ")]*$/) && ($7 ~ /^search_/) && ($7 !~ /^search_adv/)) print $4 " xxxx " $1 " xxxx " $6 " xxxx " $13}' | sort -k 1,1 -k 3,3n | gawk 'BEGIN {FS = " xxxx "} {if ((OLD_ID == $1) && (OLD_QUERY != $3)) {print OLD_LINE; PRINT = "on"}; if ((OLD_ID != $1) && (PRINT == "on")) {print OLD_LINE; PRINT = "off"}; OLD_ID = $1; split($0, a, " xxxx "); OLD_QUERY = a[3]; OLD_LINE = $0}' > $SessionQueries
This is what I get:
8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889115 xxxx ("mozart") xxxx 2008-06-24 22:02:52
8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889120 xxxx klavierkonzerte xxxx 2008-06-24 22:03:55
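The first gawk stage (keep English, non-null-session search actions and emit session, id, query and timestamp) can be sketched in Python. This is a loose sketch, not a faithful port: the junk-query regex of the original filter is omitted, and the field positions follow the log sample above:

```python
SEP = " xxxx "

def filter_searches(lines):
    # Fields (0-based): 0 id, 3 session, 4 language, 5 query,
    # 6 action, 12 timestamp -- matching the sample log records.
    kept = []
    for line in lines:
        f = line.split(SEP)
        if len(f) < 13:
            continue
        if (f[4] == "en" and f[3] != "null"
                and f[6].startswith("search_")
                and not f[6].startswith("search_adv")):
            kept.append(SEP.join([f[3], f[0], f[5], f[12]]))
    return kept

sample = SEP.join(["1889115", "guest", "71.249.xxx.xxx",
                   "8eb3bdv3odg9jncd71u0s2aff6", "en", '("mozart")',
                   "search_url", "", "0", "-", "", "", "2008-06-24 22:02:52"])
print(filter_searches([sample]))
```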
Basic Ideas of Statistical Approaches
Statistical NLE - Motivation
So far: purely symbolic approaches. But instead of hand-coding rules we could derive knowledge from text corpora automatically. Sample applications:
- Spell checking
- Handwriting recognition
Example: How can Google predict the correct spelling of words?
Statistical Methods in NLE
Two characteristics of NL make it desirable to endow programs with the ability to LEARN from examples of past use:
- VARIETY (no programmer can really take into account all possibilities)
- AMBIGUITY (need to have ways of choosing between alternatives)
In a number of NLE applications, statistical methods are very common. The simplest application: WORD PREDICTION
We are good at word prediction
Stocks plunged this morning, despite a cut in interest ...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
Statistics and word prediction
The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of a spelling error. I.e., to compute, for all words w,

P(w | w_1 ... w_{N-1})

and predict as next word the one for which this (conditional) probability is highest.
Using corpora to estimate probabilities
But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY. The simplest method: Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered in the corpus:

P(w_1 ... w_n) = C(w_1 ... w_n) / N

where C(w_1 ... w_n) is the number of times the sequence occurs in the corpus and N is the total count.
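Counting-based prediction is a few lines of Python. A minimal MLE bigram model over a tiny made-up corpus (echoing the "Stocks plunged ..." example):

```python
from collections import Counter, defaultdict

# MLE bigram model: P(next | prev) = C(prev, next) / C(prev)
corpus = ("stocks plunged this morning despite a cut in interest rates "
          "by the federal reserve as wall street began trading").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict(prev):
    counts = bigrams[prev]
    word, c = counts.most_common(1)[0]
    return word, c / sum(counts.values())

print(predict("interest"))  # ('rates', 1.0)
```

With a corpus this small every estimate is degenerate; real models are trained on millions of words and smoothed.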
Summary: Statistical Approaches
- Statistical methods typically need an annotated corpus (Maha)
- Methods can simply rely on counting (and a bit of smoothing)
- Same principle for any of your applications
- Very powerful! Use existing tools! (Massimo, Maha)
Text Mining in Digital Collections - Introduction (Part 3)
Text Mining (Introduction III) - February 2013 1
Information Retrieval
Motivation:
- Large natural language text collections
- Find most relevant documents for a user query
Why has IR become so popular?
- World Wide Web
- Computer hardware
Information Retrieval: Example 1
Web search queries to http://www.dogpile.com/ (29/11/09):
(1) 0117904900
(2) flowers
(3) www.amazon.com
(4) policies and procedures
(5) site:http://www.procracksworld.ws norman virus comtrol
(6) panthers
(7) why do we yawn
(8) high capacity 6 cell 7.2 volt nicad battery
(9) google
(10) lumberjack song
Information Retrieval: Example 2
Document = Bag of Words
Need to locate “important” information in a document
Deep natural language understanding too expensive
Common assumption: document is a list of words
Documents represented by a set of index terms
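The bag-of-words assumption is easy to make concrete: throw away word order and keep only terms and their counts. The sentence is made up for illustration:

```python
from collections import Counter

# A document reduced to a bag of words: order discarded,
# only index terms and their frequencies remain.
doc = "the quick brown fox jumps over the lazy dog the fox"
bag = Counter(doc.split())

print(bag.most_common(2))  # [('the', 3), ('fox', 2)]
print(sorted(bag))         # the set of index terms
```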
IR - Retrieval Process
1. Define text database, i.e. data sources, operations on text and text structure
2. Process document collection into index database for fast access
3. Parse user query using same text processing operations
4. Apply query operations to processed query
5. Generate system query to retrieve matching documents
6. Rank and display results
IR - Indexing Steps

Motivation:
- Same words in different realisations
- Content words vs. stopwords
- Morphology
- Size of index tables

Typical Indexing Steps:
1. Lexical analysis
2. Deletion of stopwords
3. Stemming
4. Selection of index terms
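The four indexing steps can be chained into one tiny pipeline. The stopword list and the crude suffix rules below are invented for the sketch (not INQUERY's list or the Porter stemmer):

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "this"}

def index_terms(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # 1. lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]  # 2. stopword deletion
    tokens = [t[:-2] if t.endswith("es") else
              t[:-1] if t.endswith("s") else t
              for t in tokens]                          # 3. stemming (crude)
    return sorted(set(tokens))                          # 4. select index terms

print(index_terms("Is this document relevant? Probably the documents are."))
# ['document', 'probably', 'relevant']
```

Note that "document" and "documents" collapse to one index entry, which is the point of the stemming step.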
IR Indexing - Lexical Analysis

Removal of punctuation, e.g.:
- F. Crestani et al. “Is this document relevant? ... probably”. ACM Computing Surveys, 30(4):528-552, 1998.
- But: “i++”
Case folding, e.g.:
- London –> london
- LONDON –> london
Removal of digits, but:
- ECIR 2013
- 3G (or first query on page 330!)
IR Indexing - Stopwords

Very frequent words are not good discriminators
Zipf’s Law: few frequent words make up a large part of a document
Closed class vs. open class words
INQUERY’s stopword list (entries starting with letter a):
a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, along, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, ...
Domain-specific stopwords, e.g. in a Web search application: search, webmaster, www, copyright, ...
But: To be or not to be
IR Indexing - Stemming

Reduce words to word stems
Usually a gain in recall, but degraded precision
Simplest: suffix stripping
Porter stemmer (see book for details):
- develop –> develop
- developing –> develop
- development –> develop
- developments –> develop
But (Porter stemmer):
- photographer –> photograph
- photography –> photographi
Possible problems with language or stemming algorithm: peter, hospitality, hardly
IR Indexing - Selection of Index Terms
Usually nouns
Noun phrase detection, e.g.:
- United States
- Intelligent Document Retrieval
- list of papers
- School of Computer Science and Electronic Engineering
IR - Classic Models
- Boolean model
- Probabilistic model
- Vector space model
IR - Vector Space Model

Most popular of all models (example: SMART)
Documents ranked according to similarity with query
Non-binary weights assigned to index terms (the higher the value, the more important the term)
Query and document are represented as vectors of index terms
Vector algebra allows elegant calculation of similarity: the cosine of the angle between the vectors is a measure of similarity
Advantages: simple mechanism, term weighting, partial matching, ranking of documents as similarity function
Disadvantage: independence assumption, i.e. index terms are assumed to be independent
IR - Vector Space Model: Similarity Measure
Query and document represented as vectors in n-dimensional space, with features representing the terms that occur in them
Value of each feature (the weight) is determined by presence or absence of the term in the document or query
Similarity between two vectors (query and document vector) gives a measure of how closely the query matches the document
IR - Vector Space Model: Similarity Measure (contd.)
Vector algebra: θ is the angle between vectors X and Y:

cos θ = (X · Y) / (|X| |Y|)

Similarity between a query q_k and a document d_j:

sim(q_k, d_j) = Σ_{i=1}^{N} w_{i,k} w_{i,j} / ( sqrt(Σ_{i=1}^{N} w_{i,k}^2) · sqrt(Σ_{i=1}^{N} w_{i,j}^2) )

The result of this measure is a value between 0 and 1
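The cosine measure is a direct translation of the formula into code; the two example vectors are made up:

```python
import math

def cosine(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = (math.sqrt(sum(a * a for a in x))
            * math.sqrt(sum(b * b for b in y)))
    return dot / norm if norm else 0.0

q = [1, 1, 0]   # query vector over three index terms
d = [2, 0, 1]   # document vector over the same terms
print(round(cosine(q, d), 3))  # 0.632
```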
IR - Vector Space Model: Term Weighting

Infrequent terms should get a higher weight than frequent ones
Weight should reflect the number of occurrences within a document
The weight of each term in a document is calculated with the tf.idf formula (term frequency - inverse document frequency):

tfidf_{i,k} = f_{i,k} * log(N / df_i)

Term i in document k gets value tfidf_{i,k} in a collection of N documents, with f_{i,k} the frequency of term i in document k and df_i the number of documents containing term i
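Worked numerically, the formula looks like this (the slide does not fix a logarithm base; the natural log is used here, and the counts are invented):

```python
import math

def tfidf(f_ik, df_i, N):
    """tf.idf as in the formula: f_{i,k} * log(N / df_i).
    Log base is a convention; natural log here."""
    return f_ik * math.log(N / df_i)

# a term occurring 3 times in the document, in 10 of 1000 documents
print(round(tfidf(3, 10, 1000), 2))  # 13.82

# a term occurring in every document gets weight 0
print(tfidf(3, 1000, 1000))  # 0.0
```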
IR - Evaluation
The four outcomes of retrieval (Target vs. Retrieved):
- tp = true positives (relevant and retrieved)
- fp = false positives (retrieved but not relevant)
- fn = false negatives (relevant but not retrieved)
- tn = true negatives (neither)

Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
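The two measures in code, with invented counts as a worked example:

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that were retrieved."""
    return tp / (tp + fn)

# e.g. 30 relevant documents retrieved, 10 irrelevant retrieved,
# and 20 relevant documents missed:
print(precision(30, 10), recall(30, 20))  # 0.75 0.6
```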
Example IR Systems
Apache Solr (based on Lucene): http://lucene.apache.org/solr/
Nutch: open source (Lucene-based) search engine, very simple! http://nutch.apache.org/
Information Retrieval vs. Information Extraction
IE systems are normally:
- very knowledge-intensive
- difficult to build
- domain dependent
- computationally more expensive than IR
IR systems are normally:
- easy to build
- very robust
- domain independent
- very quick
- not trying deep language understanding
Question Answering
Idea is to find answers, not documents
Typical sequence of steps (see Radev et al. 2005):
- Question type recognition (using the Brill POS tagger ...)
- Document retrieval (using a standard search engine API)
- Passage/sentence retrieval (using n-grams as well as the vector space model)
- Answer extraction (using chunking)
- Phrase/answer ranking (using probabilities derived from POS signatures that identify with a particular question/answer type)
Example: IBM’s Watson project
References
D. Jurafsky and J. Martin: “Speech and Language Processing”, Pearson, 2008. http://www.cs.colorado.edu/~martin/slp.html
Good repository of information about NLE/HLT (“Human Language Technologies”): http://www.lt-world.org/
Starting point to explore tools: http://nlp.stanford.edu/software/index.shtml
http://en.wikipedia.org/wiki/Outline_of_natural_language_processing
Some of the slides were prepared by colleagues elsewhere