7/29/2019 NLP Research at Internet Age
1/42
NLP Research at Internet Age
An Overview of NLP at Microsoft Research Asia
Ming Zhou
Manager of Natural Language Group
Microsoft Research Asia
Trends of Internet Services
  Ecosystem to work with third-party apps
    Apple Apps, Facebook, Twitter, Baidu, Sina, QQ
  Real-time content collection and search
    Twitter, Facebook, Del.icio.us, NYT, YouTube
  Mobile search
    Contextual intent understanding
    Towards decision making and action taking
  Social power
    Social tags (like) for general search engines
    Search engines in SNS
    Social QA
Impact and Challenge to NLP Research
Impact
  Biggest database ever connects data
  Biggest social network connects people
  Harnessing collective intelligence
  Contextual information processing: user, user's social network, location, time
  Real-time information processing: collection, indexing, operation without delay
Challenge
  How to leverage data, people, and contextual information to reach real-time information processing?
Problems of Traditional NLP Approaches (NLP 1.0)
  Deep in individual component technologies, but these have reached their upper bounds
  Little consideration of scenarios, user needs, and market needs
  Serious data sparseness with human annotation
  Evaluation bottleneck
  Slow deployment
  Lack of an effective framework to incorporate user feedback
New Strategy of NLP (NLP 2.0)
  Data collection from the web
  Domain-specific and open IE
  Contextual NLP
  Maximize at the system level, not at the level of the individual component
  Earlier deployment on the Internet
  Make best use of social factors
Our Vision and Task
  Advanced NLP technologies
    Word breaker, POS tagging, chunking, syntactic parser, semantic role labeling, speller, query suggestion, summarization
    Chinese, Japanese, English
  Multi-language information access
    Statistical machine translation
    Multi-language search
  Semantic computing
    Sentiment analysis, event extraction, ontology learning
    Understanding query intent and document
  Contextual NLP
  Understand user and document in any language, for any device and any application
MSRA NLP Research Overview
Component techs
  Text analysis: skeleton parser, named entity identification, POS tagging, SLM
  Machine translation: translation evaluation, translation knowledge acquisition, web mining for MT, SMT
  Information extraction: annotation tool, machine learning, term extraction
  Information retrieval: paraphrasing, vertical search, cross-language IR, NLP-enriched indexing and search, query-doc relevance, text mining
Data
  Areas: NLP (C, J, E), MT (C, J, E), IR and IE (C, J, E)
  Resources: MRD, translation lexicon, bilingual corpus, bilingual tagged corpus, parsing lexicon, tagged corpus, balanced corpus
Applications
  Chinese IME, Japanese IME, query speller, English writing wizard, news search, Twitter search, pocket translator, metadata extraction, couplet generation, resume routing, general web search, chatbot, comparison shopping
Research Accomplishment
  Awards
    MSRA Best Research Team (2010)
    Finalist of WSJ Asian Innovation Awards (2010)
    MS ARD Best Project (Engkoo)
    MSRA Best Innovation (1998-2008): IME and Chinese couplets
  Academic impact
    Best result in NIST 2008 SMT, CWMT 2008 and CWMT 2009
    Best result in SIGHAN 2006 bake-off on Chinese word segmentation
    Best result in cross-language information retrieval in TREC-9, NTCIR-III
    40 ACL papers, 9 SIGIR papers, 17 Coling papers (2000-2010)
    PC chair, area chair of ACL
  Collaboration with universities
    HIT joint lab on NLP, Speech and Search; Tsinghua joint lab on Media and Network
    400 interns in 12 years
    Summer schools since 2001
    PhD supervisors at universities
Summer School on Information Extraction (Harbin, June 2005)
  Cheng Niu: Information extraction
  Frank Seide: Speech information extraction and search
  Hwee Tou Ng: Advanced topics of information extraction
  Chin-Yew Lin: Information extraction for automatic summarization
Projects based on NLP 2.0
  Engkoo: Web-based English learning service
    Data mining from the web
  Chinese couplets
    Include users' power in system evolution
  Semantic analysis and search of micro-blogging
    Move to SNS, mobile
Engkoo
Parallel data mining from the web
Video: http://video.sina.com.cn/v/b/37417609-1286528122.html
Rapidly Changing Language
  Approximately 1.5 billion people speak English as a primary, secondary or business language
  China: the largest English-speaking country, with 250 million English learners and USD 60 billion in annual expenses
  Problem: a living language, with new words and new meanings
  Key insight: With billions of translated web pages and sharable repositories of language data growing every day, the Internet holds the sum of human language knowledge
www.engkoo.com
Major Features:
  Endless Lexicon with Native Definitions
  State-of-the-Art Machine Translation (NIST OpenMT Winner)
  Real-time Interactive Alignment
  Human-Like TTS & Phonetic Search
Microsoft Products:
  Bing
  Office
  MSN
Massive Dictionary Mined from the Web
Fresh and Diverse Examples
Advanced Search with Sentence Analysis
Sentence Classification
Learn Contextual Usage with Word Alignment
Hints for Easily Confused Words
Knowledge Mining Pipeline
Stages and their outputs
  Web mining: mined data
  Linguistic parsing: parsed data
  Knowledge mining: linguistic knowledge
  Multi-level indexing: indexed data
Models: machine translation model, paraphrasing model
Linguistic parsing example
  Parallel sentence: He could hardly afford to waste that golden time.
  Tokenizing: he could hardly afford to waste that golden time.
  Skeleton parsing: (Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time)
  (Tsub~~) (ModAdv~~) (Tobj~~) (AdjAttrib~~)
  Alignment: he() could hardly afford to() waste() that() golden() time()
Knowledge mining output
  1. Idiomatic word usage
     Verb~Noun (decline~offer)
     Verb~Adv (greatly~improve)
     Adj~Noun (arduous~task)
     Adv~Adj (extremely~bad)
  2. Paraphrasing
     turn_on~light, switch_on~light
     laborious~task, hard~task
     deeply~moved, deeply~touched
  3. Collocation translations
     ~, make~plan
     ~, book~room
     ~, subscribe to ~magazine
Multi-level indexing
  1. Single word
     he, could, hardly, afford etc.
     , , etc.
  2. Single word with its POS
     he_Pron, could_Verb, hardly_Adv etc.
     _Pron, _Adv, _Verb etc.
  3. Collocation
     Tsub~he~afford, Tobj~time~waste etc.
     Tsub~~, ModAdv~~ etc.
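The three index levels listed above (single word, word with its POS, collocation) can be illustrated with a small inverted index. This is only a minimal sketch, not the Engkoo implementation: the function names `index_keys` and `build_index` are hypothetical, and the POS tags other than he_Pron, could_Verb and hardly_Adv are assumptions made for the example. The dependency triples are the ones shown for the English sentence.

```python
from collections import defaultdict


def index_keys(tokens, pos_tags, triples):
    """Return the index keys of one sentence at three levels."""
    keys = set()
    for word, pos in zip(tokens, pos_tags):
        keys.add(word)                    # level 1: single word
        keys.add(f"{word}_{pos}")         # level 2: single word with its POS
    for relation, dependent, head in triples:
        keys.add(f"{relation}~{dependent}~{head}")  # level 3: collocation
    return keys


def build_index(sentences):
    """Map every key to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sentence_id, (tokens, pos_tags, triples) in enumerate(sentences):
        for key in index_keys(tokens, pos_tags, triples):
            index[key].add(sentence_id)
    return index


tokens = ["he", "could", "hardly", "afford", "to", "waste", "that", "golden", "time"]
# Only the first three tags come from the slide; the rest are assumed for illustration.
pos_tags = ["Pron", "Verb", "Adv", "Verb", "Prep", "Verb", "Det", "Adj", "Noun"]
triples = [
    ("Tsub", "he", "afford"),
    ("ModAdv", "hardly", "afford"),
    ("Tobj", "waste", "afford"),
    ("Tobj", "time", "waste"),
    ("AdjAttrib", "golden", "time"),
]

index = build_index([(tokens, pos_tags, triples)])
print(sorted(index["Tobj~time~waste"]))
```

Keeping all three levels in one key space means a query such as `Tobj~time~waste` or `hardly_Adv` is answered by the same lookup as a plain word query.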
Chinese Couplets
Include users' power in system evolution
Chinese Couplets (http://duilian.msra.cn)
http://video.sina.com.cn/v/b/10937201-1452530713.html
FS (first sentence) and SS (second sentence) Share the Same Style
  (wind) -- (water)
  (blow) -- (make)
  (buckwheat) -- (ship)
  (wave) -- (go)
  (bridge) -- (island)
  (not) -- (not)
  (wave) -- (go)
Repetition of pronunciations
FS and SS Share the Same Style
  (have) -- (lack)
  (son) -- (fish)
  (have) -- (lack)
  (daughter) -- (mutton)
  (so) -- (dare)
  (call) -- (call)
  (good) -- (fresh)
Decomposition of characters
FS and SS Share the Same Style
  (Banqiao) -- (Dongpo)
  (produce) -- (live)
  (bridge) -- (mountain)
  (board) -- (east)
Person name: Banqiao and Dongpo are famous litterateurs
Palindrome: reading from top to bottom is identical to reading from bottom to top
SS Generation Process
  FS: Sea wide allow fish jump
  Steps: SMT decoding, linguistic filtering, reranking
  Candidate words: hill, sky; high, deep; permit, depend; insect, bird, tiger; fly, dance, tweedle
  Candidate phrases: sky high, hill high, bird fly, tiger roar
SS Generation Approach
A multi-phase SMT approach
  Phase 1: a phrase-based log-linear model
  Phase 2: some linguistic filters
  Phase 3: a Ranking SVM
Flow: FS input -> phrase-based log-linear model -> N-best candidates -> linguistic filters -> Ranking SVM model -> SS output
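The three phases can be sketched end to end as follows. This is a hedged illustration, not the MSRA system: the function names, the feature names (`tm`, `lm`, `mi`) and all feature values and weights are made up for the example, the only filter shown is a length and repetition-pattern check, and phase 3 uses a fixed linear scorer as a stand-in for a trained Ranking SVM. English glosses are used in place of Chinese characters.

```python
def log_linear_score(features, weights):
    """Weighted sum of feature values: score = sum_i w_i * h_i."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def repetition_pattern(words):
    """Map each word to the position of its first occurrence, e.g. a b a -> (0, 1, 0)."""
    first_seen = {}
    return tuple(first_seen.setdefault(word, i) for i, word in enumerate(words))


def passes_filters(fs, ss):
    """Phase 2: SS must match FS in length and in repetition pattern."""
    return len(fs) == len(ss) and repetition_pattern(fs) == repetition_pattern(ss)


def generate_ss(fs, candidates, decoder_weights, ranker_weights, n_best=3):
    # Phase 1: keep the N-best candidates under the log-linear model.
    by_decoder = sorted(
        candidates,
        key=lambda c: log_linear_score(c["features"], decoder_weights),
        reverse=True,
    )[:n_best]
    # Phase 2: drop candidates that violate the linguistic filters.
    kept = [c for c in by_decoder if passes_filters(fs, c["words"])]
    # Phase 3: rerank the survivors (linear stand-in for a Ranking SVM).
    return sorted(
        kept,
        key=lambda c: log_linear_score(c["features"], ranker_weights),
        reverse=True,
    )


fs = ["sea", "wide", "allow", "fish", "jump"]
# Illustrative candidates; feature values are invented for the example.
candidates = [
    {"words": ["sky", "high", "permit", "bird", "fly"],
     "features": {"tm": -1.0, "lm": -2.0, "mi": 0.9}},
    {"words": ["hill", "high", "depend", "tiger", "roar"],
     "features": {"tm": -1.5, "lm": -2.5, "mi": 0.4}},
    {"words": ["sky", "sky", "permit", "bird", "fly"],
     "features": {"tm": -0.8, "lm": -1.5, "mi": 0.2}},
    {"words": ["deep", "high", "permit", "insect", "dance"],
     "features": {"tm": -3.0, "lm": -4.0, "mi": 0.1}},
]
decoder_weights = {"tm": 1.0, "lm": 0.5}
ranker_weights = {"tm": 1.0, "lm": 0.5, "mi": 2.0}

ranked = generate_ss(fs, candidates, decoder_weights, ranker_weights)
for candidate in ranked:
    print(" ".join(candidate["words"]))
```

In this toy run the candidate with the repeated word scores best in phase 1 but is removed in phase 2, which is the reason for placing the filters between decoding and reranking.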
Great Examples
User Log for Model Enhancement
  Motivation
    Training data is not adequate
    The user log is big (60k/m), increasing, and diverse
  What logs we record
    User inputs
    User-finalized couplets
      Second sentences selected from the candidates provided by our system
      User-modified second sentences
User Log Analysis
Data source: log from http://couplet.msra.cn
Time period: Aug. 31 - Oct. 9, 2006
Item                                          Count
Number of input sentences                     12,322
Number of unique input sentences              6,698
Users directly select from system output      3,459
Users manually modify system output           606
Save as favorite couplets                     109
Invalid user input                            618
No second sentence generated                  2,211
Banner generation                             2,687
Select the generated banner as favorite       428
No banner output                              265
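To read the counts above as proportions, a short calculation helps. The sketch below assumes that every row can be expressed as a share of the total number of input sentences (12,322); the slide does not state the denominators, so this is an interpretation rather than a reported result. The dictionary keys are hypothetical labels for the table rows.

```python
# Counts copied from the log table; key names are illustrative labels.
log_counts = {
    "input_sentences": 12322,
    "unique_input_sentences": 6698,
    "selected_from_system_output": 3459,
    "manually_modified_output": 606,
    "invalid_input": 618,
    "no_second_sentence": 2211,
}


def share(count, total):
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100.0 * count / total, 1)


total = log_counts["input_sentences"]  # assumed denominator for every row
rates = {
    name: share(count, total)
    for name, count in log_counts.items()
    if name != "input_sentences"
}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```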
New Framework with Log Data
Flow: first sentence input -> source-channel model -> N-best candidates -> re-ranking -> second sentence output
Source-channel model (from training data): translation model, language model, mutual information
Re-ranking (from log data): translation model, language model, mutual information, user operation
Twitter Search
Move to social internet and mobile
Tweets (raw data) -> noise filtering
Individual tweet
  Text normalization, sentence boundary detection, classification, NE recognition, dependency parsing, co-reference, semantic role labeling, sentiment analysis
A collection of tweets
  Tweets cluster, statistical relationship learning, news & images link extraction, community extraction, user influence measure, hot tag and topic extraction, popular tweet extraction, top video, music, artists extraction
Multi-level indexing
Semantic search
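As a small illustration of how per-tweet steps feed a collection-level step, the sketch below chains noise filtering and text normalization into hot tag extraction. It is an assumption-laden toy, not the MSRA pipeline: the noise rule (drop tweets shorter than three words), the normalization (lowercasing and whitespace collapsing), the function names and the sample tweets are all invented for the example.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def normalize(text):
    """Text normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_noise(text, min_words=3):
    """Noise filtering: treat very short tweets as noise (illustrative rule)."""
    return len(text.split()) < min_words


def hot_tags(tweets, top_n=2):
    """Count each hashtag once per non-noise tweet and return the most frequent."""
    counts = Counter()
    for tweet in tweets:
        clean = normalize(tweet)
        if is_noise(clean):
            continue
        counts.update(set(HASHTAG.findall(clean)))
    return counts.most_common(top_n)


# Invented sample data for the example.
tweets = [
    "Loving the new #NLP demo at #ACL",
    "#nlp   is everywhere today",
    "#spam",
    "Reading about #NLP and #search tonight",
    "Great talk on #search at the summer school",
]

print(hot_tags(tweets))
```

Normalizing before counting is what lets `#NLP` and `#nlp` contribute to the same tag.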
Conclusion
  Internet trends and impacts on NLP
  NLP 2.0 strategy
    Web data mining: Engkoo
    Users' power: Couplets
    SNS and mobile: Twitter search