Introduction to Natural Language Processing Steven Bird Ewan Klein Edward Loper University of Melbourne, AUSTRALIA University of Edinburgh, UK University of Pennsylvania, USA August 27, 2008


Introduction to Natural Language Processing

Steven Bird Ewan Klein Edward Loper

University of Melbourne, AUSTRALIA

University of Edinburgh, UK

University of Pennsylvania, USA

August 27, 2008


Knowledge and Communication in Language

• human knowledge, human communication, expressed in language
• language technologies: process human language automatically
  • handheld devices: predictive text, handwriting recognition
  • web search engines: access to information locked up in text
• two facets of the multilingual information society:
  • natural human-machine interfaces
  • access to stored information


Problem

• awash with language data
• inadequate tools (will this ever change?)
• overheads: Perl, Prolog, Java
• Natural Language Toolkit (NLTK) as a solution


NLTK: What you get...

• Book
• Documentation
• FAQ
• Installation instructions for Python, NLTK, data
• Distributions: Windows, Mac OSX, Unix, data, documentation
• CD-ROM: Python, NLTK, documentation, third-party libraries for numerical processing and visualization, instructions
• Mailing lists: nltk-announce, nltk-devel, nltk-users, nltk-portuguese


NLTK: Who it is for...

• people who want to learn how to:
  • write programs
  • to analyze written language
• does not presume programming abilities:
  • working examples
  • graded exercises
• experienced programmers:
  • quickly learn Python (if necessary)
  • Python features for NLP
  • NLP algorithms and data structures


NLTK: What you will learn...

1 how to analyze language data
2 key concepts from linguistic description and analysis
3 how linguistic knowledge is used in NLP components
4 data structures and algorithms used in NLP and linguistic data management
5 standard corpora and their use in formal evaluation
6 organization of the field of NLP
7 skills in Python programming for NLP


NLTK: Your likely goals...

| Goals | Background: Arts and Humanities | Background: Science and Engineering |
| Language Analysis | Programming to manage language data, explore linguistic models, and test empirical claims | Language as a source of interesting problems in data modeling, data mining, and knowledge discovery |
| Language Technology | Learning to program, with applications to familiar problems, to work in language technology or other technical field | Knowledge of linguistic algorithms and data structures for high quality, maintainable language processing software |


Philosophy

• practical
• programming
• principled
• pragmatic
• pleasurable
• portal


Structure

• Three parts:
  1 Basics: text processing, tokenization, tagging, lexicons, language engineering, text classification
  2 Parsing: phrase structure, trees, grammars, chunking, parsing
  3 Advanced Topics: selected topics in greater depth: feature-based grammar, unification, semantics, linguistic data management
• each part: chapter on programming; three chapters on NLP
• each chapter: motivation, sections, graded exercises, summary, further reading


Python: Key Features

• simple yet powerful, shallow learning curve
• object-oriented: encapsulation, re-use
• scripting language, facilitates interactive exploration
• excellent functionality for processing linguistic data
• extensive standard library, incl. graphics, web, numerical processing
• downloaded for free from http://www.python.org/
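As an illustration of the "functionality for processing linguistic data" point, here is a minimal sketch of a word-frequency count using only the standard library. The function name `word_frequencies` and the sample sentence are made up for this example; they are not taken from the slides or from NLTK.

```python
from collections import Counter


def word_frequencies(text):
    """Count lowercased, whitespace-separated words in a string.

    Hypothetical helper for illustration only.
    """
    # str.lower and str.split are built-in string methods;
    # Counter tallies how often each word occurs.
    return Counter(text.lower().split())


sample = "the cat sat on the mat and the dog sat too"
freqs = word_frequencies(sample)

# Show the three most frequent words with their counts.
print(freqs.most_common(3))
```

The same few lines can be typed one at a time at the interactive prompt, which is the kind of exploration the slide refers to.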


Python Example

    import sys
    # print every word on standard input that ends in 'ing'
    for line in sys.stdin.readlines():
        for word in line.split():
            if word.endswith('ing'):
                print(word)

1 whitespace: nesting lines of code; scope
2 object-oriented: attributes, methods (e.g. line)
3 readable
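The same loop can be wrapped in a function so that it can be tried without piping text to standard input. This is only a sketch: the name `words_ending_with`, its `suffix` parameter, and the sample lines are assumptions made for this example.

```python
def words_ending_with(lines, suffix="ing"):
    """Yield each whitespace-separated word that ends with `suffix`.

    Hypothetical helper; `lines` can be any iterable of strings,
    such as a list or an open file.
    """
    for line in lines:
        for word in line.split():
            if word.endswith(suffix):
                yield word


sample_lines = ["I was singing and dancing", "while the king slept"]
ing_words = list(words_ending_with(sample_lines))
print(ing_words)  # ['singing', 'dancing', 'king']
```

Note that "king" is matched too: `endswith` is a plain string test, not a morphological analysis.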


Comparison with Perl

    while (<>) {
        foreach my $word (split) {
            if ($word =~ /ing$/) {
                print "$word\n";
            }
        }
    }

1 syntax is obscure: what are: <> $ my split ?
2 “it is quite easy in Perl to write programs that simply look like raving gibberish, even to experienced Perl programmers” (Hammond, Perl Programming for Linguists, 2003:47)
3 large programs difficult to maintain, reuse


What NLTK adds to Python

NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides:

• Basic classes for representing data relevant to natural language processing
• Standard interfaces for performing tasks, such as tokenization, tagging, and parsing
• Standard implementations for each task, which can be combined to solve complex problems
• Demonstrations (parsers, chunkers, chatbots)
• Extensive documentation, including tutorials and reference documentation
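The idea of standard interfaces with interchangeable implementations can be sketched in plain Python. Everything below is hypothetical and written for this example: the classes `Tokenizer`, `SpaceTokenizer`, `PatternTokenizer`, `Tagger`, `SuffixTagger`, the function `run_pipeline`, and the toy tag labels are not NLTK's actual API.

```python
import re
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Hypothetical interface: turn a string into a list of tokens."""

    @abstractmethod
    def tokenize(self, text):
        ...


class SpaceTokenizer(Tokenizer):
    """Split on whitespace only; punctuation stays attached to words."""

    def tokenize(self, text):
        return text.split()


class PatternTokenizer(Tokenizer):
    """Split into runs of word characters and single punctuation marks."""

    def __init__(self, pattern=r"\w+|[^\w\s]"):
        self._regex = re.compile(pattern)

    def tokenize(self, text):
        return self._regex.findall(text)


class Tagger(ABC):
    """Hypothetical interface: pair each token with a tag."""

    @abstractmethod
    def tag(self, tokens):
        ...


class SuffixTagger(Tagger):
    """Toy tagger: 'VBG' for tokens ending in 'ing', otherwise 'UNK'."""

    def tag(self, tokens):
        return [(t, "VBG" if t.endswith("ing") else "UNK") for t in tokens]


def run_pipeline(text, tokenizer, tagger):
    """Combine any tokenizer with any tagger that follows the interfaces."""
    return tagger.tag(tokenizer.tokenize(text))


tagged = run_pipeline("Birds are singing.", PatternTokenizer(), SuffixTagger())
print(tagged)
```

Because both tokenizers expose the same `tokenize` method, swapping `PatternTokenizer()` for `SpaceTokenizer()` needs no other change, although the result differs: the final token becomes "singing." and no longer matches the suffix test.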


NLTK Design: Requirements

1 simplicity: intuitive framework with substantial building blocks

2 consistency: uniform data structures and interfaces, for predictability

3 extensibility: accommodates new components (replicate vs extend existing functionality)

4 modularity: interaction between components

5 well-documented: substantial documentation

NLTK Design: Non-requirements

1 encyclopedic: has many gaps; opportunity for students to extend it

2 efficiency: not highly optimised for runtime performance

3 programming tricks: avoid in preference for clear implementations (replicate vs extend existing functionality)

Corpora Distributed with NLTK

• Australian ABC News, 2 genres, 660k words, sentence-segmented
• Brown Corpus, 15 genres, 1.15M words, tagged
• CMU Pronouncing Dictionary, 127k entries
• CoNLL 2000 Chunking Data, 270k words, tagged and chunked
• CoNLL 2002 Named Entity, 700k words, pos- and named-entity-tagged (Dutch, Spanish)
• Floresta Treebank, 9k sentences (Portuguese)
• Genesis Corpus, 6 texts, 200k words, 6 languages
• Gutenberg (sel), 14 texts, 1.7M words
• Indian POS-Tagged Corpus, 60k words pos-tagged (Bangla, Hindi, Marathi, Telugu)
• NIST 1999 Info Extr (sel), 63k words, newswire and named-entity SGML markup
• Names Corpus, 8k male and female names
• PP Attachment Corpus, 28k prepositional phrases, tagged as noun or verb modifiers
• Presidential Addresses, 485k words, formatted text
• Roget’s Thesaurus, 200k words, formatted text
• SEMCOR, 880k words, part-of-speech and sense tagged
• SENSEVAL 2, 600k words, part-of-speech and sense tagged
• Shakespeare XML Corpus (sel), 8 books
• Stopwords Corpus, 2,400 stopwords for 11 languages
• Switchboard Corpus (sel), 36 phonecalls, transcribed, parsed
• Univ Decl Human Rights, 480k words, 300+ languages
• US Pres Addr Corpus, 480k words
• Penn Treebank (sel), 40k words, tagged and parsed
• TIMIT Corpus (sel), audio files and transcripts for 16 speakers
• Wordlist Corpus, 960k words and 20k affixes for 8 languages
• WordNet, 145k synonym sets