Large scale NLP using Python's NLTK on Azure


Beat Schwegler
head in the cloud, feet on the ground

Twitter: @cloudbeatsch Blog: http://cloudbeatsch.com

large scale NLP using python's NLTK on Azure

I saw Mr. Washington with a saw!
I saw Mr. Washington.
This is your saw… I told you!
Is this really a chainsaw?

fundamentals of nlp

natural language toolkit (nltk)

running python and nltk on Azure

source: http://www.nltk.org/book_1ed/ch01.html

simple pipeline architecture for a spoken dialogue system

dialogue with a chatbot

identify language
tokenize & tag part of speech (POS)
identify named entities

corpora and lexical resources
a corpus is a large body of text
a lexical resource is a collection of words associated with additional information

e.g. Brown Corpus: first million-word electronic corpus of English, created in 1961 at Brown University
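A quick way to poke at the Brown corpus once its data package is downloaded (a minimal sketch; 'news' is just one of the corpus's built-in categories):

import nltk
from nltk.corpus import brown

# the Brown corpus ships as an NLTK data package
nltk.download('brown')

# words are grouped by category (news, fiction, romance, ...)
print(brown.categories()[:5])
print(brown.words(categories='news')[:10])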

segmentation → tokenize → tag part of speech (POS) → identify named entities

source: http://www.nltk.org/book_1ed/ch07.html

entity detection using chunking
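One way the chunking step can be used to pull plain entity names out of the tree returned by nltk.ne_chunk. A sketch, assuming the punkt, tagger and maxent_ne_chunker data packages are installed; extract_entities is just an illustrative helper name:

import nltk

def extract_entities(sentence):
    # tokenize, POS-tag, then chunk the tagged tokens into a tree
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tags)
    # named-entity chunks are subtrees labelled PERSON, GPE, ORGANIZATION, ...
    entities = []
    for subtree in tree.subtrees():
        if subtree.label() != 'S':
            entities.append((subtree.label(), ' '.join(word for word, tag in subtree.leaves())))
    return entities

print(extract_entities("I saw Mr. Washington in Dublin."))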

fundamentals of nlp

natural language toolkit (nltk)

running python and nltk on Azure

text as a sequence of words and punctuation represented as a list

sent = ['I', 'love', 'Dublin', '!']
upper_sent = [w.upper() for w in sent]

downloading corpora and lexical resources:
nltk.download('all')
nltk.download('brown')

segment text into sentences:
from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(text)

tokenize sentence:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(sentence)

tag part of speech (POS):
tags = nltk.pos_tag(tokens)

identify named entities:
entities = nltk.ne_chunk(tags)
entities.draw()
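Strung together, the snippets above form one pass over a multi-sentence text (a minimal sketch; it assumes the required NLTK data packages, e.g. punkt and the default tagger/chunker models, have been downloaded):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

text = "I saw Mr. Washington. Is this really a chainsaw?"

for sentence in sent_tokenize(text):      # 1. segment the text into sentences
    tokens = word_tokenize(sentence)      # 2. tokenize into words and punctuation
    tags = nltk.pos_tag(tokens)           # 3. tag part of speech
    entities = nltk.ne_chunk(tags)        # 4. chunk the tagged tokens into named entities
    print(entities)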

demo

language recognition:
import langid
lang = langid.classify(text)[0]
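langid.classify returns a (language, score) pair, which is why only element [0] is kept above. A small usage sketch, assuming the langid package is installed (pip install langid); the restricted language set is purely illustrative:

import langid

lang, score = langid.classify("I love Dublin!")
print(lang)                                       # 'en'

# optionally restrict the candidate languages before classifying
langid.set_languages(['en', 'de', 'fr'])
print(langid.classify("Ich liebe Dublin!")[0])    # 'de'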

fundamentals of nlp

natural language toolkit (nltk)

running python and nltk on Azure

azure cloud services
azure webjobs
azure functions

azure cloud services & python:
pip's requirements.txt
PowerShell scripts for setup and launch

azure webjobs & python:
upload zip (incl. dependencies)
runs run.py (or the first .py file it finds)

configuration settings:
key = os.environ["STORAGE_KEY"]

publish webjob:
pip install packages into site-packages
zip the application (incl. dependent packages)
upload the zip file

add package location to sys.path:
p = os.path.join(os.getcwd(), "site-packages")
sys.path.append(p)

downloading the corpus to D:\local\AppData\nltk_data:
if os.getenv("DOWNLOAD", True) == True:
    dest = os.environ["NLTK_DATA_DIR"]
    nltk.download('all', dest)
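Put together, the start of run.py might look roughly like this (a sketch, not the talk's exact code; NLTK_DATA_DIR and DOWNLOAD are assumed to be app settings configured on the Web App):

import os
import sys
import nltk

# make the packaged dependencies importable (they were zipped into site-packages)
sys.path.append(os.path.join(os.getcwd(), "site-packages"))

# download the NLTK data into the configured directory,
# e.g. D:\local\AppData\nltk_data on the WebJob host
dest = os.environ["NLTK_DATA_DIR"]
if os.getenv("DOWNLOAD", "True") == "True":
    nltk.download('all', download_dir=dest)

# tell NLTK where to find the downloaded data
nltk.data.path.append(dest)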

using queues for communication:
reads text from the input queue
writes processed text into output queues
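A minimal worker loop along these lines, sketched with today's azure-storage-queue package (the queue names, the AzureWebJobsStorage setting and process_text are illustrative, and the queues are assumed to already exist; the original demo may have used the older SDK):

import os
import time
import nltk
from azure.storage.queue import QueueClient

def process_text(text):
    # placeholder NLP step: POS-tag the tokens and return them as a string
    return str(nltk.pos_tag(nltk.word_tokenize(text)))

conn_str = os.environ["AzureWebJobsStorage"]
input_queue = QueueClient.from_connection_string(conn_str, "texts-in")
output_queue = QueueClient.from_connection_string(conn_str, "texts-out")

while True:
    for msg in input_queue.receive_messages():
        output_queue.send_message(process_text(msg.content))
        input_queue.delete_message(msg)     # delete only after processing succeeded
    time.sleep(5)                           # back off while the queue is empty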

auto scale based on queue length

debugging python webjobs:
local: VS and the webjob simulator
cloud: use Kudu (xyz.scm.azurewebsites.net) and logs

demo

in closing…

nltk is a great toolkit to perform nlp tasks
azure provides an elastic and scalable platform to run python nltk jobs

http://www.nltk.org/ http://www.nltk.org/book_1ed

http://azure.com/
