PyData
This talk covers rapid prototyping of a high performance scalable text processing pipeline development in Python. We demonstrate how Python modules, in particular from the Rosetta library, can be used to analyze, clean, extract features, and finally perform machine learning tasks such as classification or topic modeling on millions of documents. Our style is to build small and simple modules (each with command line interfaces) that use very little memory and are parallelized with the multiprocessing library.
(Easy), High Performance Text Processing with Python's Rosetta
Daniel Krasner
KFit Solutions, Columbia University
Nov 22, 2014
(IDSE) 1 / 52
Outline
1 Introduction
2 Dealing with Limited Memory
3 Making Things Faster!
4 Stepping Outside of Python (time permitting)
Outline
1 Introduction
2 Dealing with Limited Memory
3 Making Things Faster!
4 Stepping Outside of Python (time permitting)
Motivation: Current Text Processing Projects
The Declassification Engine
- Full stack Digital Archive.
  - Collections structuring/parsing.
  - Backend, API, UI.
  - Statistical analysis.
- Organizers
  - David Madigan - Statistics Department chair (CU)
  - Matthew Connelly - Professor of international and global history (CU)
- For more info see http://www.declassification-engine.org/.
Motivation: Current Projects Continued
eDiscovery
- The legal world is overwhelmed with documents, both in pre- and post-production review.
- Most technologies rely heavily on keyword search, which is not efficient.
- "Predictive coding" solutions are generally archaic and inaccurate.

Other
- Human - text/document interaction.
- Semantic filtering solutions.
Motivation: Data Structuring/Information Extraction
Text can come in many formats, encodings, and degrees of structure.
Figure: raw xml
Motivation: Data Structuring/Information Extraction
Many tasks involve initial structuring of the data.
Figure: structured api response
Motivation: Network Analysis
Kissinger telcons can be analyzed for frequency. Even a simple version of this type of analysis requires the data to be structured, or an information extraction process in place.
Figure:
Motivation: Semantic Modeling
State department cables from embassies can be analyzed for topics.
Moscow is predominantly topic 12
soviet    0.133910
moscow    0.128717
october   0.090400
joint     0.052875
ussr      0.044190
soviets   0.042493
ur        0.027686
mfa       0.025786
refs      0.023871
prague    0.021268
London is predominantly topic 13
london     0.114568
bonn       0.083748
rome       0.074385
uk         0.051367
frg        0.050235
berlin     0.035972
usmission  0.031757
british    0.029836
european   0.027203
brussels   0.025023
Motivation: Classification
Determine which documents are relevant to a legal case.
Figure: eDiscovery processing pipeline
Feature Extraction
Figure: Metadata + body text ⇒ features
The metadata features can be used in any classifier
Tasks
What are some typical text processing goals?

Structuring/Information Extraction
- Metadata extraction
  - Geo-tagging
  - Named-entity identification/disambiguation
- Text cleaning/text body extraction

Machine Learning
- Classification
  - logistic regression, random forests, etc.
- Sentiment analysis
- Recommendation systems
- Anomaly detection
- Understanding underlying semantic structure (e.g. LDA/LSI modeling)
- Communication dynamics
Note: this is far from a complete list!
General Flow
Figure: General Flow
What’s so hard about text processing?
- Must use sparse data structures
- Data usually doesn't fit in memory
- It's language...
- Text structure can change from collection to collection (parsing/feature extraction can be tricky)
- Can be difficult to convert to nice machine-readable formats
- HUGE data drives much of software development, leading to products too complicated for most applications
- No simple solution (e.g. the "sklearn/statsmodels/pandas/numpy" stack)
What’s so fun about text processing?
- You get to care about memory and processing speed
- It's language
  - spelling, stemming, etc.
  - parsing
  - tokenization
  - domain knowledge
- Unix plays nicely with text
- Python plays nicely with text
- You get to step outside the Python ecosystem
Outline
1 Introduction
2 Dealing with Limited Memory
3 Making Things Faster!
4 Stepping Outside of Python (time permitting)
Cluster? No.
One powerful machine and memory/CPU-conscious code can handle many tasks with 1TB of text.

- System76 laptop, pre-loaded with Ubuntu: 16GB memory, 4 cores, 1TB SSD. $1,823
  - Can also get a MacBook Pro for about $3,600
- You can spend $6-7k and get 20 cores, 64GB memory (upgradeable to 256GB), 1-2TB RAID SSD. Stick it in a closet with lots of fans or in a server center for $250/month.
- Single machine on AWS ($1-2k per year) or on Digital Ocean (they have nice SSDs).
Back-of-the-envelope memory calculations
Text:
- Same as on disk if the file is large enough (e.g. 10MB)
- Can be much more if you load many small files and append to a list.

Numbers:
- 1 double = 8 bytes ⇒ 1,000,000 doubles ≈ 8MB.
- You can save space by using dtypes "int8", "float16", "float32", etc. See the NumPy dtype docs.
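The 8-byte figure is easy to verify with the standard-library array module (a minimal sketch; with NumPy you would check arr.nbytes and pick a narrower dtype such as float32 instead):

```python
from array import array

# One million doubles ("d" = an 8-byte C double) -> about 8 MB.
doubles = array("d", [0.0] * 1_000_000)
print(doubles.itemsize * len(doubles))  # 8000000 bytes

# Four-byte floats ("f") halve the footprint, at reduced precision.
floats = array("f", [0.0] * 1_000_000)
print(floats.itemsize * len(floats))  # 4000000 bytes
```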
Monitoring Memory with HTOP
- 4 cores, 4 virtual cores, all in use
- 2821/15946 MB memory in use
- processes listed below

For Macs, htop doesn't necessarily work. If it doesn't, try iStat.
Don't blow up memory, stream!

Process files (text) line-by-line:

with open(infile, "r") as f, open(outfile, "w") as g:
    for line in f:
        line_output = process_line(line)
        # Write output NOW
        g.write(line_output)

Process directories one file at a time:

from rosetta.text.filefilter import get_paths

my_paths_iter = get_paths(MYDIR, get_iter=True)

for path in my_paths_iter:
    output = process_file(path)
    # Now write the output to file
Rosetta Text File Streamer
Set up a text file streamer class for processing.

# stream from a file system directory
stream = TextFileStreamer(text_base_path=MYDIR,
                          tokenizer=MyTokenizer)

# call info_stream, which yields dicts with
# doc_id, text, tokens, etc.
for item in stream.info_stream():
    # print the document text
    print item["text"]
    # 'This is my text.'
    # print the tokens
    print item["tokens"]
    # ['this', 'is', 'my', 'text']
Rosetta MySQL Streamer
Set up a database streamer class for processing.

# stream from a DB
stream = MySQLStreamer(db_setup=DBCONFIG,
                       tokenizer=MyTokenizer)

# Convert to a scipy sparse matrix
# and cache some data along the way
sparse_mat = stream.to_scipysparse(
    cache_list=["doc_id", "date"])

# grab the cached doc_ids and dates
doc_ids = stream.__dict__["doc_id_cache"]
dates = stream.__dict__["date_cache"]
(Online) Stochastic Gradient Descent

Combine the above with an online learning algorithm.

To minimize the empirical loss

    \sum_{i=1}^{n} |y_i - w \cdot x_i|^2

update the coefficient w one training example at a time:

    w^{(t+1)} := w^{(t)} - \eta_t \nabla_w |y_t - w^{(t)} \cdot x_t|^2.

Note:
- The learning rate \eta_t decays to 0
- We can cycle through the training examples, updating the weights more than n times
- We only need to load one single example into memory at a time
- Converges faster for cases of many data points and coefficients

See Bottou, scikit-learn SGD, and Vowpal Wabbit.
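The update rule above can be sketched in a few lines of pure Python (a toy illustration only, not the Rosetta or Vowpal Wabbit implementation; the decay schedule and data here are made up):

```python
def sgd_fit(examples, n_passes=50, eta0=0.1):
    """Minimize sum_i |y_i - w * x_i|^2 for scalar x, one example at a time."""
    w, t = 0.0, 0
    for _ in range(n_passes):              # cycle through the data more than once
        for x, y in examples:              # only one example in memory at a time
            t += 1
            eta = eta0 / (1.0 + 0.01 * t)  # learning rate decays toward 0
            grad = -2.0 * (y - w * x) * x  # gradient of |y - w*x|^2 in w
            w -= eta * grad
    return w

# Data drawn from y = 3x, so w should converge near 3.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = sgd_fit(data)
```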
Dealing with limited memory: Summary
- Monitor memory usage
- Stream process
  - and cache what you need along the way
- Deal with huge feature counts by...
  - Using a sparse data structure and stochastic gradient descent. Or...
  - Reducing the number of features to something that fits into a dense matrix
Outline
1 Introduction
2 Dealing with Limited Memory
3 Making Things Faster!
4 Stepping Outside of Python (time permitting)
Goal: Tokenization
# Steps: split on whitespace, set to lowercase,
# remove non-letters, punctuation, and stopwords
stopwords_list = "lets,do,a,the,and".split(",")  # And more
punct = ["|", "'", "[", "]", "{", "}", "(", ")"]

# Here's the function call/result we want
text = "Let's do a deal: Trade 55 Euros for 75 euros"
tokens = tokenize(text, punct, stopwords_list)
print tokens
# ['deal', 'trade', 'euros', 'euros']
First hack: For loop
def tokenize_for(text, punct, stopwords):
    tokens = []
    # Split the text on whitespace.
    for token in text.lower().split():
        # Remove punctuation
        clean_token = ''
        for char in token:
            if char not in punct:
                clean_token += char
        # Remove stopwords.
        if clean_token.isalpha() and len(clean_token) > 1 \
                and (clean_token not in stopwords):
            tokens.append(clean_token)
    return tokens
Profile: Time your code

regex_test.py

def main():
    stopwords_list = "lets,do,a,the,and".split(",")  # And more
    text = # Something pretty typical for your application
    for i in range(10000):
        tokens = tokenize_for(text, stopwords_list)

if __name__ == "__main__":
    main()

time python regex_test.py

real 0m0.128s
user 0m0.120s
sys  0m0.008s
Profile: Switch to a regex

bad_char_pattern = r"\||'|\[|\]|\{|\}|\(|\)"

def tokenize_regex_1(text, bad_char_pattern, stopwords):
    # Substitute empty string for the bad characters
    text = re.sub(bad_char_pattern, '', text).lower()
    # Split on whitespace, keeping strings of length > 1
    split_text = re.findall(r"[a-z]+", text)
    tokens = []
    for word in split_text:
        if word not in stopwords:
            tokens.append(word)
    return tokens

time python regex_test.py

real 0m0.091s
user 0m0.083s
sys  0m0.008s
Profile: Line-by-line readout

Add an @profile decorator to your tokenize_regex function, and:

pip install line_profiler
kernprof.py -l regex_test.py
python -m line_profiler regex_test.py.lprof | less
Figure: line profiler output shows the for loop and if statement are slow.
Profile: Use a set

regex_test.py

def main():
    stopwords_list = 'lets,do,a,the,and'.split(',')  # And more
    stopwords_set = set(stopwords_list)
    for i in range(1000):
        text = # Something pretty typical for your application
        tokens = tokenize_regex_1(text, stopwords_set)

if __name__ == '__main__':
    main()

- Reduces time from 0.091 to 0.043
- Testing item in my_set requires a hash function computation
- Testing item in my_list requires looking at every item in my_list.
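The set-vs-list gap is easy to demonstrate with the standard-library timeit module (a minimal sketch; the container size and lookup counts are arbitrary, and absolute timings are machine-dependent):

```python
import timeit

words = ["word%d" % i for i in range(10000)]
word_list = list(words)
word_set = set(words)

# Membership gives the same answer either way...
assert ("word9999" in word_list) and ("word9999" in word_set)
assert ("missing" not in word_list) and ("missing" not in word_set)

# ...but the set hashes once while the list scans every element.
t_list = timeit.timeit('"word9999" in word_list', globals=globals(), number=500)
t_set = timeit.timeit('"word9999" in word_set', globals=globals(), number=500)
print(t_list > t_set)  # list lookup is typically orders of magnitude slower
```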
Profile: IPython timeit
Figure: Set lookup is O(1). So don’t ever test item in my list for long lists.
Profile: Line-by-line profile again
- Switching to a set sped up the if statement.
- The for loop can still be faster.
Profile: Switch to a list comprehension
- Reduced time from 0.033s to 0.025s
- A list comprehension is essentially a for loop with a fast append
- Looks nicer in this case
- Be sure to use time python regex_test.py for the total time!
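The slide does not show the final code; a sketch of what the list-comprehension rewrite might look like (the function name tokenize_regex_2 is my own, not from the deck):

```python
import re

BAD_CHAR_PATTERN = r"\||'|\[|\]|\{|\}|\(|\)"

def tokenize_regex_2(text, stopwords):
    # Strip bad characters and lowercase, then pull out runs of letters.
    text = re.sub(BAD_CHAR_PATTERN, "", text).lower()
    # The explicit for/append loop collapses into one comprehension.
    return [w for w in re.findall(r"[a-z]+", text) if w not in stopwords]

stopwords = set("lets,do,a,the,and".split(","))
tokens = tokenize_regex_2("Let's do a deal: Trade 55 Euros for 75 euros", stopwords)
```

(The stray "for" survives in tokens because the slide's stopword list is deliberately truncated with "# And more".)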
Data structures
Think about the data structures (and associated methods) you are using for the task at hand!

- Many data-analysis-friendly languages (e.g. Python, R) have very convenient built-in data structures.
- These can come with significant lookup and operation overhead.
  - example: set lookup vs list lookup, as above
  - example: Python does not allocate a contiguous memory block for dictionaries, making them slower than a data structure which tells the interpreter how much space will be needed
- You can (easily) create your own data structure for the task at hand. See Saulius Lukauskas's nice post "Why Python Runs Slow. Part 1: Data Structures."
Parallelization
Much of text processing is embarrassingly parallel.
Figure: Word counts for individual documents can be computed independently.
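The figure's point can be sketched directly with multiprocessing (a toy example; the documents are invented, and for inputs this small the pickling overhead dwarfs any speedup):

```python
from collections import Counter
from multiprocessing import Pool

def word_count(doc):
    # Each document is counted independently -- no shared state needed.
    return Counter(doc.lower().split())

docs = [
    "the cable from moscow",
    "the cable from london",
    "moscow cables moscow",
]

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        per_doc_counts = pool.map(word_count, docs)
    totals = sum(per_doc_counts, Counter())
    print(totals["moscow"])  # 3
```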
Parallelization: Basic mapping
Serial mapping:

>>> def func(x):
...     return 2 * x
>>> iterable = range(3)
>>> map(func, iterable)
[0, 2, 4]

Parallel mapping:

>>> from multiprocessing import Pool
>>> def func(x):
...     return 2 * x
>>> my_pool = Pool(processes=4)
>>> iterable = range(3)
>>> my_pool.map(func, iterable)
[0, 2, 4]

1. Spawns 4 subprocesses
2. Pickles func, iterable and pipes them to the subprocesses
3. The subprocesses compute their results
4. Subprocesses pickle/pipe the results back to the mother process
Parallelization: Basic mapping issues
Serial mapping:

>>> def func(x):
...     return 2 * x
>>> map(func, range(3))
[0, 2, 4]

Parallel mapping:

>>> from multiprocessing import Pool
>>> def func(x):
...     return 2 * x
>>> my_pool = Pool(processes=4)
>>> iterable = range(3)
>>> my_pool.map(func, iterable)
[0, 2, 4]

Issues:
- What about functions of more than one variable?
- Pickling not possible for every function
- Can't step a debugger into pool calls
- Traceback is uninterpretable
- Can't exit with Ctrl-C
- Entire result is computed at once ⇒ memory blow-up!
Parallelization: Mapping functions of more than one var
>>> from multiprocessing import Pool
>>> from functools import partial

>>> def func(a, x):
...     return 2 * a * x

>>> a = 3
>>> func_a = partial(func, a)  # func_a(x) = func(a, x)

>>> Pool(4).map(func_a, range(3))
[0, 6, 12]
Parallelization: Dealing with map issues
def map_easy(func, iterable, n_jobs):
    if n_jobs == 1:
        return map(func, iterable)
    else:
        _trypickle(func)
        pool = Pool(n_jobs)
        timeout = 1000000
        return pool.map_async(func, iterable).get(timeout)

- _trypickle(func) tries to pickle func before mapping
- n_jobs == 1 ⇒ serial (debuggable/traceable) execution
- pool.map_async(func, iterable).get(timeout) allows exit with Ctrl-C
Parallelization: Limiting memory usage

Send out/return jobs in chunks:

def imap_easy(func, iterable, n_jobs, chunksize,
              ordered=True):
    if n_jobs == 1:
        results_iter = itertools.imap(func, iterable)
    else:
        _trypickle(func)
        pool = Pool(n_jobs)
        if ordered:
            results_iter = pool.imap(func, iterable,
                                     chunksize=chunksize)
        else:
            results_iter = pool.imap_unordered(
                func, iterable, chunksize=chunksize)
    return results_iter

Note: Exit with Ctrl-C is more difficult. See rosetta.parallel
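Chunked lazy mapping is available in the standard library itself; a minimal usage sketch of the pattern imap_easy wraps (the function and numbers here are invented for illustration):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # imap yields results lazily, shipping `chunksize` jobs at a time,
        # so the full result list never needs to sit in memory at once.
        results_iter = pool.imap(square, range(10), chunksize=3)
        total = sum(results_iter)
    print(total)  # 285
```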
Making things faster: Summary
- Use regular expressions
- Use the right data structure
  - Numpy/Pandas for numbers (use built-in functions/numba/cython, NOT for loops)
  - sets if you will test some_item in my_set
- Profile your code
  - time python myscript.py
  - timeit in IPython
  - line_profiler (using kernprof.py)
- Use multiprocessing.Pool
- A number of the Rosetta streamer methods have multiprocessing built in (see rosetta.text.streamers)

NOTE: the above examples are in Python for convenience but are relevant in many (most) other scenarios
Outline
1 Introduction
2 Dealing with Limited Memory
3 Making Things Faster!
4 Stepping Outside of Python (time permitting)
LDA (in a slide)
Latent Dirichlet Allocation, by Blei, Ng and Jordan, is a hierarchical Bayesian model which describes the underlying semantic structure of a document corpus via a set of latent distributions over the vocabulary.

- The latent semantic distributions are referred to as "topics."
- Each document is modeled as a mixture of these topics.
  - the number of topics is chosen a priori.
- Words in a document are drawn by
  - choosing a topic, given the document mixture weights,
  - sampling from that topic.
- Hyperparameters:
  - lda_alpha: prior which controls the topic probabilities/weights.
    - lda_alpha 0.1: θ_d ∼ Dirichlet(α)
  - lda_rho: prior which controls the word probabilities.
    - lda_rho 0.1: β_k ∼ Dirichlet(ρ)
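In symbols, the generative process sketched above (with the slide's α and ρ; K topics, D documents, and N_d words in document d) is:

```latex
\begin{align*}
\beta_k  &\sim \mathrm{Dirichlet}(\rho),        & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha),      & d &= 1,\dots,D \\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_d),  & n &= 1,\dots,N_d \\
w_{d,n}  &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
\end{align*}
```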
Vowpal Wabbit: What/Why
Can you build a topic model with 1,000,000 documents using gensim?

Sure... if you have 10 hours or so to kill.

Better solution: Vowpal Wabbit
- Online stochastic gradient descent ⇒ memory independent, optimal for huge data sets
- Highly optimized C++ ⇒ fast

However...
- Interface is CLI and the input/output files are not very usable
Vowpal Wabbit: Python to the rescue

Principles:
- Make getting data into/out of VW easy
- Don't wrap the VW CLI (or if you do, use the subprocess module to make calls, not os.system)

# Convert text files in a directory structure to vw format
stream = TextFileStreamer(
    text_base_path='bodyfiles', tokenizer=my_tokenizer)
stream.to_vw('myfiles.vw', n_jobs=-1)

# Explore token counts and filter tokens in a DataFrame
sff = SFileFilter(VWFormatter()).load_sfile('myfiles.vw')

# Create a DataFrame representation
df = sff.to_frame()
Vowpal Wabbit: Python to the rescue
# Create a filtered version of your sparse file
sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8)
sff.compactify().filter_sfile('myfiles.vw', 'myfiles_filtered.vw')

# Back to bash to run VW
vw --lda 5 --cache_file ddrs.cache --passes 10 \
   -p prediction.dat --readable_model topics.dat \
   --bit_precision 16 myfiles_filtered.vw

# Look at the results in DataFrames
lda = LDAResults(
    'topics.dat', 'prediction.dat', num_topics, sff)
lda.print_topics()

See rosetta/examples/vw_helpers.md
Steps with VW
Step 1: Convert files to VW input format

1 0000BC34| saying:1 antunes:4 goncalves:3 scientist:1 ...
1 0000C1AE| shot:1 help:1 september:2 luxembourg:1 ...
1 0000BBA7| raised:1 chinese:1 winston:1 authority:1 ...

Step 2: View the tokens in a DataFrame

        doc_freq
tokens
war           58
china         77
...

Step 3: Filter tokens and hash them

1 0000BC34| 3423211:1 111:4 43454:3 989794:1 ...
1 0000C1AE| 338:1 3123:1 19393:2 3232321:1 ...
1 0000BBA7| 1191:1 69830:1 398:1 974949:1 ...
Steps with VW
Step 4: Run VW

vw --lda 5 --cache_file ddrs.cache --passes 10 ...

Step 5: View the results

        topic_0  topic_1
tokens
war         0.2      0.8
china       0.4      0.6
See rosetta/examples/vw_helpers.md
Summary
- Pay attention to memory
- Pay attention to data structures
- Profile for performance
- Parallelization is easy for many text-processing tasks
- Use Python to make stepping outside the Python world easier
- Also, don't forget CLI and UNIX
Bibliography
M. Connelly et al.
Declassification Engine. Ongoing project at Columbia University.
http://www.declassification-engine.org/
https://github.com/declassengine/declass

D. Krasner and I. Langmore
Applied Data Science, lecture notes.
http://columbia-applied-data-science.github.io/appdatasci.pdf

The Rosetta team
Tools for data science with a focus on text processing.
https://github.com/columbia-applied-data-science/rosetta
Clone, submit issues on github, fork, contribute!
THANK YOU!
contact: [email protected]
Rosetta
https://github.com/columbia-applied-data-science/rosetta
Open Source Python Text Processing Library
Feel free to use, fork, submit issues and contribute!