INDEXING TEXT DOCUMENTS ON GPU

CAN YOU INDEX THE WEB IN REAL TIME?

Michael Frumkin (NVIDIA)

Specifically, we concentrate on text processing used to index web documents. We present indexing algorithms for both GPU and CPU and show that GPU...


Page 1:

INDEXING TEXT DOCUMENTS ON GPU

CAN YOU INDEX THE WEB IN REAL TIME?

Michael Frumkin (NVIDIA)

Page 2:

HIGH-LEVEL STEPS FOR INDEXING

Periodically collect all webpages

Index the pages by terms or phrases

— ASCII and UTF-8 encoding, remove HTML and XML tags

Generate index

Distribute index to the serving clusters

Serve search queries and feed Knowledge/Intelligence engines

Page 3:

WHAT IS AN INDEX OF THE WEB?

A sparse matrix of size on the order of 10^7 x 10^11

The columns represent documents and are sorted by page rank

The rows represent terms and are sorted lexicographically

Each matrix element is a list of locations of the term in the document
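A minimal in-memory sketch of this structure in C++ (type and helper names are mine, not from the talk): a lexicographically sorted map from term to postings, where each posting holds the term's locations in one document.

```cpp
#include <map>
#include <string>
#include <vector>

// One matrix element: the list of locations of a term in one document.
// std::map keeps the rows (terms) lexicographically sorted, matching the
// slide; within a row, postings are keyed by document id.
using Postings = std::map<int, std::vector<int>>;  // doc id -> term locations
using Index    = std::map<std::string, Postings>;  // term   -> postings

inline void AddHit(Index* index, const std::string& term, int doc, int loc) {
  (*index)[term][doc].push_back(loc);
}
```

In a production index the columns would additionally be ordered by page rank rather than raw document id.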

Page 4:

DATA FLOW IN AN INDEXING CLUSTER

[Diagram: Internet -> Document Bucket -> Document Batches -> GPU/CPU indexing -> Index Bucket -> Serving Cloud; two identical pipelines shown in parallel]

Page 5:

POTENTIAL INDEXING BOTTLENECKS

Data size characterization (based on 42 GB Wikipedia)

PCIe limit: docs => device -- do indexing => host (123 K docs/s)

Memory BW limit (1M docs/s):

— Data expansion (word location, line numbers)

— 10 comparisons per word (for average doc)

SM limit: 4-way divergence (800 K docs/s)

Overall upper bound 123 K docs/s (PCIe Gen3)

— achieved 23 K docs/s

1 T docs (20 PB of data) per day on a cluster with 1000 GPUs
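The cluster-level claim can be sanity-checked from the per-GPU rates on this slide (a back-of-the-envelope sketch; the helper name is mine): 1 T docs/day over 1000 GPUs needs roughly 11.6 K docs/s per GPU, below the achieved 23 K docs/s and well below the 123 K docs/s PCIe bound.

```cpp
// Per-GPU throughput required to hit a cluster-wide docs/day target.
inline double DocsPerSecondPerGpu(double docs_per_day, int gpus) {
  const double kSecondsPerDay = 24.0 * 3600.0;
  return docs_per_day / (gpus * kSecondsPerDay);
}
```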

Page 6:

SINGLE BOARD FOR INDEXING

[Diagram: Document Batch -> Preprocess, then two paths. Indexing on CPU: Tokenize -> Map -> Reduce -> Index of a batch -> Index Bucket. Indexing on GPU: Tokenize -> Split -> Bucket Sort -> Reduce -> Index of a batch -> Index Bucket]

Page 7:

INDEXING ON CPU

Process a batch using 12 (p)threads

Tokenizer

— Splits a line into terms

Map

map<string, set<int> > terms;
for (int i = 0; i < content.size(); ++i) {
  vector<string> fields;
  Tokenize(content[i], &fields);
  for (int j = 0; j < fields.size(); ++j)
    terms[fields[j]].insert(i);
}

Reduce

— A single scan over the map generates the index string

Page 8:

INDEXING ON GPU Tokenizer

— Single thread per doc (parallel version not in this talk)

Splitter

— Single thread per doc

BucketSort

— 32 threads per doc, sensitive to load balancing

Reduce

— Single thread per doc

Page 9:

TOKENIZER

Single scan over the doc

Finds term boundaries

Computes a histogram (2^24 buckets)

Uses static splitters of the docs

— Quintiles are the ideal splitters, but more expensive

Packing of the location info:

— (term_size & 0xFF) << 24 | (term_offset & 0x00FFFFFF)

Document content is immutable
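The packed location format above can be written out as follows (the bit layout is from the slide; the helper names are mine): term size in the top 8 bits, byte offset in the low 24 bits, so one 32-bit word locates a term in the document.

```cpp
#include <cstdint>

// Pack a term's size (up to 255) and its byte offset in the document
// (up to 2^24 - 1) into a single 32-bit word, as on the slide.
inline uint32_t PackLocation(uint32_t term_size, uint32_t term_offset) {
  return (term_size & 0xFF) << 24 | (term_offset & 0x00FFFFFF);
}

inline uint32_t UnpackSize(uint32_t loc)   { return loc >> 24; }
inline uint32_t UnpackOffset(uint32_t loc) { return loc & 0x00FFFFFF; }
```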

[Figure: the GPU tokenizer splits "banana tree grows fast in Hawaii" into the terms banana, tree, grows, fast, ...]

Page 10:

SPLITTER

Buckets filled by Tokenizer are very uneven, many are empty

Splitter spreads these buckets across 32 big buckets as evenly as possible

Does not deal with outliers (e.g. buckets of size > 100 K)

Output: 32 ordered buckets of unsorted terms
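One way the splitter could work (an assumed sketch, not the talk's implementation): walk the histogram in order and start a new big bucket each time the running term count crosses the next 1/32 of the total, yielding ordered big buckets of roughly equal size.

```cpp
#include <cstdint>
#include <vector>

// Choose the first histogram bucket of each big bucket so that term
// counts are spread across the big buckets as evenly as possible.
inline std::vector<int> SplitPoints(const std::vector<uint32_t>& histogram,
                                    int big_buckets = 32) {
  uint64_t total = 0;
  for (uint32_t h : histogram) total += h;
  std::vector<int> starts;  // first histogram bucket of each big bucket
  uint64_t running = 0, next_target = 0;
  for (int i = 0; i < static_cast<int>(histogram.size()); ++i) {
    if (static_cast<int>(starts.size()) < big_buckets &&
        running >= next_target) {
      starts.push_back(i);
      next_target = (starts.size() * total) / big_buckets;
    }
    running += histogram[i];
  }
  return starts;
}
```

A greedy prefix split like this stays cheap (one pass) but, as the slide notes, cannot break up a single outlier bucket.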

Page 11:

BUCKET SORT - INNERMOST LOOP

Key for performance

Merge(const TermComparator& comparator,
      const uint* loc1, const uint* loc2,
      uint* scratch, int terms_num) {
  int pos1 = 0, pos2 = 0, dst = 0;
  for (int i = 0; i < terms_num; ++i) {
    if (comparator.less(loc1[pos1], loc2[pos2])) {
      scratch[dst] = loc1[pos1++];
    } else {
      scratch[dst] = loc2[pos2++];
    }
    ++dst;
    // Code handling boundary checks
  }
}

Small divergence and the right data flow are key to high GPU performance

Page 12:

OVERLAP OF COMMUNICATIONS WITH COMPUTATIONS

PCIe is the resource that will be saturated first (after additional 5x speedup)

Send each batch to its own stream

Use cudaMemcpyAsync

This should fully hide any communications

Can’t use cudaFree for multi-streaming. It serializes pending streams

Hence we can’t use cudaMalloc, cudaMallocHost

=> Custom Memory Management

Page 13:

MEMORY POOL

cudaMalloc and cudaMallocHost must be called only at the beginning of the program

MemPool class for Host and Device

— Calls cudaMalloc and cudaMallocHost once, at startup

— Overloads all other calls to cudaMalloc and cudaMallocHost

Minimal change to the code

Double buffering

Trade-off

— num_batches_in_flight * MemPool_size < GGR_size

Page 14:

MULTI-STREAMING WIKIPEDIA BUCKET 0

Page 15:

MULTI-STREAMING WIKIPEDIA BUCKET 13

Page 16:

KEYS FOR HIGH PERFORMANCE

Coalesce CPU/GPU IO

Overlap I/O with computations

Minimize random access, use Shared Memory

[Chart: time in seconds (0-80) vs. bucket number (0-15) for 12-core Sandy Bridge, K20Xm, CPU no IO overlap, K20c opt 1, K20c IO overlap, K20c opt 2]

Page 17:

PERFORMANCE: CPU VS GPU

Literature collection (4200 docs, 92 MB): 3.1x faster

Wikipedia (about 7 M docs, 42 GB): 2.2x faster

[Chart: time in seconds (0-50) vs. bucket number (0-16) for 12-core Sandy Bridge vs. K20Xm]

Page 18:

LUCENE: JAVA-CUDA INTERFACE

Lucene/Solr: Apache indexing/search projects

LucidWorks develops commercial indexing and search engines based on Lucene/Solr

GPU indexer worked smoothly with Java

[Diagram: Docs -> Indexer.java -> Indexer.so (C++) -> CPU indexer -> Index; Docs -> Indexer (C++/CUDA) -> GPU indexer -> Index]

Page 19:

SUMMARY

CPU: STL-based map and set, fully parallel

GPU: BucketSort

Document sets:

— Literature collection: 4200 docs

— Wikipedia: 7 M docs split into 16 buckets

GPU 3.1x faster on Literature collection and 2.2x faster on Wikipedia

— K20Xm, 14 SMs, 732 MHz, vs i7 6 cores (2x hyper-threaded) @ 3.2 GHz

— 3.4 K docs/s (Literature), 23.1 K docs/s (Wikipedia)

Theoretical limiting resource: PCIe Gen3 - 123K docs/s

Indexing Wikipedia on 16 GPUs in 31 s (19 s average per bucket)

Page 20:

Questions?

[Diagram: Internet -> Document Bucket -> GPU/CPU indexing -> Index Bucket -> My Analytics Engine]