150
James Saryerwinnie A case study in multi-threading, multi-processing, and asyncio Downloading a Billion Files in Python @jsaryer

Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

  • Upload
    others

  • View
    14

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

J a m e s S a r y e r w i n n i e

A case study in multi-threading, multi-processing, and asyncio

Downloading a Billion Files in Python

@ j s a r y e r

Page 2: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task

Page 3: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task

There is a remote server that stores files

Page 4: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task

There is a remote server that stores files

The files can be accessed through a REST API

Page 5: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task

There is a remote server that stores files

The files can be accessed through a REST API

Our task is to download all the files on the remote server to our client machine

Page 6: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

Page 7: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

What client machine will this run on?

Page 8: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

We have one machine we can use, 16 cores, 64GB memory

What client machine will this run on?

Page 9: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

We have one machine we can use, 16 cores, 64GB memory

What client machine will this run on?

What about the network between the client and server?

Page 10: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

We have one machine we can use, 16 cores, 64GB memory

Our client machine is on the same network as the service with remote files

What client machine will this run on?

What about the network between the client and server?

Page 11: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

We have one machine we can use, 16 cores, 64GB memory

Our client machine is on the same network as the service with remote files

What client machine will this run on?

What about the network between the client and server?

How many files are on the remote server?

Page 12: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

We have one machine we can use, 16 cores, 64GB memory

Our client machine is on the same network as the service with remote files

Approximately one billion files, 100 bytes per file

What client machine will this run on?

What about the network between the client and server?

How many files are on the remote server?

Page 13: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

We have one machine we can use, 16 cores, 64GB memory

Our client machine is on the same network as the service with remote files

Approximately one billion files, 100 bytes per file

What client machine will this run on?

What about the network between the client and server?

How many files are on the remote server?

When do you need this done?

Page 14: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

We have one machine we can use, 16 cores, 64GB memory

Our client machine is on the same network as the service with remote files

Approximately one billion files, 100 bytes per file

What client machine will this run on?

What about the network between the client and server?

How many files are on the remote server?

Please have this done as soon as possible

When do you need this done?

Page 15: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Files

Page Page Page

File Server Rest API

Page 16: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

File Server Rest API GET /list

Files

Page Page

FileNames

NextMarker

Page 17: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

File Server Rest API GET /list

Files

Page Page

FileNames

NextMarker

{"FileNames": [ "file1", "file2", ...], "NextMarker": "pagination-token"}

Page 18: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

File Server Rest API GET /list?next-marker=token

Files

Page Page

FileNames

NextMarker

Page 19: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

File Server Rest API GET /list?next-marker=token

Files

Page Page

FileNames

NextMarker

{"FileNames": [ "file1", "file2", ...], "NextMarker": "pagination-token"}

Page 20: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

File Server Rest API

GET /list

GET /get/{filename}

{"FileNames": ["file1", "file2", ...]}

{"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"}

(File blob content)

GET /list?next-marker={token}

Page 21: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Caveats

This is a simplified case study.

The results shown here don't necessarily generalize.

Not an apples to apples comparison, each approach does things slightly different

Sometimes concrete examples can be helpful

Page 22: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Caveats

This is a simplified case study.

The results shown here don't necessarily generalize.

Not an apples to apples comparison, each approach does things slightly different

Always profile and test for yourself

Sometimes concrete examples can be helpful

Page 23: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous Version

Simplest thing that could possibly work.

Page 24: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

Page Page Page

Page 25: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

PagePage

Page 26: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

PagePage

Page 27: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

PagePage

Page 28: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

PagePage

Page 29: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

PagePage

Page 30: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

PagePage

Page 31: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

PagePage

Page 32: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

Page

Page 33: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

Page

Page 34: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

Page

Page 35: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

Page

Page 36: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

Page

Page 37: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

Page

Page 38: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous

Page

Page 39: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get'

response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)

Page 40: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get'

response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextMarker"]}') response.raise_for_status() content = json.loads(response.content)

Page 41: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get'

response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextMarker"]}') response.raise_for_status() content = json.loads(response.content)

Page 42: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get'

response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextMarker"]}') response.raise_for_status() content = json.loads(response.content)

Page 43: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get'

response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextMarker"]}') response.raise_for_status() content = json.loads(response.content)

Page 44: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get'

response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)

Page 45: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def download_file(remote_url, local_filename): response = requests.get(remote_url) response.raise_for_status() with open(local_filename, 'wb') as f: f.write(response.content)

Page 46: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Synchronous Results

Page 47: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.003 seconds

Synchronous Results

Page 48: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.003 seconds

One billion requests 3,000,000 seconds

Synchronous Results

Page 49: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

833.3 hours

One request 0.003 seconds

One billion requests 3,000,000 seconds

Synchronous Results

Page 50: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

833.3 hours34.7 days

One request 0.003 seconds

One billion requests 3,000,000 seconds

Synchronous Results

Page 51: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreading

Page 52: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreading

List Files can't be parallelized.

queue.Queue But Get File can be parallelized.

Page 53: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreading

List Files can't be parallelized.

queue.Queue But Get File can be parallelized.

Page 54: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreading

List Files can't be parallelized.

One thread calls List Files and puts the filenames on a queue.Queue

queue.Queue But Get File can be parallelized.

Page 55: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreading

List Files can't be parallelized.

WorkerThread-1

WorkerThread-2

WorkerThread-3One thread calls List Files and puts the filenames on a queue.Queue

queue.Queue But Get File can be parallelized.

Page 56: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreading

List Files can't be parallelized.

WorkerThread-1

WorkerThread-2

WorkerThread-3One thread calls List Files and puts the filenames on a queue.Queue

queue.Queue But Get File can be parallelized.

Page 57: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreading

List Files can't be parallelized.

WorkerThread-1

WorkerThread-2

WorkerThread-3One thread calls List Files and puts the filenames on a queue.Queue

queue.Queue But Get File can be parallelized.

Page 58: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreading

List Files can't be parallelized.

WorkerThread-1

WorkerThread-2

WorkerThread-3One thread calls List Files and puts the filenames on a queue.Queue

queue.Queue

Results Queue

Result thread prints progress, tracks overall results, failures, etc.

Page 59: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def download_files(host, port, outdir, num_threads): # ... same constants as before ...

work_queue = queue.Queue(MAX_SIZE) result_queue = queue.Queue(MAX_SIZE)

threads = [] for i in range(num_threads): t = threading.Thread( target=worker_thread, args=(work_queue, result_queue)) t.start() threads.append(t) result_thread = threading.Thread(target=result_poller, args=(result_queue,)) result_thread.start() threads.append(result_thread)

# ...

response = requests.get(list_url)

Page 60: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def download_files(host, port, outdir, num_threads): # ... same constants as before ...

work_queue = queue.Queue(MAX_SIZE) result_queue = queue.Queue(MAX_SIZE)

threads = [] for i in range(num_threads): t = threading.Thread( target=worker_thread, args=(work_queue, result_queue)) t.start() threads.append(t) result_thread = threading.Thread(target=result_poller, args=(result_queue,)) result_thread.start() threads.append(result_thread)

# ...

response = requests.get(list_url)

Page 61: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' outfile = os.path.join(outdir, filename) work_queue.put((remote_url, outfile)) if 'NextFile' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)

Page 62: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' outfile = os.path.join(outdir, filename) work_queue.put((remote_url, outfile)) if 'NextFile' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)

Page 63: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def worker_thread(work_queue, result_queue): while True: work = work_queue.get() if work is _SHUTDOWN: return remote_url, outfile = work download_file(remote_url, outfile) result_queue.put(_SUCCESS)

Page 64: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def worker_thread(work_queue, result_queue): while True: work = work_queue.get() if work is _SHUTDOWN: return remote_url, outfile = work download_file(remote_url, outfile) result_queue.put(_SUCCESS)

Page 65: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreaded Results - 10 threads

Page 66: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.0036 seconds

Multithreaded Results - 10 threads

Page 67: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.0036 seconds

One billion requests 3,600,000 seconds1000.0 hours

41.6 days

Multithreaded Results - 10 threads

Page 68: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multithreaded Results - 100 threads

Page 69: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.0042 seconds

Multithreaded Results - 100 threads

Page 70: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.0042 seconds

One billion requests 4,200,000 seconds1166.67 hours

48.6 days

Multithreaded Results - 100 threads

Page 71: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Why?

Not necessarily IO bound due to low latency and small file size

GIL contention, overhead of passing data through queues

Page 72: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Things to keep in mind

The real code is more complicated, ctrl-c, graceful shutdown, etc.

Debugging is much harder, non-deterministic

The more you stray from stdlib abstractions, more likely to encounter race conditions

Can't use concurrent.futures map() because of large number of files

Page 73: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multiprocessing

Page 74: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Our Task (the details)

We have one machine we can use, 16 cores, 64GB memory

Our client machine is on the same network as the service with remote files

Approximately one billion files, 100 bytes per file

What client machine will this run on?

What about the network between the client and server?

How many files are on the remote server?

Please have this done as soon as possible

When do you need this done?

Page 75: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multiprocessing

WorkerProcess-1

WorkerProcess-2

WorkerProcess-3Download one page at a time in parallel across multiple processes

Page 76: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multiprocessing

WorkerProcess-1

WorkerProcess-2

WorkerProcess-3Download one page at a time in parallel across multiple processes

Page 77: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multiprocessing

WorkerProcess-1

WorkerProcess-2

WorkerProcess-3Download one page at a time in parallel across multiple processes

Page 78: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multiprocessing

WorkerProcess-1

WorkerProcess-2

WorkerProcess-3Download one page at a time in parallel across multiple processes

Page 79: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

from concurrent import futures

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list'

all_pages = iter_all_pages(list_url) downloader = Downloader(host, port, outdir) with futures.ProcessPoolExecutor() as executor: for page in all_pages: future_to_filename = {} for filename in page: future = executor.submit(downloader.download, filename) future_to_filename[future] = filename for future in futures.as_completed(future_to_filename): future.result()

Page 80: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

from concurrent import futures

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list'

all_pages = iter_all_pages(list_url) downloader = Downloader(host, port, outdir) with futures.ProcessPoolExecutor() as executor: for page in all_pages: future_to_filename = {} for filename in page: future = executor.submit(downloader.download, filename) future_to_filename[future] = filename for future in futures.as_completed(future_to_filename): future.result()

Page 81: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

from concurrent import futures

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list'

all_pages = iter_all_pages(list_url) downloader = Downloader(host, port, outdir) with futures.ProcessPoolExecutor() as executor: for page in all_pages: future_to_filename = {} for filename in page: future = executor.submit(downloader.download, filename) future_to_filename[future] = filename for future in futures.as_completed(future_to_filename): future.result()

Start parallel downloads

Page 82: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

from concurrent import futures

def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list'

all_pages = iter_all_pages(list_url) downloader = Downloader(host, port, outdir) with futures.ProcessPoolExecutor() as executor: for page in all_pages: future_to_filename = {} for filename in page: future = executor.submit(downloader.download, filename) future_to_filename[future] = filename for future in futures.as_completed(future_to_filename): future.result()

Wait for downloads to finish

Page 83: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

def iter_all_pages(list_url): session = requests.Session() response = session.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: yield content['FileNames'] if 'NextFile' not in content: break response = session.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)

Page 84: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

class Downloader: # ...

def download(self, filename): remote_url = f'{self.get_url}/{filename}' response = self.session.get(remote_url) response.raise_for_status() outfile = os.path.join(self.outdir, filename) with open(outfile, 'wb') as f: f.write(response.content)

Page 85: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Multiprocessing Results - 16 processes

Page 86: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.00032 seconds

Multiprocessing Results - 16 processes

Page 87: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.00032 seconds

One billion requests 320,000 seconds

88.88 hours

Multiprocessing Results - 16 processes

Page 88: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.00032 seconds

One billion requests 320,000 seconds

88.88 hours

Multiprocessing Results - 16 processes

3.7 days

Page 89: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Things to keep in mind

Speed improvements due to truly running in parallel

Debugging is much harder, non-deterministic, pdb doesn't work out of the box

IPC overhead between processes higher than threads

Tradeoff between entirely in parallel vs. parallel chunks

Page 90: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Page 91: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Page 92: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Page 93: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Page 94: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Page 95: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Page 96: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Page 97: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Page 98: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Page 99: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Page 100: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Page 101: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 102: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 103: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 104: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 105: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 106: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 107: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 108: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 109: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 110: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 111: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

Page 112: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio

Create an asyncio.Task for each file. This immediately starts the download.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

All in a single process

All in a single thread

Switch tasks when waiting for IO

Should keep CPU busy

Page 113: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

import asynciofrom aiohttp import ClientSessionimport uvloop

async def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' semaphore = asyncio.Semaphore(MAX_CONCURRENT) task_queue = asyncio.Queue(MAX_SIZE) asyncio.create_task(results_worker(task_queue)) async with ClientSession() as session: async for filename in iter_all_files(session, list_url): remote_url = f'{get_url}/{filename}' task = asyncio.create_task( download_file(session, semaphore, remote_url, os.path.join(outdir, filename)) ) await task_queue.put(task)

Page 114: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

import asynciofrom aiohttp import ClientSessionimport uvloop

async def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' semaphore = asyncio.Semaphore(MAX_CONCURRENT) task_queue = asyncio.Queue(MAX_SIZE) asyncio.create_task(results_worker(task_queue)) async with ClientSession() as session: async for filename in iter_all_files(session, list_url): remote_url = f'{get_url}/{filename}' task = asyncio.create_task( download_file(session, semaphore, remote_url, os.path.join(outdir, filename)) ) await task_queue.put(task)

Page 115: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

import asynciofrom aiohttp import ClientSessionimport uvloop

async def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' semaphore = asyncio.Semaphore(MAX_CONCURRENT) task_queue = asyncio.Queue(MAX_SIZE) asyncio.create_task(results_worker(task_queue)) async with ClientSession() as session: async for filename in iter_all_files(session, list_url): remote_url = f'{get_url}/{filename}' task = asyncio.create_task( download_file(session, semaphore, remote_url, os.path.join(outdir, filename)) ) await task_queue.put(task)

Page 116: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

import asynciofrom aiohttp import ClientSessionimport uvloop

async def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' semaphore = asyncio.Semaphore(MAX_CONCURRENT) task_queue = asyncio.Queue(MAX_SIZE) asyncio.create_task(results_worker(task_queue)) async with ClientSession() as session: async for filename in iter_all_files(session, list_url): remote_url = f'{get_url}/{filename}' task = asyncio.create_task( download_file(session, semaphore, remote_url, os.path.join(outdir, filename)) ) await task_queue.put(task)

Page 117: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

import asynciofrom aiohttp import ClientSessionimport uvloop

async def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' semaphore = asyncio.Semaphore(MAX_CONCURRENT) task_queue = asyncio.Queue(MAX_SIZE) asyncio.create_task(results_worker(task_queue)) async with ClientSession() as session: async for filename in iter_all_files(session, list_url): remote_url = f'{get_url}/{filename}' task = asyncio.create_task( download_file(session, semaphore, remote_url, os.path.join(outdir, filename)) ) await task_queue.put(task)

Page 118: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

async def iter_all_files(session, list_url): async with session.get(list_url) as response: if response.status != 200: raise RuntimeError(f"Bad status code: {response.status}") content = json.loads(await response.read()) while True: for filename in content['FileNames']: yield filename if 'NextFile' not in content: return next_page_url = f'{list_url}?next-marker={content["NextFile"]}' async with session.get(next_page_url) as response: if response.status != 200: raise RuntimeError(f"Bad status code: {response.status}") content = json.loads(await response.read())

Page 119: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

async def iter_all_files(session, list_url): async with session.get(list_url) as response: if response.status != 200: raise RuntimeError(f"Bad status code: {response.status}") content = json.loads(await response.read()) while True: for filename in content['FileNames']: yield filename if 'NextFile' not in content: return next_page_url = f'{list_url}?next-marker={content["NextFile"]}' async with session.get(next_page_url) as response: if response.status != 200: raise RuntimeError(f"Bad status code: {response.status}") content = json.loads(await response.read())

Page 120: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

async def download_file(session, semaphore, remote_url, local_filename): async with semaphore: async with session.get(remote_url) as response: contents = await response.read() # Sync version. with open(local_filename, 'wb') as f: f.write(contents) return local_filename

Page 121: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

async def download_file(session, semaphore, remote_url, local_filename): async with semaphore: async with session.get(remote_url) as response: contents = await response.read() # Sync version. with open(local_filename, 'wb') as f: f.write(contents) return local_filename

Page 122: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio Results

Page 123: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.00056 seconds

Asyncio Results

Page 124: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.00056 seconds

One billion requests 560,000 seconds155.55 hours

6.48 days

Asyncio Results

Page 125: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

SummaryApproach SingleRequestTime(s) Days

Synchronous 0.003 34.7

Multithread 0.0036 41.6

Multiprocess 0.00032 3.7

Asyncio 0.00056 6.5

Page 126: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio and Multiprocessing

Page 127: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Asyncio and Multiprocessing

and Multithreading

Page 128: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

Page 129: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

Thread-2

Thread-1

Page 130: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

Page 131: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

Page 132: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

The Input/Output queues contain pagination tokens

foo

Page 133: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

The Input/Output queues contain pagination tokens

foo

Page 134: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

foo

The main thread of the worker process is a bridge to the event loop running on a separate thread. It sends the pagination token to the async Queue.

Page 135: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

foo

The event loop makes the List call with the provided pagination token "foo".

Page 136: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

foo

The event loop makes the List call with the provided pagination token "foo".

{"FileNames": [...], "NextMarker": "bar"}

Page 137: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

bar

The next pagination token "bar", eventually makes its way back to the main process.

Page 138: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

bar

The next pagination token "bar", eventually makes its way back to the main process.

Page 139: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

bar

While another process starts goes through the same steps, WorkerProcess-1 is downloading 1000 files using asyncio.

Page 140: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

bar

Page 141: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

bar

We get to leverage all our cores.1.

Page 142: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

bar

We get to leverage all our cores.1.

We download individual files efficiently with asyncio.

2.

Page 143: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

WorkerProcess-1

EventLoop

Thread-2

Thread-1

Queue

WorkerProcess-2

EventLoop

Thread-2

Thread-1

Queue

Main process

Input Queue

Output Queue

bar

We get to leverage all our cores.1.

We download individual files efficiently with asyncio.

2.

Minimal IPC overhead, only passing pagination tokens across processes, only

one per thousand files.

3.

Page 144: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Combo Results

Page 145: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.0000303 seconds

Combo Results

Page 146: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.0000303 seconds

One billion requests 30,300 seconds

Combo Results

Page 147: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

One request 0.0000303 seconds

8.42 hoursOne billion requests 30,300 seconds

Combo Results

Page 148: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Summary

Approach SingleRequestTime(s) Days

Synchronous 0.003 34.7

Multithread 0.0036 41.6

Multiprocess 0.00032 3.7

Asyncio 0.00056 6.5

Combo 0.0000303 0.35

Page 149: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Tradeoff between simplicity and speed

Multiple orders of magnitude difference based on approach used

Lessons Learned

Need to have max bounds when using queueing or any task scheduling

Page 150: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client

Thanks!

J a m e s S a r y e r w i n n i e @ j s a r y e r