Maintaining Spatial Data Infrastructures (SDIs) 2017 ... Maintaining Spatial Data Infrastructures


  • Maintaining Spatial Data Infrastructures (SDIs) using distributed task queues

    Paolo Corti and Ben Lewis, Harvard Center for Geographic Analysis

    2017 FOSS4G Boston

  • Background

    Harvard Center for Geographic Analysis

    WorldMap: biggest GeoNode instance on the planet

    HHypermap: map service registry


  • Note: Billion Object Platform (BOP)

  • Demo of WorldMap / HHypermap


  • The need for an asynchronous processor

    In WorldMap and HHypermap, users trigger operations that are time consuming and cannot be handled within a web request:

    • Harvest the metadata of a service and its layers
    • Synchronize the metadata of a new or updated layer to the search engine
    • Feed a gazetteer when a new layer is uploaded or updated
    • Upload a spatial dataset to the server
    • Create a new layer using a table join

  • HTTP request/response cycle must be fast

    In web applications the HTTP request/response cycle can be synchronous as long as the interactions between the client and the server are very quick

    Unfortunately, there are cases when the cycle becomes slower

    In these situations the best practice for a web application is to process these tasks asynchronously using a task queue

  • Task Queues

    Asynchronous processing in a web application can be delegated to a task queue, which is a system for parallel execution of tasks in a non-blocking fashion
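    To make "non-blocking" concrete, here is a minimal sketch using only the Python standard library. It is not code from WorldMap or HHypermap: `handle_request`, `slow_upload` and the in-memory `pending` registry are hypothetical names, and a thread pool stands in for a real task queue. The handler hands the slow work off and answers immediately with a task id.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# A small thread pool stands in for the task queue (assumption for this sketch).
executor = ThreadPoolExecutor(max_workers=2)
pending = {}  # task id -> Future, so the client can check on the task later


def slow_upload(dataset_name):
    """Hypothetical long-running operation, simulated with a sleep."""
    time.sleep(0.5)
    return f"{dataset_name} uploaded"


def handle_request(dataset_name):
    """Hypothetical request handler: enqueue the work and return at once."""
    task_id = str(uuid.uuid4())
    pending[task_id] = executor.submit(slow_upload, dataset_name)
    return {"status": 202, "task_id": task_id}


started = time.perf_counter()
response = handle_request("boston_parcels")
elapsed = time.perf_counter() - started  # time spent inside the "request"

# Later (for example from a status endpoint) the result can be collected.
outcome = pending[response["task_id"]].result()
executor.shutdown()
```

    The request returns well before the upload finishes. A real deployment would replace the thread pool with a broker and separate worker processes, so that tasks survive a web server restart.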

  • Asynchronous processing model

    The asynchronous processing model is composed of services that produce processing tasks (producers) and services that consume and process these tasks (consumers)

    A message queue is a broker which facilitates message passing by providing a protocol or interface which other services can access. Work can be distributed across threads or machines

    In the context of a web application the producer is the client application that creates messages based on the user interaction. The consumer is a daemon process that can consume the messages and run the needed process
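    The producer/consumer split described above can be sketched with the standard library alone. This is an illustration under assumptions, not the talk's code: `queue.Queue` plays the role of the broker, two threads play the role of worker daemons, and the message format and layer names are made up.

```python
import queue
import threading

task_queue = queue.Queue()       # stands in for the message broker
results = []                     # filled in by the consumers
results_lock = threading.Lock()
STOP = object()                  # sentinel telling a consumer to exit


def producer(layer_names):
    """Place one message per layer on the queue and return immediately."""
    for name in layer_names:
        task_queue.put({"action": "index_layer", "layer": name})


def consumer():
    """Daemon-style loop: take messages off the queue and process them."""
    while True:
        message = task_queue.get()
        if message is STOP:
            task_queue.task_done()
            break
        processed = f"{message['action']}:{message['layer']}"
        with results_lock:
            results.append(processed)
        task_queue.task_done()


# Two consumers share the same queue, so work is distributed across threads.
workers = [threading.Thread(target=consumer) for _ in range(2)]
for w in workers:
    w.start()

producer(["roads", "rivers", "parcels"])

# One sentinel per consumer, then wait for the queue to drain.
for _ in workers:
    task_queue.put(STOP)
task_queue.join()
for w in workers:
    w.join()

print(sorted(results))
```

    Which consumer handles which message is not deterministic, which is why the results are sorted before printing.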

  • Glossary

    • Task Queue: a system for parallel execution of tasks in a non-blocking fashion
    • Broker or Message Queue: provides a protocol or interface for exchanging messages between different services and applications
    • Producer: the code that places the tasks to be executed later in the broker
    • Consumer or Worker: takes tasks from the broker and processes them
    • Exchange: takes a message from a producer and routes it to zero or more queues (message routing)

    Tasks must be consumed faster than they are produced. If not, add more workers
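    The "zero or more queues" behaviour of an exchange can be illustrated with a toy in-memory router. This is a deliberately simplified sketch: the `DirectExchange` class, the routing keys and the queue names are hypothetical, and real brokers such as RabbitMQ add persistence, acknowledgements and other exchange types.

```python
from collections import defaultdict, deque


class DirectExchange:
    """Toy exchange: delivers a message to every queue bound to its routing key."""

    def __init__(self):
        self.bindings = defaultdict(list)  # routing key -> list of queues

    def bind(self, routing_key, q):
        self.bindings[routing_key].append(q)

    def publish(self, routing_key, message):
        """Route the message and return the number of queues that received it."""
        targets = self.bindings.get(routing_key, [])
        for q in targets:
            q.append(message)
        return len(targets)


harvest_queue = deque()
index_queue = deque()
audit_queue = deque()

exchange = DirectExchange()
exchange.bind("harvest", harvest_queue)
exchange.bind("index", index_queue)
exchange.bind("index", audit_queue)   # two queues share one routing key

delivered_harvest = exchange.publish("harvest", "harvest service 42")
delivered_index = exchange.publish("index", "index layer roads")
delivered_unknown = exchange.publish("thumbnail", "render layer roads")
```

    Here the first message reaches one queue, the second reaches two, and the third reaches none because no queue is bound to its key.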

  • Use cases for task queues

    • In web applications, some process takes too much time and must be run asynchronously
    • Heterogeneous applications/services in a given system architecture need an easy way to reliably communicate with each other
    • Periodic operations (vs crontab)
    • A way of parallelizing tasks across multiple processors
    • Monitor processes and analyze failing tasks (and execute them again)

  • Typical use cases for a task queue in a web application

    • Thumbnail generation
    • Sending bulk email
    • Fetching large amounts of data from APIs
    • Performing time-intensive calculations
    • Expensive queries
    • Search engine index synchronization
    • Interaction with another application/service
    • Replacing cron jobs (backups, maintenance, etc.)

  • Typical use cases for a task queue in a GIS Portal/SDI

    • Upload a shapefile to the server (GeoNode)
    • Thumbnail generation for layers and maps (GeoNode)
    • OGC services harvesting (Harvard Hypermap)
    • Geoprocessing operations
    • Geospatial data maintenance

  • Producer, broker and consumer architecture










  • Message broker implementations

    Most of them are open source!

    • RabbitMQ (AMQP, STOMP, JMS)
    • Apache ActiveMQ (STOMP, JMS)
    • Amazon Simple Queue Service (JMS)
    • Apache Kafka

    Several standard protocols: AMQP, STOMP, JMS, MSMQ (Microsoft .NET)

  • Task (job) queue implementations

    • Celery (RabbitMQ, Redis, Amazon SQS, Zookeeper)
    • Redis Queue (Redis)
    • Resque (Redis)
    • Kue (Redis)

    And many others!

  • Celery

    • Asynchronous task queue based on distributed message passing
    • Focused on real-time operation, but supports scheduling as well
    • The execution units, called tasks, are executed concurrently on one or more worker servers
    • Supports many message brokers (RabbitMQ, Redis, MongoDB, CouchDB, ...)
    • Written in Python, but it can operate with other languages
    • Great integration with Django!
    • Great monitoring tools (Flower, django-celery-results)

  • RabbitMQ

    • RabbitMQ is a message broker: it accepts and forwards messages
    • Most widely deployed open source broker (35k+ deployments)
    • Supports many message protocols
    • Supported by many operating systems and languages
    • Written in Erlang

  • Architecture of Celery/RabbitMQ

  • A real use case: Harvard Hypermap

    HHypermap (Harvard Hypermap) Registry is a platform that manages OWS, Esri REST, and other types of map service harvesting and orchestration, and maintains uptime statistics for services and layers. Where possible, layers are cached by MapProxy.

    HHypermap provides thousands of remote layers to WorldMap users

  • Harvard Hypermap / WorldMap architecture

  • HHypermap interface

  • Need for a task queue


  • Producer

    Is the code that places the tasks to be executed later in the broker

  • Celery messages

  • Consumer

    Takes tasks from the broker and processes them in a worker

  • Replacing cron jobs

  • Workers and threads with htop

  • Monitoring

  • Monitoring a task

  • Thanks!

    Questions and answers