GeoKettle: A powerful spatial ETL tool for feeding your Spatial Data Infrastructure (SDI)

Dr. Thierry Badard, CTO, Spatialytics
http://www.spatialytics.com

FOSS4G 2011 Workshop, Denver, CO, USA, September 12, 2011

Preamble

• These slides constitute the training material used for the GeoKettle workshop given by Spatialytics during the FOSS4G 2011 conference

• They are available online in PDF format:
  – http://www.spatialytics.org/files/foss4g2011/geokettle-workshop.pdf

• They are released under the terms of the Creative Commons CC-BY-SA license.

Contents

• What is GeoKettle?
• Basic features of GeoKettle
• Installing GeoKettle
• Spatial features of GeoKettle
• Practical learning: Exercises
• Conclusion

What is GeoKettle?

• It is an open source spatial ETL tool
• It is part of the geospatial BI software stack initially developed by the GeoSOA research group at Laval University in Quebec
• The stack is now developed and supported by Spatialytics:
  – http://www.spatialytics.org (open source community)
  – http://www.spatialytics.com (professional support, training and services, as well as Enterprise Editions which include support)
• The stack comprises:
  – GeoKettle
  – GeoMondrian
  – SOLAPLayers/GeoBIExt

What is Geospatial BI (GeoBI)?

• Want to know more about GeoBI and what this type of application can do for you?
  – Please attend my presentation entitled “Building professional geo-analytical dashboards and reports with GeoBIExt”
    Time slot: Friday, 11:00am - 11:30am
    Room: Denver
• In this workshop, we will focus on GeoKettle's capabilities and how it can ease your everyday work with geospatial data, SDI, web services, GIS formats, spatial databases, ...

What is an ETL tool?

• A type of software used to populate databases or data warehouses from heterogeneous data sources

• ETL stands for:
  – Extract: extract data from data sources
  – Transform: transform the data in order to correct errors, cleanse the data, change the data structure, make the data compliant with defined standards, etc.
  – Load: load the transformed data into a target DBMS, service, file format, ...

• An ETL tool should manage the insertion of new data and the updating of existing data

• It should be able to perform transformations from:
  – An OLTP system to another OLTP system
  – An OLTP system to an analytical data warehouse
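
The three ETL phases and the insert-or-update requirement can be sketched in a few lines of Python. This is only an illustration using the standard library, not GeoKettle code: the CSV content, the `city` table and the function names are hypothetical, and SQLite's `INSERT OR REPLACE` stands in for whatever update method the target DBMS offers.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export with inconsistent casing and spacing.
SOURCE_CSV = """id,name,population
1, toronto ,2600000
2,OTTAWA,880000
"""

def extract(text):
    """Extract: read rows from a data source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean the values and cast them to the target types."""
    for row in rows:
        yield (int(row["id"]), row["name"].strip().title(), int(row["population"]))

def load(conn, records):
    """Load: insert new rows and replace rows whose key already exists."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS city "
        "(id INTEGER PRIMARY KEY, name TEXT, population INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO city VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
# A second load with a changed value updates the existing row instead of duplicating it.
load(conn, [(2, "Ottawa", 900000)])
rows = conn.execute("SELECT id, name, population FROM city ORDER BY id").fetchall()
print(rows)
```

In an ETL tool the same three phases are configured as steps in a GUI rather than written by hand.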

Why use an ETL tool?

• Automation of complex and repetitive data processing without producing any specific code

• Conversion between various data formats

• Migration of data from one DBMS to another

• Data feeding into various DBMS

• Population of analytical data warehouses for decision support purposes

• etc.

GeoKettle

• A "spatially-enabled" version of Pentaho Data Integration (Kettle)

• Kettle is a metadata-driven ETL with direct execution of transformations
  – No intermediate code generation!

• Kettle supports several DBMS and file formats
  – DBMS support: MySQL, PostgreSQL, Oracle, DB2, MS SQL Server, ... (37 in total)
  – Read/write support of various data file formats: text, Excel, Access, DBF, XML, …
  – Various services/systems: LDAP, CRM, ...

• Numerous transformation steps
  – A transformation is built in a GUI and can be seen as a chain of transformation steps

• Methods for the updating of databases and DW

GeoKettle

• GeoKettle provides a true and consistent integration of the spatial component
  – All steps provided by Kettle are able to deal with geospatial data types
  – Some dedicated geospatial steps have been added (SRS, SOS, CSW, Spatial Analysis, …)
  – This allows a powerful integration of corporate and spatial data

• First release in May 2008: 2.5.2-20080531
• Version 3.2.0-20090706 in July 2009
• Current stable version: 2.0 stable (Sept. 2011)
• Released under LGPL
• Used in different organizations and countries:
  – Some ministries, public bodies, utilities, banks, insurance companies, integrators, …

• A growing community of users and contributors

GeoKettle – Online resources

• GeoKettle project page

http://www.spatialytics.org/projects/geokettle/

Shortcut: http://www.geokettle.org

• GeoKettle documentation (wiki)

http://wiki.spatialytics.org/doku.php?id=projects:geokettle

• GeoKettle forum

http://www.spatialytics.com/forum

• GeoKettle Trac

http://trac.spatialytics.com/geokettle

• GeoKettle plugins

http://trac.spatialytics.com/geokettle/wiki/Plugins

Introduction to basic features of GeoKettle

Transformations (1/3)

• The ETL processes are named transformations

• Elements of a transformation are steps
• Links between steps are hops
• Parallel execution (threads) of steps

[Figure: transformation diagram with its steps and hops labelled]

Transformations (2/3)

• Steps have configuration parameters (double-click the step icon to open the dialog box):
  – DB connection
  – Filename to open
  – Query filter
  – Source code of a script (JavaScript)
  – ...

• Step categories:
  – input
  – output
  – transformations
  – flow
  – scripting
  – ...

Transformations (3/3)

• Hops link steps together and define the data flow
• To create a hop: drag and drop from one step to another with the middle mouse button pressed (or Shift + left button)
• In a hop:
  – data flows from the output of a step to the input of the next step, row by row
  – the field definition (number, names and types) is always the same from one row to another
• Different hop types:
  – copy
  – distribute
  – conditional output
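
The difference between the copy and distribute hop types can be illustrated with a small Python sketch. It assumes that "distribute" deals rows to the target steps in turn (round robin); the function names and the list-based "steps" are hypothetical stand-ins, not GeoKettle APIs.

```python
from itertools import cycle

def copy_hop(rows, targets):
    """Copy: every target step receives every row."""
    for row in rows:
        for target in targets:
            target.append(row)

def distribute_hop(rows, targets):
    """Distribute: rows are dealt to the target steps in turn (assumed round robin)."""
    for row, target in zip(rows, cycle(targets)):
        target.append(row)

# Four rows sharing the same field definition, as required within a hop.
rows = [{"id": i} for i in range(4)]

copy_a, copy_b = [], []
copy_hop(rows, [copy_a, copy_b])

dist_a, dist_b = [], []
distribute_hop(rows, [dist_a, dist_b])

print(len(copy_a), len(copy_b))    # both targets hold all four rows
print([r["id"] for r in dist_a])   # every other row
print([r["id"] for r in dist_b])
```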

Jobs

• A job defines a series of job entries (tasks) to run sequentially

• These tasks can be:
  – transformations
  – SQL queries
  – file operations (copy, delete, upload, etc.)
  – conditional tests
  – scripts (shell, JavaScript)
  – e-mailing operations (send / receive emails)
  – other jobs
  – etc.

The different GeoKettle tools

• Spoon: GUI for editing transformations and jobs

• Pan: command line interface for running transformations

• Kitchen: command line interface for running jobs

• Carte: web service for the remote execution of transformations and jobs
  – Allows transformations and data integration processes to be exposed and run as web services
  – Remote execution, and running transformations in a cluster environment (i.e. in the cloud)

Repository

• Transformations and jobs are usually saved in XML files (.ktr/.kjb)

• Alternatively, they can be saved in a database repository and hence be shared between users more easily
  – Transformations, jobs and DBMS connection parameters are stored in a dedicated database
  – See the first pop-up window when running GeoKettle

• This enables the preservation/centralisation of knowledge about data integration processes inside the company
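
Because transformations are saved as XML, they can be inspected with any XML parser. The sketch below is a hedged illustration: the embedded fragment is hand-written and heavily simplified, and its element layout (`step/name`, `order/hop/from`, `order/hop/to`) is an assumption about the .ktr format that should be checked against a file saved by Spoon.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment (assumed layout); real .ktr files hold many more elements.
KTR = """<transformation>
  <info><name>ex_0</name></info>
  <order>
    <hop><from>Shapefile File Input</from><to>Set SRS</to></hop>
    <hop><from>Set SRS</from><to>Table Output</to></hop>
  </order>
  <step><name>Shapefile File Input</name></step>
  <step><name>Set SRS</name></step>
  <step><name>Table Output</name></step>
</transformation>"""

root = ET.fromstring(KTR)
# Steps are the elements of the transformation, hops the links between them.
steps = [step.findtext("name") for step in root.findall("step")]
hops = [(hop.findtext("from"), hop.findtext("to")) for hop in root.findall("order/hop")]

print(steps)
print(hops)
```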

Installing & compiling GeoKettle

Compiling GeoKettle?

• To get all the latest features of GeoKettle
  – Get the source code and compile GeoKettle!

• Requirements:
  – Subversion client (Eclipse Subversive or TortoiseSVN)
  – Java JDK version 5 or higher
  – Apache Ant (http://ant.apache.org)

• 3 steps:
  % svn co http://dev.spatialytics.com/svn/geokettle-2.0/trunk geokettle
  % cd geokettle
  % ant

• Optionally:
  % ant zip
  to build a binary distribution archive of GeoKettle
  % ant zip-plugins
  to build a binary distribution archive of GeoKettle including selected plugins

Installation procedure

• Available (2.0-RC1) on the OSGeo Live DVD, but we will use the 2.0 stable version in the workshop

• Very simple installation procedure without the installer
  – See the documentation on the GeoKettle wiki

• Even simpler with the new installer!

• Prerequisites:
  – All you need is a Java Runtime Environment
  – Version 5.0 or higher

• Start the OSGeo Live Virtual Machine (if not already done)

• Download and start the installer inside the VM:
  – http://sourceforge.net/projects/geokettle/files/geokettle-2.x/2.0/

• When done, double-click the GeoKettle icon on the desktop to run it
  – Please wait for instructions when the first window (repository selection) pops up!

Spatial features of GeoKettle

Transparent spatial support

• Consistent and transparent integration of the geometry data types:
  – Vector geometry (based on JTS – point-line-polygon model)
  – Transparent conversions between data types:
    • Geometry ↔ String: from and to WKT
    • Geometry ↔ Binary: from and to WKB
  – Native I/O support for some spatial DBMS (via JDBC or through GDAL/OGR)
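
To make the two encodings concrete, here is a minimal Python sketch that writes a 2D point as WKT and as little-endian WKB and reads the WKB back. It covers points only and uses just the standard library; the function names are hypothetical and this is not how GeoKettle (which relies on JTS) implements the conversions.

```python
import struct

def point_to_wkt(x, y):
    """Text encoding of a 2D point."""
    return f"POINT ({x} {y})"

def point_to_wkb(x, y):
    """Binary encoding: byte order flag (1 = little-endian), geometry type (1 = Point), x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)

def wkb_to_point(wkb):
    """Decode a little-endian WKB point back to (x, y)."""
    if wkb[0] != 1:
        raise ValueError("only little-endian WKB is handled in this sketch")
    _, geom_type, x, y = struct.unpack("<BIdd", wkb)
    if geom_type != 1:
        raise ValueError("only Point geometries are handled in this sketch")
    return x, y

wkt = point_to_wkt(-79.38, 43.65)
wkb = point_to_wkb(-79.38, 43.65)
print(wkt)
print(wkb.hex())
print(wkb_to_point(wkb))
```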

Inputs / outputs

• Read/write support:

  – Spatial DBMS:
    • PostgreSQL/PostGIS (native)
    • MySQL spatial (native)
    • Oracle Spatial / Locator (native)
    • ESRI personal geodatabase*, Ingres*, Informix datablade*, ArcSDE*, SQLite/SpatiaLite (through GDAL/OGR)
      * requires valid licenses and GDAL/OGR re-compilation
    • MS SQL Server 2008, IBM DB2, … (non native, requires hints)

  – GIS file formats:
    • ESRI Shapefile, GML 3.1.1, KML 2.2
    • And all GIS file formats provided by GDAL/OGR
      – Arc/Info, GeoJSON, GeoConcept, GeoRSS, GML 2.x, GPX, KML 2.0, ...

Inputs / outputs

• Read/write support:

  – Geospatial web services:
    • CSW
    • SOS (read only)
    • No dedicated steps yet but possible:
      – WFS, WMS, WPS, …
      – We will see how in this workshop! ;-)

• On the fly preview/geopreview
  – Lets you check whether a transformation produces the expected results on a smaller dataset
  – Offers different widgets: pan, zoom, get object attributes, symbolization (color, opacity, ...)
  – Can preview streams with more than one geometry column

Spatial analysis

• Accessing and processing Geometry objects in JavaScript
  – Based on Mozilla Rhino (http://www.mozilla.org/rhino)
  – It allows the user to define custom transformation steps (“Modified JavaScript Value” step)
  – JTS (Java Topology Suite) and Sextante API fully available!
  – JCS (Java Conflation Suite) processing capabilities should be available soon …

• Spatial analysis functions
  – Topological predicates: intersects, touches, within, …
    • Join and Filtering steps
  – Spatial functions: union, intersection, length, buffer, ...
    • Modified JavaScript Value, Spatial Analysis and Calculator steps
  – Aggregative operators: union, geometry collection, bounding box, …
    • Group by step
  – Advanced geoprocessing: Delaunay, remove holes, simplify, smooth, ...
    • Sextante plugin
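
As an illustration of what a topological predicate computes, the following Python sketch tests whether a point lies within a polygon ring using even-odd ray casting. It is a standalone approximation, not JTS code: it ignores holes, and points exactly on the boundary are not treated consistently. The sample square and points are made up.

```python
def within(point, ring):
    """Return True if the point lies inside the polygon ring (even-odd ray casting)."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
points = [(2, 2), (5, 2), (1, 3)]

# Used as a filter, the predicate keeps only the rows whose geometry satisfies it.
selected = [p for p in points if within(p, square)]
print(selected)
```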

SRS & coordinates transformation

• Native support of Spatial Reference Systems (SRS) in the metadata of the Geometry fields (based on the GeoTools referencing library)

• Coordinate transformation / change of Spatial Reference System
  – SRS Transformation step

• Assign an SRS to a data flow
  – Set SRS step

• Reading and writing of SRS metadata
  – Read SRS from data source: databases and GIS file formats
  – Validation of SRS when inserting data into PostGIS and Oracle
    • Other DBMS do not support this feature yet!
  – Add the SRS info when writing data into GIS file formats

[Figure: coordinate transformation between geographic (φ,λ) and projected (x,y) coordinates]
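
A coordinate transformation maps geographic coordinates (φ,λ) to projected coordinates (x,y) and back. The Python sketch below uses the spherical Web Mercator formulas only because they are short; it is not the UTM projection used later in the exercises, and it is not how the GeoTools referencing library is implemented. Function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used here as the sphere radius

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Geographic (lambda, phi) in degrees -> projected (x, y) in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def web_mercator_to_lonlat(x, y):
    """Projected (x, y) in metres -> geographic (lambda, phi) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat

x, y = lonlat_to_web_mercator(-79.38, 43.65)
lon, lat = web_mercator_to_lonlat(x, y)
print(round(x, 2), round(y, 2))
print(round(lon, 6), round(lat, 6))
```

Converting forward and back, as Exercise 1 does around the area computation, should return the original coordinates up to floating-point error.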

Practical learning: Exercises!

Before beginning the exercises ...

• Start the OSGeo Live Virtual Machine (if not already done) and log in

• Download the archive containing data and solutions to the different exercises of this workshop
  – http://www.spatialytics.org/files/foss4g2011/geokettle-workshop.zip
  – Unzip the archive on your Desktop
  – It contains 3 subdirectories:
    • data
      – input
      – output
    • solutions
      – transformations
        » exercise_0 to exercise_9
    • transformations

• We are now ready!

Exercise 0

• We will do this first exercise all together, step by step, in order to discover GeoKettle

• The aim of this exercise is to learn how to load an ESRI shapefile into a PostGIS database and have it published properly in GeoServer

• In this exercise we will play with the following new steps:
  – Shapefile File Input
  – Set SRS
  – Select Values
  – Add sequence
  – Table Output

Exercise 0

• Design a transformation that:
  – Reads the Shapefile contained in the ontario_names_shp data directory. It is a set of points that locate geographical names for the whole province of Ontario in Canada (source: Geobase, http://www.geobase.ca)
  – Assigns the EPSG 4326 SRS code (WGS 84) to the data
  – Filters the stream in order to preserve only the the_geom, GEONAME, FEATUREID, CONCISTERM, GENERITERM and REGIONNAME attributes
  – Adds an identifier (numeric incremental id) to objects
  – Stores the data in a geonames table of a geokettle database on your PostgreSQL/PostGIS instance
  – Finally, publish it in GeoServer
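
The effect of the Select Values and Add sequence steps on the row stream can be sketched with Python generators. The attribute names come from the exercise; the sample rows, the `EXTRA` field and the function names are hypothetical, and the sketch says nothing about how the steps are configured in Spoon.

```python
KEEP = ["the_geom", "GEONAME", "FEATUREID", "CONCISTERM", "GENERITERM", "REGIONNAME"]

def select_values(rows, keep):
    """Keep only the listed fields, row by row."""
    for row in rows:
        yield {name: row[name] for name in keep}

def add_sequence(rows, field="id", start=1):
    """Prepend an incremental numeric identifier to each row."""
    for value, row in enumerate(rows, start):
        yield {field: value, **row}

# Made-up rows standing in for the shapefile content.
source_rows = [
    {"the_geom": "POINT (-79.38 43.65)", "GEONAME": "Toronto", "FEATUREID": "a1",
     "CONCISTERM": "City", "GENERITERM": "City", "REGIONNAME": "Ontario", "EXTRA": 1},
    {"the_geom": "POINT (-75.70 45.42)", "GEONAME": "Ottawa", "FEATUREID": "b2",
     "CONCISTERM": "City", "GENERITERM": "City", "REGIONNAME": "Ontario", "EXTRA": 2},
]

result = list(add_sequence(select_values(source_rows, KEEP)))
for row in result:
    print(row)
```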

Exercise 0 – Solution

Exercises

• From this point, do the exercises by yourself
• The exercises become progressively more difficult
• The aim is not to follow the procedures mentioned in the exercises step by step
• We want you to become more and more efficient, autonomous and aware of how to do tasks in GeoKettle
• That's why instructions will be less and less detailed as we progress through the exercises

Exercise 1

• The aim of this exercise is to learn how to perform some basic computation (computing the area of polygons) with GeoKettle

• In this exercise we will play with the following new steps:
  – SRS Transformation
  – Calculator
  – Modified JavaScript Values

Exercise 1

• Based on the previous transformation, design a new one that:
  – Reads the Shapefile contained in the ontario_mrc_shp data directory. It is a set of polygons that represent some counties in the province of Ontario in Canada (source: Geobase, http://www.geobase.ca)
  – Converts the coordinates of the data from WGS84 to NAD83 (CSRS) / UTM Zone 17N
  – Computes the area of each polygon and adds the value in a new field named area_meters
  – Converts, by scripting, the area_meters values from m² to km² and stores the result in a new field named area
  – Filters the stream in order to preserve only the the_geom, COMMONAME1, LEGALNAME1 and DESIGNATN attributes, renaming them respectively as the_geom, name, county_name and designation
  – Converts the coordinates back to WGS84
  – Adds an identifier (numeric incremental id) to objects
  – Stores the data in a municipalities table of a geokettle database on your PostgreSQL/PostGIS instance
  – Finally, publish it in GeoServer
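
The reprojection to UTM before the area computation matters because an area computed on coordinates in degrees has no meaningful unit. The Python sketch below shows the computation on planar coordinates in metres using the shoelace formula, followed by the m² to km² conversion. The rectangle is made up, holes are ignored, and the function names are hypothetical; in GeoKettle these operations are done with the Calculator and Modified JavaScript Values steps.

```python
def ring_area_m2(ring):
    """Shoelace formula; coordinates must be planar and expressed in metres."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def m2_to_km2(area_m2):
    """1 km2 = 1,000,000 m2."""
    return area_m2 / 1_000_000.0

# A 2 km x 3 km rectangle in projected (UTM-like) coordinates.
ring = [(0.0, 0.0), (2000.0, 0.0), (2000.0, 3000.0), (0.0, 3000.0)]

area_meters = ring_area_m2(ring)
area = m2_to_km2(area_meters)
print(area_meters, area)
```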

Exercise 1

• Run this transformation in Spoon in order to test it
• When finished, try to run it with the pan command line tool

Exercise 1 – Solution

Exercise 1 - Solution

./pan.sh -file="/home/user/Desktop/geokettle_workshop/solutions/transformations/exercise_1/ex_1.ktr"

Exercise 2

• The aim of this exercise is to learn:
  – A way to perform spatial selection over geospatial features in GeoKettle
  – How to aggregate data in order to compute statistics and export them to an MS Excel file
  – How to create a job that performs the two previous tasks sequentially

• In this exercise we will play with the following new steps/job entries:
  – Filter rows
  – Join rows (cartesian product)
  – OGR File Input
  – Sort rows
  – Group by
  – Excel Output
  – Transformation

Exercise 2 – Part 1

• Design a transformation that:
  – Reads data from the previous municipalities table and extracts the the_geom and name fields as muni_geom and muni_name
  – Filters rows in order to keep only the county of Durham
  – In parallel, reads data from a MapInfo TAB file located in the ontario_rrn_tab directory. It is an extract of the national road network from Geobase.ca
  – Selects only the roads that intersect the Durham county
  – Sets the SRS of the data to WGS84
  – Filters the stream in order to preserve only the the_geom, ROADSEGID, ROADCLASS, RTNUMBER1 and RTENAME1EN attributes, renaming them respectively as the_geom, id, class, number and name
  – Adds an identifier (numeric incremental id) to objects
  – Stores the data in a roads table of a geokettle database on your PostgreSQL/PostGIS instance
  – Finally, publish it in GeoServer

Exercise 2 – Part 2

• Design a transformation that:
  – Reads data from the previously created roads table
  – Converts the coordinates of the data from WGS84 to NAD83 (CSRS) / UTM Zone 17N
  – Computes, by script only, the length in km of each road segment and adds the value in a new field named length
  – Aggregates (sum) the length values of all roads of the same class and stores the total in a new field named total_length
  – Finally, exports the aggregated data to an Excel file
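
The length computation and the per-class aggregation can be sketched in Python as follows. The road rows and class names are made up, coordinates are assumed to be planar and in metres (as they are after the UTM reprojection), and a dictionary stands in for the Group by step.

```python
import math
from collections import defaultdict

def length_km(line):
    """Sum the segment lengths of a line given as planar (x, y) points in metres."""
    metres = sum(math.dist(a, b) for a, b in zip(line, line[1:]))
    return metres / 1000.0

# Made-up road segments in projected coordinates.
roads = [
    {"class": "Freeway", "the_geom": [(0, 0), (3000, 4000)]},
    {"class": "Local", "the_geom": [(0, 0), (0, 1500)]},
    {"class": "Freeway", "the_geom": [(0, 0), (1000, 0), (1000, 1000)]},
]

# Group by class and sum the lengths.
total_length = defaultdict(float)
for road in roads:
    total_length[road["class"]] += length_km(road["the_geom"])

for road_class, total in sorted(total_length.items()):
    print(road_class, total)
```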

Exercise 2 – Job

• Design a job that performs the two previous tasks sequentially
• Run it in Spoon
• Also try to run it with the Kitchen command line tool

Exercise 2 – Part 1: Solution

Exercise 2 – Part 2: Solution

Exercise 2 – Job: Solution

Exercise 2 - Solution

./kitchen.sh -file="/home/user/Desktop/geokettle_workshop/solutions/transformations/exercise_2/ex_2.kjb"

Exercise 3

• The aim of this exercise is to learn how to:
  – retrieve data from a WFS service
  – perform geoprocessing operations with the Sextante plugin
  – export the result to two different file formats: KML and MapInfo

• In this exercise we will play with the following new steps/job entries:
  – Sextante plugin
  – OGR Output
  – KML Output
  – HTTP

Exercise 3 – Job

• Design a job that:
  – Requests the municipalities data in GML 2 from the GeoServer WFS hosted on your VM. Use the layer preview in GeoServer in order to retrieve the GET request to send.
  – Runs a transformation that we will define in the next slide
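
For reference, a WFS GetFeature request sent with GET is just a URL with key-value parameters. The Python sketch below assembles such a URL; the host, port and the `geokettle:municipalities` layer name are assumptions for illustration, so use the request shown by the GeoServer layer preview on your VM rather than this one.

```python
from urllib.parse import urlencode

def wfs_getfeature_url(base_url, type_name, output_format="GML2", version="1.0.0"):
    """Build a WFS GetFeature request using key-value-pair (GET) encoding."""
    params = {
        "service": "WFS",
        "version": version,
        "request": "GetFeature",
        "typeName": type_name,
        "outputFormat": output_format,
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint and layer name.
url = wfs_getfeature_url("http://localhost:8080/geoserver/wfs", "geokettle:municipalities")
print(url)
```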

Exercise 3 – Transformation

• Design a transformation that:
  – Reads the GML file extracted from the WFS
  – Removes holes from the polygons and stores the new geometry of objects in a result_geom field
  – Filters the stream in order to preserve only the gml_id, name, county_name, designation, area and result_geom fields
  – Keeps only the rows that have a valid, non-null geometry
  – Stores the resulting stream in a KML file and a MapInfo MIF/MID file

Exercise 3 – Job: Solution

Exercise 3 – Transform.: Solution

Exercise 4

• The aim of this exercise is to learn how to extract some POI from an OSM data file

• Listen to the instructor, who will explain how an OSM data file is structured

• In this exercise we will play with the following new step:
  – Get data from XML

Exercise 4

• Design a transformation that:
  – Extracts POI data from the OSM data file located in the ottawa_osm directory
  – Sets the SRS of the data to WGS84
  – Exports the result as an ESRI shapefile
  – Finally, publish it in GeoServer
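
In an OSM XML file, points are `node` elements carrying `lat`/`lon` attributes and optional `tag` children with `k`/`v` attributes. The Python sketch below extracts tagged nodes as WKT points. The embedded sample is hand-written, and treating every node with an `amenity` tag as a POI is an assumption made for illustration; the workshop solution uses the Get data from XML step instead.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the OSM XML layout.
OSM = """<osm version="0.6">
  <node id="1" lat="45.4215" lon="-75.6972">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Example Cafe"/>
  </node>
  <node id="2" lat="45.4300" lon="-75.7000"/>
  <node id="3" lat="45.4100" lon="-75.6900">
    <tag k="amenity" v="library"/>
  </node>
</osm>"""

def extract_pois(osm_xml, key="amenity"):
    """Return the nodes carrying the given tag key, with a WKT point geometry."""
    pois = []
    for node in ET.fromstring(osm_xml).iter("node"):
        tags = {tag.get("k"): tag.get("v") for tag in node.findall("tag")}
        if key in tags:
            pois.append({
                "id": int(node.get("id")),
                "name": tags.get("name"),
                key: tags[key],
                # WKT uses x y order, i.e. longitude then latitude.
                "the_geom": f"POINT ({node.get('lon')} {node.get('lat')})",
            })
    return pois

pois = extract_pois(OSM)
for poi in pois:
    print(poi)
```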

Exercise 4 – Solution

Exercise 5

• The aim of this exercise is to learn how to:
  – Extract sensor data from an SOS
  – Perform some spatial computation with the Spatial Analysis step
  – Retrieve some metadata on the data stream
  – Push this metadata to a CSW

• Listen to the instructor, who will explain how to proceed with the SOS and CSW steps

• In this exercise we will play with the following new steps:
  – SOS Input
  – Spatial Analysis
  – CSW Output

Exercise 5

• Design a transformation that:
  – Retrieves GAUGE_HEIGHT measures from the SOS service given by the instructor
  – Removes rows where the measure has a value <= 30
  – Groups rows by procedure
  – Computes the envelope of each resulting geometry
  – Retrieves and sets some mandatory metadata (MD_METADATA profile)
  – Finally, publishes the metadata in GeoNetwork
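
The filter, group and envelope part of this transformation can be sketched in Python as follows. The measures, field names and procedure identifiers are made up; the envelope is returned as (min x, min y, max x, max y).

```python
from collections import defaultdict

# Made-up measures standing in for the SOS response.
measures = [
    {"procedure": "gauge-1", "value": 42.0, "x": -75.70, "y": 45.40},
    {"procedure": "gauge-1", "value": 12.0, "x": -75.90, "y": 45.10},  # removed: <= 30
    {"procedure": "gauge-1", "value": 35.5, "x": -75.60, "y": 45.45},
    {"procedure": "gauge-2", "value": 30.0, "x": -76.00, "y": 44.90},  # removed: <= 30
    {"procedure": "gauge-2", "value": 31.0, "x": -76.10, "y": 44.95},
]

def envelope(points):
    """Bounding box of a list of (x, y) points as (min_x, min_y, max_x, max_y)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

# Remove rows whose value is <= 30, then group the remaining locations by procedure.
kept = [m for m in measures if m["value"] > 30]
groups = defaultdict(list)
for m in kept:
    groups[m["procedure"]].append((m["x"], m["y"]))

envelopes = {procedure: envelope(points) for procedure, points in groups.items()}
print(envelopes)
```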

Exercise 5 – Solution

Exercise 6

• The aim of this exercise is to learn how to harvest metadata from a CSW-compliant service

• In this exercise we will play with the following new steps:
  – CSW Input
  – Dummy

Exercise 6

• Design a transformation that:
  – Harvests metadata from the geocat.ch online catalog
  – Filters metadata that deal with datasets
  – For each metadata row, computes by script the extent of the dataset
  – Exports the BriefRecord_title, BriefRecord_type and the extent to a new PostGIS table named meta_extent
  – Finally, publish this new table in GeoServer

Exercise 6 – Solution

Exercise 7

• The aim of this exercise is to learn how to call a process hosted in a WPS-compliant service

• In this exercise, we will create a new layer from our polygon layer (municipalities) hosted in GeoServer by applying a Centroid WPS service to each polygon

• In this exercise we will play with the following new steps:
  – Add constants
  – HTTP Client

Exercise 7 – Job

• Design a job that:
  – Requests the municipalities data in GML 2 from the GeoServer WFS hosted on your VM. Use the layer preview in GeoServer in order to retrieve the GET request to send.
  – Runs a transformation that we will define in the next slide

Exercise 7 – Transformation

• Design a transformation that:
– Reads the GML file extracted from the WFS
– For each row, calls the Centroid service hosted in the Zoo WPS instance on your VM
– Stores the result in a new table named muninames in your PostGIS DBMS instance
– Finally, publishes it in GeoServer
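A WPS 1.0.0 Execute call can be sent as a plain GET request, which is why an HTTP client step is enough to reach the service. The Python sketch below assembles such a URL with the input passed by reference. The endpoint path, the input name InputPolygon and the data URL are assumptions for illustration; check the DescribeProcess response of your Zoo instance for the actual input name.

```python
from urllib.parse import quote, urlencode


def wps_execute_url(base_url, identifier, input_name, data_url):
    """Build a WPS 1.0.0 Execute GET request (hypothetical helper)."""
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "Identifier": identifier,
    }
    # The input is passed by reference: the WPS fetches the data from data_url.
    # The referenced URL is percent-encoded once so it survives inside the query.
    data_inputs = f"{input_name}=Reference@xlink:href={quote(data_url, safe='')}"
    return f"{base_url}?{urlencode(params)}&DataInputs={data_inputs}"


if __name__ == "__main__":
    # Assumed endpoint, input name and data URL.
    print(wps_execute_url("http://localhost/cgi-bin/zoo_loader.cgi",
                          "Centroid",
                          "InputPolygon",
                          "http://localhost:8080/geoserver/wfs?request=GetFeature"))
```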


Exercise 7 – Job: Solution


Exercise 7 – Transform.: Solution


Exercise 8

• Based on exercise 4, design a transformation that extracts the road network from the Ottawa OSM data file

• In this exercise we will play with the following new steps:
– Shapefile File Output
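Outside GeoKettle, the filtering idea behind this exercise can be sketched in a few lines of Python. The sketch assumes that "road network" means every OSM way carrying a highway tag, which is an interpretation and not necessarily the filter used in the workshop solution. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def road_ways(osm_xml):
    """Return (way id, highway value, node refs) for ways tagged as highways."""
    root = ET.fromstring(osm_xml)
    roads = []
    for way in root.iter("way"):
        # Collect the key/value tags attached to this way.
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if "highway" in tags:
            node_refs = [nd.get("ref") for nd in way.findall("nd")]
            roads.append((way.get("id"), tags["highway"], node_refs))
    return roads


if __name__ == "__main__":
    # Tiny hand-written sample, not taken from the Ottawa data file.
    sample = """<osm>
      <way id="1"><nd ref="10"/><nd ref="11"/><tag k="highway" v="residential"/></way>
      <way id="2"><nd ref="12"/><nd ref="13"/><tag k="building" v="yes"/></way>
    </osm>"""
    print(road_ways(sample))
```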


Exercise 8 – Solution


Exercise 9

• The aim of this exercise is to learn how to:
– Retrieve location information from Twitter tweets
– Call the GeoNames gazetteer service to retrieve lat/lon information for tweets that have no geo tag

• Listen to the instructor, who will explain how the Twitter and GeoNames services work

• In this exercise we will play with the following new steps:
– Unique rows (HashSet)
– Generate rows


Exercise 9

• Design a transformation that:
– Retrieves tweets mentioning the #foss4g tag
– For each tweet, checks whether geo information is present
– If not, uses the location information and calls the geonames.org gazetteer to retrieve the lat/lon of this location
– Stores the result in a new table named tweets in your geokettle database in the PostGIS DBMS
– Finally, publishes it in GeoServer
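The fallback logic of this transformation can be sketched in Python as follows. The tweet structure (id, geo, location keys) and the gazetteer_lookup callable are hypothetical stand-ins for the Twitter and GeoNames responses, which are not reproduced here. Each distinct location is looked up only once, which mirrors the intent of removing duplicate rows before calling the gazetteer.

```python
def locate_tweets(tweets, gazetteer_lookup):
    """Return (tweet id, lat, lon) tuples, geocoding untagged tweets by location."""
    cache = {}
    located = []
    for tweet in tweets:
        geo = tweet.get("geo")
        if geo is not None:
            # The tweet already carries coordinates: use them directly.
            located.append((tweet["id"], geo["lat"], geo["lon"]))
            continue
        place = (tweet.get("location") or "").strip()
        if not place:
            continue  # Nothing to geocode.
        if place not in cache:
            # One gazetteer call per distinct location string.
            cache[place] = gazetteer_lookup(place)
        coords = cache[place]
        if coords is not None:
            located.append((tweet["id"], coords[0], coords[1]))
    return located


if __name__ == "__main__":
    # Stub gazetteer with illustrative coordinates, no network access.
    def stub(place):
        return {"Denver, CO": (39.74, -104.99)}.get(place)

    sample = [
        {"id": 1, "geo": {"lat": 45.42, "lon": -75.69}},
        {"id": 2, "geo": None, "location": "Denver, CO"},
    ]
    print(locate_tweets(sample, stub))
```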


Exercise 9 – Solution


Conclusion


Upcoming features

• Versions 2.x will be the last versions of GeoKettle based on the Kettle 3.2 code base.

• Thanks to the tremendous work of the Kettle developers, future versions of GeoKettle will be more pluggable with Kettle

• Hence, it will be possible to add spatial extensions provided by GeoKettle to any Kettle/PDI 4.x installation.

• Taking advantage of this architecture switch, we want to re-engineer the Geometry data type.

• At present, it only supports 2D data.

• We want to allow support for:
– X, Y, Z, t and M data
– LiDAR data
– Linear referencing
– Raster data


Upcoming features

• So many tasks can be automated with GeoKettle.

• We can think about many new steps in future releases ...

• But, you know, the roadmap can be influenced by opportunities ...

• So, we are open to your ideas, opportunities and possible sponsoring to have your required feature implemented

• Spatialytics can also:
– Provide support (1st and 2nd line through partners)
– Provide advanced training
– Be your partner in tenders
– ...


Upcoming features

• Additional, non-exhaustive list of steps/jobs that could be envisaged:
– Additional geometric data cleansing and geo-processing capabilities:
  • Inclusion of some JCS/OpenJump conflation & topology checking/cleansing capabilities (GPL -> plugin)
  • Towards a geospatial data quality module to check and correct errors
– Read/write support for other DBMS, GIS file formats and services:
  • NetCDF, SDMX, Linked Geodata, ...
  • Native support for MS SQL Server 2008, Netezza spatial, NoSQL dbs, ...
  • Native support for WFS-T, WPS, WMS, Table Joining Service (TJS), ...
– Dedicated steps:
  • Social media (Twitter, ...), OSM, cartographic generalisation, geocoding & reverse geocoding ...
– Direct publishing into GeoServer and MapServer
– And why not use GeoKettle as a possible data source for these web servers ...
– Raster support: re-initiating the development of a plugin to integrate all raster capabilities provided by the Sextante library (BeETLe project)


To learn more about GeoKettle

• Do not hesitate to:
– Visit our web sites
  • http://www.spatialytics.com
  • http://www.spatialytics.org
– Subscribe to the monthly Spatialytics eNews letter
– Follow us on Twitter and Facebook
– Check the documentation on the wiki
– Post your questions on the forum
– Submit a bug report or feature request on the GeoKettle trac
– Contact us


Questions

Contact info:

Dr. Thierry Badard, CTO

Spatialytics inc.

Quebec, Canada

Email: [email protected]

Web: http://www.spatialytics.org

http://www.spatialytics.com

Twitter: tbadard, spatialytics

http://www.geokettle.org Twitter : geokettle

http://www.geo-mondrian.org Twitter : geomondrian

http://www.solaplayers.org Twitter : solaplayer

http://www.geobiext.org Twitter : geobiext