Making your data lovely!
Prioritising, cleaning, extraction, transformation, automation

Pia Waugh
Director of Gov 2.0 and Data
Department of Finance
Soon to be Prime Minister & Cabinet

Open data presentation on tools and automation



Page 2: Open data presentation on tools and automation


Key Benefits to the Public Service in Opening Data

• Efficiencies from proactively publishing common requests
• Cheaper and more modular services delivery
• Reduced regulatory burden through machine readable data supporting compliance and automated reporting
• Better policy outcomes by leveraging cross-agency data
• More consistency & less duplication across government
• Improved opportunities to leverage innovation and collaboration (citizens, industry, other depts)
• Opportunities to improve data quality through verifiable public contributions

Page 3: Open data presentation on tools and automation


Tips for ensuring benefits realisation of open data

• Adopt an approach of “data user and developer empathy”
• Data publishing built into your BAU
• Initial focus on data that helps you build capability
• Consume your own data APIs (apps, datavis, BI, etc)
• Ensure you consider:
    • Quality – no one can use bad data, but perfect is the enemy of the good
    • Currency – is it up to date? How often is it updated?
    • APIs – is it programmatically available?
    • Publishing – have you provided supporting materials (taxonomies)?
    • Discoverability – is it hosted or linked on data.gov.au?
    • Reusability – have you tested it with data users?
    • Licensing – Creative Commons By Attribution the default
• Automation wherever possible!

Page 4: Open data presentation on tools and automation


Data on the inside

• Do you know what data you have internally?
• Are you considering all data types?
• How embedded is data driven decision making?
• How can you upskill the whole organisation?
• Do you know what your external data needs are?
• How are you measuring and monitoring success?

Data infrastructure to support your organisation should be extendable to support sharing/publishing

Page 5: Open data presentation on tools and automation


Rub a dub data

• If a machine can’t read it, a machine can’t make an API
• Some data has specialised data formats, some commonalities
    • Tabular, spatial, real time, unstructured, etc
• Most data comes from somewhere, use the source, Luke!
• Machines and humans have different needs

Page 6: Open data presentation on tools and automation


What you need is clean sheets

• Don’t merge cells. Sorting and other manipulations people may want to apply to your data assume that each cell belongs to one row and column.
• Don’t mix data and metadata (e.g. date of release, name of author) in the same sheet.
• The first row of a data sheet should contain column headers. None of these headers should be duplicates or blank. The column header should clearly indicate which units are used in that column, where this makes sense.
• The remaining rows should contain data, one datum per row. Don’t include aggregate statistics such as TOTAL or AVERAGE. You can put aggregate statistics in a separate sheet, if they are important.
• Numbers in cells should just be numbers. Don’t put commas in them, or stars after them, or anything else. If you need to add an annotation to some rows, use a separate column.
• Use standard identifiers: e.g. identify countries using ISO 3166 codes rather than names.
• Don’t use only colour or other stylistic cues to encode information. If you want to colour cells according to their value, use conditional formatting.
• Leave the cell blank if a value is not available.
• If you provide pivot tables, make sure the underlying data is available separately too.
• If you also want to create a human-friendly presentation of the data, do so by creating another sheet in the same workbook and referencing the appropriate cells in the canonical data sheet.

http://www.clean-sheet.org/
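
Several of the rules above can be checked by a machine before a sheet is published. The following is a minimal sketch, assuming the sheet has been exported to CSV. The function name `check_sheet`, the list of aggregate labels and the "messy number" pattern are illustrative choices, not part of the clean-sheet guidance itself.

```python
import csv
import io
import re

# Illustrative assumptions: which first-column labels count as aggregate rows,
# and what a number "with commas or stars" looks like.
AGGREGATE_LABELS = {"total", "average", "sum", "mean"}
MESSY_NUMBER = re.compile(r"^\d{1,3}(,\d{3})+(\.\d+)?\*?$|^\d+(\.\d+)?\*+$")


def check_sheet(csv_text):
    """Return a list of human-readable problems found in a CSV export of a sheet."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["sheet is empty"]

    problems = []

    # Rule: the first row holds column headers, none blank, none duplicated.
    headers = [header.strip() for header in rows[0]]
    if any(header == "" for header in headers):
        problems.append("blank column header")
    if len(set(headers)) != len(headers):
        problems.append("duplicate column header")

    for line_no, row in enumerate(rows[1:], start=2):
        # Rule: no aggregate statistics such as TOTAL or AVERAGE among the data rows.
        if row and row[0].strip().lower() in AGGREGATE_LABELS:
            problems.append(f"row {line_no}: aggregate row '{row[0].strip()}'")
        # Rule: numbers should just be numbers, with no commas or trailing stars.
        for cell in row:
            if MESSY_NUMBER.match(cell.strip()):
                problems.append(
                    f"row {line_no}: number with commas or stars '{cell.strip()}'"
                )
    return problems
```

Running a check like this as part of a publishing script gives a quick list of rows to fix. Merged cells and colour-only encoding are lost in a CSV export, so those rules still need a check in the original workbook.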

Page 7: Open data presentation on tools and automation


Automate your reporting

http://ckan.org/2015/09/18/pyramids-pipelines-and-a-can-of-sweave-ckan-asia-pacific-meetup/

Page 8: Open data presentation on tools and automation


Automating updates

Automation involves system to system updates to save you time & money.

Three broad approaches:
1. Write scripts to push or pull data updates using an API directly from the source. Usually doesn’t require much data manipulation.
2. Adopt a tool like Taverna, FME or Splunk to extract, clean/manipulate, and then push data to the data.gov.au (CKAN/geoserver) API directly.
3. Use data.gov.au (CKAN) to schedule pull updates from your data, but most agencies don’t do that as they prefer to push updates.

The data.gov.au team strongly encourage you to gain at least one geek in your data team so you can experiment with code and tools to best meet your needs.

“With much help and encouragement from the support team at data.gov.au, we dipped our toes into the CKAN API waters. As a DotNet shop we were keen to limit the technology landscape and sought to automate the upload using DotNet. The CKAN API is refreshingly lightweight with a simple authentication process and messaging.” -- ABN Lookup Team

Code at https://github.com/datagovau/ckan-api-examples
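
The first approach (a script that pushes an update) might look something like the sketch below. It uses only the Python standard library and CKAN's `resource_patch` action, which changes the named fields of an existing resource. The portal address, API key, resource ID and source URL are all hypothetical placeholders, and the function names are made up for illustration. This is not the code from the repository linked above.

```python
import json
import urllib.request


def build_resource_patch_request(ckan_url, api_key, resource_id, changes):
    """Build (but do not send) a POST to CKAN's resource_patch action."""
    payload = dict(changes, id=resource_id)
    return urllib.request.Request(
        url=f"{ckan_url.rstrip('/')}/api/3/action/resource_patch",
        data=json.dumps(payload).encode("utf-8"),
        # CKAN reads the API key from the Authorization header.
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def push_update(request):
    """Send the request and return CKAN's decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical values: replace with your own portal, key, resource and source URL.
request = build_resource_patch_request(
    "https://ckan.example.gov.au",
    "my-api-key",
    "my-resource-id",
    {"url": "https://example.gov.au/exports/latest.csv"},
)
# push_update(request)  # uncomment to actually send the update
```

Keeping the request building separate from the sending makes the script easy to test without touching a live portal. A scheduled job can then call `push_update` after each export.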

Page 9: Open data presentation on tools and automation


Support

• http://toolkit.data.gov.au is updated regularly. Recent updates include:
    • How to automate data updates to data.gov.au with FME
    • Improved information on how to clean data
    • How to manage your own catalogue harvesting
    • Government data landscape to identify projects of use
• Open Data Community Forum – soon to be moved to analyticsspace
• Talk to your colleagues across government(s)
• Other sources
    • Communities of interest: Data Science Meetup groups, Data Analytics Centre of Excellence, Linked Data Working Group, National Statistical Service, etc
    • GovHack Developers Kit: Become a data scientist in an hour, data tools, APIs, datavis, spatial, mashup techniques, statistical

Page 10: Open data presentation on tools and automation


Quality – improve over time

The 5 Star Data Quality standard developed by Sir Tim Berners-Lee will be used on data.gov.au in the coming month or two to indicate data quality.

Aim for quality web services.

API quality will also be looked at soon, including potentially a 5 star API standard.

http://5stardata.info/en/
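
To get a rough feel for where a dataset sits on the 5 star scale, a file format can be mapped to a star count. The sketch below is an illustrative heuristic only: the format table and the function name `star_rating` are assumptions, and it is not the plugin planned for data.gov.au. A real assessment also needs a person to confirm the licence and whether the data genuinely links to other data.

```python
# Illustrative mapping from file format to stars (an assumption, not an official table).
STARS_BY_FORMAT = {
    "pdf": 1, "doc": 1, "jpg": 1,      # on the web, but not structured data
    "xls": 2, "xlsx": 2,               # structured, but tied to Excel
    "csv": 3, "json": 3, "xml": 3,     # structured and non-proprietary
    "rdf": 4, "ttl": 4, "jsonld": 4,   # uses URIs to identify things
}


def star_rating(file_format, openly_licensed=True, links_to_other_data=False):
    """Estimate a 5 star rating from a file format and two yes/no answers."""
    # Without an open licence the data does not earn even the first star.
    if not openly_licensed:
        return 0
    # Unknown formats are treated as "on the web" only, i.e. one star.
    stars = STARS_BY_FORMAT.get(file_format.strip().lower().lstrip("."), 1)
    # The fifth star is for data that also links to other data.
    if stars == 4 and links_to_other_data:
        return 5
    return stars
```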

Page 11: Open data presentation on tools and automation


Data integration and aggregation

• Challenging but great potential for improved policy/services.
• Unit record sharing is complex, with privacy concerns for personal data.
• Personal unit record data is mostly useful to researchers; access to such data needs appropriate mechanisms with legal, technical and ethical constraints.
• Data aggregated by common spatial boundaries is comparable across datasets and over time.
• Unfortunately, data owners traditionally aggregate to boundaries that constantly change (electorates, postcodes, etc).
• The Australian Statistical Geography Standard (ASGS) provides a consistent set of spatial boundaries that can be mapped to other needs.
• Anonymisation-on-the-fly APIs also provide a mechanism for appropriate public/agency access to unit record level data (e.g. ABS.Stat)

http://statistical-data-integration.govspace.gov.au/

https://toolkit.data.gov.au/index.php?title=Definitions#Types_of_data
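
As a rough illustration of aggregating unit records to a common spatial boundary, the sketch below counts records per region and hides small counts. The field name `sa2_code`, the region codes used with it and the threshold of 5 are all hypothetical. Real confidentialisation, such as that behind ABS.Stat, is considerably more involved than this.

```python
from collections import Counter


def aggregate_by_region(unit_records, region_key="sa2_code", min_count=5):
    """Count records per region, replacing counts below min_count with None."""
    counts = Counter(record[region_key] for record in unit_records)
    # Small cells are suppressed so that individuals are harder to identify.
    return {
        region: (count if count >= min_count else None)
        for region, count in sorted(counts.items())
    }
```

Because every dataset is reduced to the same region codes, the resulting tables can be joined and compared without sharing the underlying unit records.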

Page 12: Open data presentation on tools and automation


data.gov.au

Free, cloud, scalable API enabled platform for hosting government data.

Staged approach
1. Publishing (2013 – mid 2014)
   Improving the functionality and ease of publishing for agencies with training and documentation
2. Value realisation (2014-2015)
   Providing useful front end tools for data.gov.au including data visualisation and analysis tools. Publishing quality data a pre-requisite.
3. Data quality (2014-2015)
   Looking at ways to provide agencies the ability to accept iterative data improvements in a verifiable way

Features
• Support for tabular, spatial and data models
• Options for hosting, linking or catalogue harvesting
• Manual and automated publishing options
• API access to government data
• Easy to publish, download & interact
• Use cases and site|data|org analytics
• Data Request Site
• Metadata harvesting from gov data gateways
• National Map integration
• Federated search for discoverability

In Planning
• 5 star quality plugin
• Selective crowdsourcing for updates
• League Table
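
API access to tabular data on a CKAN portal typically goes through the `datastore_search` action, which returns rows of a resource as JSON. The following is a minimal sketch using only the Python standard library. The portal address and resource ID are hypothetical placeholders, the function names are illustrative, and it assumes the resource has been loaded into the CKAN DataStore.

```python
import json
import urllib.parse
import urllib.request


def datastore_search_url(ckan_url, resource_id, limit=5):
    """Build the URL for a datastore_search call that returns up to `limit` rows."""
    query = urllib.parse.urlencode({"resource_id": resource_id, "limit": limit})
    return f"{ckan_url.rstrip('/')}/api/3/action/datastore_search?{query}"


def records_from_response(body):
    """Extract the list of rows from a datastore_search JSON response body."""
    response = json.loads(body)
    if not response.get("success"):
        raise RuntimeError(f"CKAN reported an error: {response.get('error')}")
    return response["result"]["records"]


def fetch_records(ckan_url, resource_id, limit=5):
    """Fetch rows over the network (not called here; needs a live portal)."""
    url = datastore_search_url(ckan_url, resource_id, limit)
    with urllib.request.urlopen(url) as response:
        return records_from_response(response.read())


# Hypothetical portal and resource ID, for illustration only.
example_url = datastore_search_url("https://ckan.example.gov.au", "abc-123")
```

Consuming your own published data this way, in an app or a dashboard, is a quick test of whether it is genuinely usable by others.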

Page 13: Open data presentation on tools and automation


Open Data Portals

Council Portals:
• City of Melbourne
• City of Brisbane

Page 14: Open data presentation on tools and automation


Some Case Studies

• Publishing Budget 2014 Data Report
• Open data – Transforming the Provider / Stakeholder Paradigm
• On the Value of Open Roof Prints
• 100 years of patent and IP data released on data.gov.au

More available along with tech support at http://toolkit.data.gov.au

Other Australian case studies/documentation
• SA Open Data Toolkit
• QLD Government Case Studies
• Victorian Government Showcase
• NSW Apps Showcase
• ACT examples

Page 15: Open data presentation on tools and automation


The future is here....
And it is already widely distributed

http://www.flickr.com/photos/mr_matt/3568892622/

Challenge #1: Collaborate
Challenge #2: Share
Challenge #3: Measure
Challenge #4: Play

Questions?

@piawaugh

@datagovau

data.gov.au

toolkit.data.gov.au