Autodiscovery or The long tail of open data Christopher Gutteridge University of Southampton & data.ac.uk


Page 1: Autodiscovery or The long tail of open data

Autodiscovery

or

The long tail of open data

Christopher Gutteridge

University of Southampton

& data.ac.uk

Page 2: Autodiscovery or The long tail of open data

Bragsheet

Christopher Gutteridge - @cgutteridge

• Previously: Lead Developer of EPrints (open access research repository software).

• “Linked Open Data Architect” for University of Southampton (or whatever we’re currently calling doing LOD stuff for an organisation).

• Benevolent technical dictator of data.ac.uk (recently deposed)

• Webmaster, WWW2006

• Assistant Webmaster, WWW2007, WWW2009

Page 3: Autodiscovery or The long tail of open data

Image Attributions:

• Backgrounds:
– http://www.fansshare.com/gallery/photos/14646865/abstract-background-brown-and-blue-circles/
– http://www.pptback.com/old-machine-gears-pptbackground.html
• Cliff leap pic: Justin De La Ornellas @ Flickr
• Train tracks: duncanh1 @ Flickr
• Lego bricks: rawdonfox @ Flickr
• Mechano Box: Lady alys @ Wikipedia
• Stickle Bricks: Simon Jobling @ Flickr
• Free Universal Construction Kit: F.A.T. Lab + Sy-Lab.
• Telescope: Brongaeh @ Flickr
• Pinata: Peasap @ Flickr
• Containers: l2f1 @ Flickr

Page 4: Autodiscovery or The long tail of open data
Page 5: Autodiscovery or The long tail of open data

Why don’t organisations

share data?

(and what stops them)

Page 6: Autodiscovery or The long tail of open data

Us early adopters have shared data because it’s cool.

We were not 100% clear on the benefits, but it looked like fun and might gain us some reputation.

Page 7: Autodiscovery or The long tail of open data

Fear. Uncertainty. Doubt.

Page 8: Autodiscovery or The long tail of open data

Open Data Excuse Bingo

• Terrorists will use it
• We'll get spam
• It's too big
• It's not very interesting
• Thieves will use it
• I don't mind, but someone else might
• We will get too many enquiries
• Lawyers want a custom licence
• There's no API
• Poor quality
• There's already a project to...
• We might want to use it in a paper
• It's too complicated
• Data Protection
• People may misinterpret the data
• What if we want to sell it later

Don’t get depressed! Go here for antidotes: http://is.gd/odbingo

Page 9: Autodiscovery or The long tail of open data

Menu

Burger ….. £3.50
Chips ….. £1.50

≠

Page 10: Autodiscovery or The long tail of open data

Greater than the sum of

its parts

Page 11: Autodiscovery or The long tail of open data

Interoperable datasets

allow results that are

greater than the sum

of the parts…


Page 12: Autodiscovery or The long tail of open data


http://bus.southampton.ac.uk/

Page 13: Autodiscovery or The long tail of open data


Page 14: Autodiscovery or The long tail of open data


Page 15: Autodiscovery or The long tail of open data


Page 16: Autodiscovery or The long tail of open data


Page 17: Autodiscovery or The long tail of open data
Page 18: Autodiscovery or The long tail of open data

http://www.minecraftworldmap.com/worlds/xO3X4/full#/4469/64/-1806/-3/0/0

Page 19: Autodiscovery or The long tail of open data

data.southampton.ac.uk

Page 20: Autodiscovery or The long tail of open data

Discrete Facts

Statistics

Page 21: Autodiscovery or The long tail of open data

What I want from data

• Where am I going?

• How can I get there?

• Where can I get a coffee en route?

Page 22: Autodiscovery or The long tail of open data
Page 23: Autodiscovery or The long tail of open data

Why aren’t they using

our data?

Page 24: Autodiscovery or The long tail of open data

“If you build it, they will come.”

Page 25: Autodiscovery or The long tail of open data

“If you build it, they will come.”

Page 26: Autodiscovery or The long tail of open data

Probability of open dataset reuse =

Value of dataset to audience
× Potential audience size
× Ease of discovery
× Ease of grasping the value of the dataset
× Ease of exploiting dataset

Page 27: Autodiscovery or The long tail of open data

Probability of open dataset reuse =

Value of dataset to audience
× Potential audience size
× Ease of discovery
× Ease of grasping the value of the dataset
× Ease of exploiting dataset
× Perceived quality & reliability
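The point of a product rather than a sum is that one weak factor drags the whole thing down. A minimal Python sketch of that idea follows. The factor names come from the slide; scoring each factor between 0 and 1, the function name `reuse_score`, and the sample numbers are all assumptions made for illustration, not measurements.

```python
from math import prod

# Factor names follow the slide. Scoring each one from 0.0 to 1.0 is an
# assumption of this sketch, not something the talk specifies.
FACTORS = (
    "value_to_audience",
    "potential_audience_size",
    "ease_of_discovery",
    "ease_of_grasping_value",
    "ease_of_exploiting",
    "perceived_quality_and_reliability",
)


def reuse_score(scores: dict[str, float]) -> float:
    """Multiply the six factors together after checking each is in range."""
    for name in FACTORS:
        value = scores[name]  # raises KeyError if a factor was left out
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return prod(scores[name] for name in FACTORS)


# Illustrative numbers only: two otherwise identical datasets, one of which
# nobody can find.
hidden = dict.fromkeys(FACTORS, 0.9) | {"ease_of_discovery": 0.0}
findable = dict.fromkeys(FACTORS, 0.9) | {"ease_of_discovery": 0.5}

print(reuse_score(hidden))    # zero: a great dataset nobody can discover
print(reuse_score(findable))  # non-zero once discovery is even half-decent
```

Because the factors multiply, improving discovery from nothing to something matters more than polishing a factor that is already good.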

Page 28: Autodiscovery or The long tail of open data

…Autodiscoverable

and interoperable data

can massively increase

the potential audience


Page 29: Autodiscovery or The long tail of open data

$ ./generate-world Demo --postcode PO381NL --size 250


Page 30: Autodiscovery or The long tail of open data

$ ./generate-world Demo --postcode PO381NL --size 250


Page 31: Autodiscovery or The long tail of open data

data.ac.uk

Page 32: Autodiscovery or The long tail of open data

• Automatically discovers equipment data from all .ac.uk sites

– 2769 websites

– 42 providing data

– 11,028 records

• Automation massively reduces staffing costs

• Low effort for institutions

– A third just provide a well-structured spreadsheet!

• Not a single-point-of-failure


Page 33: Autodiscovery or The long tail of open data


Page 34: Autodiscovery or The long tail of open data

UK National Equipment Portal

http://equipment.data.ac.uk

Page 35: Autodiscovery or The long tail of open data

UNIQUIP

Column Heading                 Required
Type                           No
Name                           At least one of these fields (Name, Description) must be completed.
Description                    (as above)
Related Facility ID            No
Technique (:cpv) or (:N8)      No
Location                       No
Contact Name                   No
Contact Telephone              At least one of these fields (Contact Telephone, Contact URL, Contact Email) must be completed.
Contact URL                    (as above)
Contact Email                  (as above)
Secondary Contact Name         No
Secondary Contact Telephone    At least one of these fields must be completed with second contact name.
Secondary Contact URL          (as above)
Secondary Contact Email        (as above)
ID                             No
Photo                          No
Department                     No
Site Location                  Yes
Building                       No
Service Level                  No
Web Address                    No
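Rules like these are easy to check mechanically before a spreadsheet is published. The following is a rough Python sketch, not the portal's actual validator. The column names are taken from the table above; treating a record as a dict of strings (for example one `csv.DictReader` row), the function names, and the reading that secondary contact details only matter once a secondary contact name is given are all assumptions.

```python
# Column names are from the slide. Everything else here (function names,
# the dict-per-row representation) is hypothetical.
ALWAYS_REQUIRED = ["Site Location"]
AT_LEAST_ONE = [
    ["Name", "Description"],
    ["Contact Telephone", "Contact URL", "Contact Email"],
]
SECONDARY_NAME = "Secondary Contact Name"
SECONDARY_DETAILS = [
    "Secondary Contact Telephone",
    "Secondary Contact URL",
    "Secondary Contact Email",
]


def filled(row: dict[str, str], column: str) -> bool:
    """True when the column exists and holds more than whitespace."""
    return bool((row.get(column) or "").strip())


def validate(row: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the row passes."""
    problems = []
    for column in ALWAYS_REQUIRED:
        if not filled(row, column):
            problems.append(f"'{column}' is required")
    for group in AT_LEAST_ONE:
        if not any(filled(row, c) for c in group):
            problems.append("need at least one of: " + ", ".join(group))
    # Assumed reading of the slide: secondary details are only needed when
    # a secondary contact name has been supplied.
    if filled(row, SECONDARY_NAME) and not any(
        filled(row, c) for c in SECONDARY_DETAILS
    ):
        problems.append("need at least one of: " + ", ".join(SECONDARY_DETAILS))
    return problems


# Made-up records for illustration.
good = {
    "Name": "Mass spectrometer",
    "Contact Email": "lab@example.ac.uk",
    "Site Location": "Main campus",
}
bad = {"Name": "Microscope"}

print(validate(good))
print(validate(bad))
```

Returning a list of problems rather than raising on the first one means a whole spreadsheet can be checked in one pass and every issue reported back to the institution at once.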

Page 36: Autodiscovery or The long tail of open data


Page 37: Autodiscovery or The long tail of open data

Doin’ it on the cheap


Page 38: Autodiscovery or The long tail of open data

Doin’ it on the cheap


Page 39: Autodiscovery or The long tail of open data

Ensuring a sustainable

service through

autodiscovery


Page 40: Autodiscovery or The long tail of open data

Sustainability via Autodiscovery

• How do we add new datasets?

• How are changes made?

• How do we know the data is open data?

Page 41: Autodiscovery or The long tail of open data

Sustainability via Autodiscovery

• Have a machine-readable document describing the institution and any open datasets (with licences)

• Place a link to it on the institution’s homepage

Page 42: Autodiscovery or The long tail of open data

/.well-known/openorg

http://www.soton.ac.uk/.well-known/openorg

or

<link rel="openorg" href="http://id.southampton.ac.uk/dataset/profile/latest">

Page 43: Autodiscovery or The long tail of open data

/.well-known/openorg

http://www.soton.ac.uk/.well-known/openorg

or

<link rel="openorg" href="http://id.southampton.ac.uk/dataset/profile/latest">
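From a crawler's point of view, the two routes above come down to building one URL and scanning one HTML page. Here is a minimal Python sketch using only the standard library. The function and class names are made up for this example, the sample page is invented, and no network access is shown; a real crawler would also need fetching, timeouts and error handling.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


def well_known_url(homepage: str) -> str:
    """Build the /.well-known/openorg address for a site."""
    return urljoin(homepage, "/.well-known/openorg")


class OpenorgLinkFinder(HTMLParser):
    """Collects href values of <link> elements whose rel includes 'openorg'."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel is a space-separated token list, so split before comparing.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "openorg" in rel_tokens and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_openorg_link(html: str, homepage: str) -> Optional[str]:
    """Return the first openorg link on the page as an absolute URL, if any."""
    finder = OpenorgLinkFinder()
    finder.feed(html)
    return urljoin(homepage, finder.hrefs[0]) if finder.hrefs else None


# Invented homepage snippet; the href is the one shown on the slide.
sample_page = (
    '<html><head><link rel="openorg" '
    'href="http://id.southampton.ac.uk/dataset/profile/latest"></head></html>'
)

print(well_known_url("http://www.soton.ac.uk/"))
print(find_openorg_link(sample_page, "http://www.soton.ac.uk/"))
```

Note that the HTML attribute values need plain straight quotes; the curly quotes that presentation software likes to substitute will not parse as attributes.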

Page 44: Autodiscovery or The long tail of open data

What is an Organisation Profile Document?

An RDF document that describes the organisation:

– General information provided:

• Official name, postal address, contact phone number, the correct logo, physical location

– Links to the parts of the organisation:

• Admissions, Alumni, Freedom of Information, Complaints

– A semantic sitemap:

• Key pages such as jobs, news, events…

– Links to the organisation’s discoverable open data sets and APIs:

• The equipment dataset
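To make the shape of that information concrete, here is a plain-Python sketch of what a consumer might hold in memory after reading a profile. This is not the RDF vocabulary itself: the class names, field names, URLs and licence value are all invented for illustration, and only the kinds of information listed above come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetLink:
    url: str
    licence: str  # the deck asks for datasets to be described with licences


@dataclass
class OrganisationProfile:
    # Field names are hypothetical; they mirror the headings on the slide.
    official_name: str
    postal_address: str = ""
    phone: str = ""
    logo_url: str = ""
    parts: dict[str, str] = field(default_factory=dict)      # "Admissions" -> URL
    key_pages: dict[str, str] = field(default_factory=dict)  # "jobs" -> URL
    datasets: dict[str, DatasetLink] = field(default_factory=dict)


# Placeholder organisation, not a real profile.
profile = OrganisationProfile(
    official_name="Example University",
    parts={"Admissions": "http://www.example.ac.uk/admissions"},
    key_pages={"jobs": "http://www.example.ac.uk/jobs"},
    datasets={
        "equipment": DatasetLink(
            url="http://www.example.ac.uk/equipment.csv",
            licence="placeholder-open-licence",
        )
    },
)

# An aggregator only needs to ask the profile where the dataset lives.
print(profile.datasets["equipment"].url)
```

The useful property is that the aggregator stores the profile's address, not the dataset's, so the organisation can move the dataset without anyone having to be told.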

Page 45: Autodiscovery or The long tail of open data

What is an Organisation Profile Document?

Page 46: Autodiscovery or The long tail of open data


Page 47: Autodiscovery or The long tail of open data

Autodiscovery


Page 48: Autodiscovery or The long tail of open data

Autodiscovery

• Dataset publicly available on website.

• Dataset has to be added manually, along with all the institution’s details, contacts, etc.

Requires staff time (especially if any dataset changes location).

Page 49: Autodiscovery or The long tail of open data

Autodiscovery

• Dataset publicly available on website.

• Dataset has to be added manually, along with all the institution’s details, contacts, etc.

Requires staff time (especially if any dataset changes location).

• Organisation has an OPD linking to dataset

• The OPD has to be added manually, but the dataset location and institution info are consumed directly from the OPD.

Requires less staff time (as any changes made to the OPD are picked up automatically).

Page 50: Autodiscovery or The long tail of open data

Autodiscovery

• Dataset publicly available on website.

• Dataset has to be added manually, along with all the institution’s details, contacts, etc.

Requires staff time (especially if any dataset changes location).

• Organisation has an OPD linking to dataset

• The OPD has to be added manually, but the dataset location and institution info are consumed directly from the OPD.

Requires less staff time (as any changes made to the OPD are picked up automatically).

• Link to OPD from organisation’s home page

• OPD autodiscovered, so the dataset is automatically added to the service.

Requires no staff time (as data is autodiscovered).

Page 51: Autodiscovery or The long tail of open data

Never appeal to a man’s “better nature.” He may not have one.

Invoking his “self-interest” gives you more leverage.

- Robert Heinlein, “The Notebooks of Lazarus Long”

Page 52: Autodiscovery or The long tail of open data

Status Report – Contributors and data statistics


Page 53: Autodiscovery or The long tail of open data

Bronze Silver Gold

Data is on the internet and in an acceptable format.

✔ ✔ ✔

Description of dataset is provided by a remotely hosted OPD

✔ ✔

The OPD is discovered via autodiscovery.

The OPD/dataset has a recognised and supported open licence (e.g. CC0, ODCA or OGL)

Page 54: Autodiscovery or The long tail of open data

Bronze Silver Gold

Data is on the internet and in an acceptable format.

✔ ✔ ✔

Description of dataset is provided by a remotely hosted OPD

✔ ✔

The OPD is discovered via autodiscovery.

The OPD/dataset has a recognised and supported open licence (e.g. CC0, ODCA or OGL)

All items in the dataset are assigned an ID code which is unique within the assigning organisation.

Page 55: Autodiscovery or The long tail of open data
Page 56: Autodiscovery or The long tail of open data
Page 57: Autodiscovery or The long tail of open data

Exploiting profile

documents

Page 58: Autodiscovery or The long tail of open data

Exploiting profile documents

• We’ve barely begun

• Let’s try a live demo…

Page 59: Autodiscovery or The long tail of open data
Page 60: Autodiscovery or The long tail of open data
Page 61: Autodiscovery or The long tail of open data
Page 62: Autodiscovery or The long tail of open data

Warning:

Metaphor mixing detected

Page 63: Autodiscovery or The long tail of open data

Needless heterogeneity means research doesn’t join up.

Aligning datasets every time costs too much.

Tools can’t be reused.

Page 64: Autodiscovery or The long tail of open data

So what do we do about it?

Page 65: Autodiscovery or The long tail of open data
Page 66: Autodiscovery or The long tail of open data

Building easy-to-use tools to cross between formats, platforms and paradigms is very specialist work.

Page 67: Autodiscovery or The long tail of open data

Building easy-to-use tools to cross between formats, platforms and paradigms is very specialist work.

The solutions need to be discoverable.

Page 68: Autodiscovery or The long tail of open data

Building easy-to-use tools to cross between formats, platforms and paradigms is very specialist work.

The solutions need to be discoverable.

Just putting it on GitHub is not making a tool discoverable!

Page 69: Autodiscovery or The long tail of open data

Building easy-to-use tools to cross between formats, platforms and paradigms is very specialist work.

The solutions need to be discoverable.

Just putting it on GitHub is not making a tool discoverable!

https://github.com/cgutteridge/

Page 70: Autodiscovery or The long tail of open data

Organisation Datasets

Well known formats available for:

• Events

• Publications

• News headlines

Nothing in common use for:

• Staff Expertise

• Programmes of Events

• Vacancies

• Organisational Structure

• Buildings, Rooms

• Points of service

• Products

– Food Menus

Page 71: Autodiscovery or The long tail of open data
Page 72: Autodiscovery or The long tail of open data

RDF or XML Vocabularies don’t solve the problem

by themselves.

You need:

Examples to copy.

Tools which consume and produce the format.

Online checking tools.

Page 73: Autodiscovery or The long tail of open data

A dataset should solve at least one use case.

Over-modelling is fun.

Stop it.

Page 74: Autodiscovery or The long tail of open data

• TODO:

• OPD DOCUMENTATION

Page 75: Autodiscovery or The long tail of open data
Page 76: Autodiscovery or The long tail of open data

Thank you.

Christopher Gutteridge
University of Southampton
@cgutteridge
[email protected]
opd.data.ac.uk