Autodiscovery or The long tail of open data

Christopher Gutteridge

University of Southampton

& data.ac.uk

Bragsheet

Christopher Gutteridge - @cgutteridge

• Previously: Lead Developer of EPrints (open access research repository software).

• “Linked Open Data Architect” for University of Southampton (or whatever we’re currently calling doing LOD stuff for an organisation).

• Benevolent technical dictator of data.ac.uk (recently deposed)

• Webmaster WWW2006

• Assistant Webmaster WWW2007, WWW2009

Image Attributions:

• Backgrounds:

– http://www.fansshare.com/gallery/photos/14646865/abstract-background-brown-and-blue-circles/

– http://www.pptback.com/old-machine-gears-pptbackground.html

• Cliff leap pic: Justin De La Ornellas @ Flickr

• Train tracks: duncanh1 @ Flickr

• Lego bricks: rawdonfox @ Flickr

• Mechano Box: Lady alys @ Wikipedia

• Stickle Bricks: Simon Jobling @ Flickr

• Free Universal Construction Kit: F.A.T. Lab + Sy-Lab.

• Telescope: Brongaeh @ Flickr

• Pinata: Peasap @ Flickr

• Containers: l2f1 @ Flickr

Why don’t organisations share data? (and what stops them)

Us early adopters have shared data because it’s cool.

We were not 100% clear on the benefits, but it looked like fun and might gain us some reputation.

Fear. Uncertainty. Doubt.

Open Data Excuse Bingo

• Terrorists will use it

• We'll get spam

• It's too big

• It's not very interesting

• Thieves will use it

• I don't mind, but someone else might

• We will get too many enquiries

• Lawyers want a custom License

• There's no API

• Poor Quality

• There's already a project to...

• We might want to use it in a paper

• It's too complicated

• Data Protection

• People may misinterpret the data

• What if we want to sell it later

Don’t get depressed! Go here for antidotes: http://is.gd/odbingo

Menu

Burger ….. £3.50

Chips ….. £1.50

≠

Greater than the sum of its parts

Interoperable datasets allow results that are greater than the sum of the parts…

http://bus.southampton.ac.uk/

http://www.minecraftworldmap.com/worlds/xO3X4/full#/4469/64/-1806/-3/0/0

data.southampton.ac.uk

Discrete Facts

Statistics

What I want from data

• Where am I going?

• How can I get there?

• Where can I get a coffee en route?

Why aren’t they using our data?

“If you build it, they will come.”

Probability of open dataset reuse =

Value of dataset to audience
× Potential audience size
× Ease of discovery
× Ease of grasping the value of the dataset
× Ease of exploiting dataset
× Perceived quality & reliability
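The multiplicative form is the point: if any one factor is close to zero, so is the chance of reuse, however strong the others are. A minimal sketch, assuming each factor is scored between 0 and 1 (the slide gives no scale, and the scores below are invented for illustration):

```python
from math import prod

# Factor names follow the slide; the 0..1 scoring scale is an assumption.
FACTORS = (
    "value_to_audience",
    "potential_audience_size",
    "ease_of_discovery",
    "ease_of_grasping_value",
    "ease_of_exploiting",
    "perceived_quality_and_reliability",
)


def reuse_score(scores: dict[str, float]) -> float:
    """Multiply the six factors together; a missing factor raises KeyError."""
    return prod(scores[name] for name in FACTORS)


# Hypothetical dataset: strong in every respect, except that nobody can find it.
hidden_gem = dict.fromkeys(FACTORS, 0.9) | {"ease_of_discovery": 0.01}
# The same dataset once it is autodiscoverable.
discoverable = hidden_gem | {"ease_of_discovery": 0.9}

print(f"hidden:       {reuse_score(hidden_gem):.4f}")
print(f"discoverable: {reuse_score(discoverable):.4f}")
```

Improving only the weakest factor moves the overall score far more than polishing factors that are already good.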

…Autodiscoverable and interoperable data can massively increase the potential audience

$ ./generate-world Demo --postcode PO381NL --size 250

data.ac.uk

• Automatically discovers equipment data from all .ac.uk sites

– 2769 websites

– 42 providing data

– 11,028 records

• Automation massively reduces staffing costs

• Low effort for institutions:

– A third just provide a well-structured spreadsheet!

• Not a single-point-of-failure

.ac.uk

UK National Equipment Portal

http://equipment.data.ac.uk

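UNIQUIP

Column Heading                 Required
Type                           No
Name                           At least one of Name and Description must be completed.
Description                    At least one of Name and Description must be completed.
Related Facility ID            No
Technique (:cpv) or (:N8)      No
Location                       No
Contact Name                   No
Contact Telephone              At least one of Contact Telephone, URL and Email must be completed.
Contact URL                    At least one of Contact Telephone, URL and Email must be completed.
Contact Email                  At least one of Contact Telephone, URL and Email must be completed.
Secondary Contact Name         No
Secondary Contact Telephone    At least one of these fields must be completed with second contact name.
Secondary Contact URL          At least one of these fields must be completed with second contact name.
Secondary Contact Email        At least one of these fields must be completed with second contact name.
ID                             No
Photo                          No
Department                     No
Site Location                  Yes
Building                       No
Service Level                  No
Web Address                    No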
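The required-field rules are simple enough to check mechanically before a spreadsheet is published. A minimal sketch, not the official validator: the column names come from the UNIQUIP table, the function names and sample rows are invented, and the secondary-contact rule is read here as "if a secondary contact name is given, at least one secondary contact detail must be too".

```python
import csv
import io

# Column names are taken from the UNIQUIP table; the rule encoding below is
# one reading of that table, not the official checker.
CONTACT = ("Contact Telephone", "Contact URL", "Contact Email")
SECONDARY = (
    "Secondary Contact Telephone",
    "Secondary Contact URL",
    "Secondary Contact Email",
)


def filled(row, column):
    """True if the column exists in the row and holds non-blank text."""
    return bool((row.get(column) or "").strip())


def check_row(row):
    """Return a list of problems found in one spreadsheet row."""
    problems = []
    if not filled(row, "Site Location"):
        problems.append("Site Location is required")
    if not (filled(row, "Name") or filled(row, "Description")):
        problems.append("need Name or Description")
    if not any(filled(row, c) for c in CONTACT):
        problems.append("need a contact telephone, URL or email")
    if filled(row, "Secondary Contact Name") and not any(
        filled(row, c) for c in SECONDARY
    ):
        problems.append("secondary contact needs a telephone, URL or email")
    return problems


# Made-up sample data: the first row passes, the second breaks every rule.
SAMPLE = """Name,Site Location,Contact Email,Secondary Contact Name
Mass spectrometer,Main Campus,equipment@example.ac.uk,
,,,A. N. Other
"""

# Row numbers start at 2 because row 1 of the spreadsheet is the header.
for number, row in enumerate(csv.DictReader(io.StringIO(SAMPLE)), start=2):
    for problem in check_row(row):
        print(f"row {number}: {problem}")
```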

Doin’ it on the cheap

Ensuring a sustainable service through autodiscovery

Sustainability via Autodiscovery

• How do we add new datasets?

• How are changes made?

• How do we know the data is open data?

Sustainability via Autodiscovery

• Have a machine-readable document describing the institution and any open datasets (with licences)

• Place a link to it on the institution’s homepage

/.well-known/openorg

http://www.soton.ac.uk/.well-known/openorg

or

<link rel="openorg" href="http://id.southampton.ac.uk/dataset/profile/latest">
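A crawler can look in both places. A minimal sketch using only the Python standard library: the function and class names are invented, trying the well-known path before the homepage link is an assumption (the slides do not give an order), and no validation of the returned document is attempted.

```python
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen


class OpenOrgLinkFinder(HTMLParser):
    """Remember the href of the first <link rel="openorg"> in a page."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "openorg" in rels and self.href is None:
            self.href = attrs.get("href")


def find_link(homepage_url, html):
    """Return the absolute URL of the openorg link in html, or None."""
    finder = OpenOrgLinkFinder()
    finder.feed(html)
    return urljoin(homepage_url, finder.href) if finder.href else None


def fetch(url):
    """Return the body of url as text, or None if it cannot be fetched."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (URLError, ValueError, OSError):
        return None


def discover_profile(homepage_url, fetch=fetch):
    """Try /.well-known/openorg first, then fall back to the homepage link."""
    well_known = urljoin(homepage_url, "/.well-known/openorg")
    if fetch(well_known) is not None:
        return well_known
    html = fetch(homepage_url)
    return find_link(homepage_url, html) if html else None


# Offline demonstration with a stand-in fetcher (a dict lookup): this made-up
# site has no well-known file, only a link on its homepage.
pages = {
    "http://www.example.ac.uk/":
        '<html><head><link rel="openorg" href="/profile.ttl"></head></html>',
}
print(discover_profile("http://www.example.ac.uk/", fetch=pages.get))
```

Passing the fetcher in as a parameter keeps the discovery logic testable without touching the network.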

What is an Organisation Profile Document?

An RDF document that describes the organisation:

– General information provided:

• Official name, postal address, contact phone number, the correct logo, physical location

– Links to the parts of the organisation:

• Admissions, Alumni, Freedom of Information, Complaints

– A semantic sitemap:

• Key pages such as jobs, news, events…

– Links to the organisation’s discoverable open data sets and APIs:

• The equipment dataset

Autodiscovery

• Dataset publicly available on website.

• Dataset has to be added manually, along with all the institution’s details, contacts etc.

Requires staff time (especially if any dataset changes location)

• Organisation has an OPD linking to dataset

• The OPD has to be added manually, but the dataset location and institution info are consumed directly from the OPD.

Requires less staff time (as any changes made to the OPD are picked up automatically)

• Link to OPD from organisation’s home page

• OPD autodiscovered, so the dataset is automatically added to the service.

Requires no staff time (as data is autodiscovered)

Never appeal to a man’s “better nature.” He may not have one.

Invoking his “self-interest” gives you more leverage.

- Robert Heinlein, “The Notebooks of Lazarus Long”

Status Report – Contributors and data statistics

Bronze  Silver  Gold    Criterion
  ✔       ✔      ✔      Data is on the internet and in an acceptable format.
          ✔      ✔      Description of dataset is provided by a remotely hosted OPD
                        The OPD is discovered via autodiscovery.
                        The OPD/dataset has a recognised and supported open licence (eg CC0, ODCA or OGL)
                        All items in the dataset are assigned an ID code which is unique within the assigning organisation.
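The levels can be read as a simple decision ladder. A minimal sketch, assuming (the table as reproduced here does not show it) that autodiscovery, an open licence and unique item IDs are all required for Gold; the class and field names are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class DatasetStatus:
    """One flag per criterion in the status table (field names are hypothetical)."""
    online_in_acceptable_format: bool
    described_by_remote_opd: bool
    opd_autodiscovered: bool
    open_licence: bool
    unique_item_ids: bool


def rating(status: DatasetStatus) -> str:
    """Return 'gold', 'silver', 'bronze' or 'none' for a contributor's dataset."""
    if not status.online_in_acceptable_format:
        return "none"
    if not status.described_by_remote_opd:
        return "bronze"
    # Assumption: all three remaining criteria are needed for gold.
    if status.opd_autodiscovered and status.open_licence and status.unique_item_ids:
        return "gold"
    return "silver"


# A spreadsheet on a web page, with nothing else in place.
print(rating(DatasetStatus(True, False, False, False, False)))
# A profile document exists but was registered by hand.
print(rating(DatasetStatus(True, True, False, True, True)))
# Everything in place, including the link from the homepage.
print(rating(DatasetStatus(True, True, True, True, True)))
```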

Exploiting profile documents

• We’ve barely begun

• Let’s try a live demo…

Warning: Metaphor mixing detected

Needless heterogeneity means research doesn’t join up.

Aligning datasets every time costs too much.

Tools can’t be reused

So what do we do about it?

Building easy-to-use tools to cross between formats, platforms and paradigms is very specialist work.

The solutions need to be discoverable.

Just putting it on GitHub is not making a tool discoverable!

https://github.com/cgutteridge/

Organisation Datasets

Well known formats available for:

• Events

• Publications

• News headlines

Nothing in common use for:

• Staff Expertise

• Programmes of Events

• Vacancies

• Organisational Structure

• Buildings, Rooms

• Points of service

• Products

– Food Menus

RDF or XML vocabularies don’t solve the problem by themselves.

You need:

Examples to copy.

Tools which consume and produce the format.

Online checking tools.

A dataset should solve at least one use case.

Over modelling is fun.

Stop it.

• TODO:

• OPD DOCUMENTATION

Thank-you.

Christopher Gutteridge

University of Southampton

@cgutteridge

cjg@ecs.soton.ac.uk

http://opd.data.ac.uk/