OpenML 2014

NETWORKED SCIENCE AND MACHINE LEARNING

JOAQUIN VANSCHOREN (TU/E), 2014

#OpenML

1610

GALILEO GALILEI DISCOVERS SATURN’S RINGS

‘SMAISMRMILMEPOETALEUMIBUNENUGTTAUIRAS’

How do you convince scientists to share their discoveries?

17TH CENTURY

JOURNAL SYSTEM

REPUTATION-BASED ECONOMY

NETWORKED SCIENCE TODAY

• Online scholarly tools

• Share data, code impossible to print in journals

• Collect, organise, analyse all data

• Collaborate in real time with hundreds of scientists

SCALING UP COLLABORATION

• Large-scale collaborations change the way we make discoveries

• Massively collaborative science

• Open data: mapping and mining

• Citizen science

DESIGNED SERENDIPITY

• Many scientists have complementary expertise

• Right expertise at the right time

• Ideas spark new ideas, questions get answered, data and tools reused in unexpected ways

• ‘Happy accidents’ common in large collaborations

DYNAMIC DISTRIBUTION OF LABOR

• Scientists have complementary skills: generate ideas, experiment, analyse, interpret

• Right skills, resources, time at the right time

• Dramatically speeds up progress

• What is impossibly hard for one scientist is routine for another

SCALING UP COLLABORATION

• Online tools: contribute any amount at any time

• Encourage small contributions

• Subtasks that can be attacked independently

• Rich, structured information commons

• Architecture of attention

• Honor code

How do you convince scientists to share their ideas, data, code?

MASSIVELY COLLABORATIVE SCIENCE

POLYMATHS

POLYMATH PROJECTS

• Designed serendipity

• Broadcast question hoping that many minds may find a solution

• “find myself having thoughts I would not have had without some chance remark of another contributor”

• Dynamic division of labor

• Throwing out ideas, criticising, testing ideas, synthesising, reformulating, coordinating,…

WHY SHARE IDEAS?

• Authorship: contributions clearly visible, self-reporting publication

• Visibility: earn respect from notable peers

• Scalability: over many projects, concentrate on where you have special insight and advantage

• Interaction: share ideas early (before others), ideas are quickly developed, corrected

OPEN DATA: SDSS

SLOAN DIGITAL SKY SURVEY


• Broadcast data, believing that many minds will ask unanticipated questions

• More data than a single person can comprehend: the challenge is asking the right questions


• Collect data, ask questions, mine the data

WHY SHARE DATA?

• Fame: releasing data yields more citations, because people are more likely to build on it

• Funding: sharing data increases value of research to community as a whole, increasing chances of continued funding

CITIZEN SCIENCE: GALAXY ZOO

GALAXY ZOO


• Unexpected observations reported on forum.

• Accidental discovery of new classes of objects: green pea galaxies, passive red spirals, Hanny’s Voorwerp


• Huge task subdivided in many small tasks which can be easily learned

WHY VOLUNTEER?

• Discovery: being the first to see a galaxy

• Progress: understanding the universe, beating cancer,…

• Fun: gamification

• Learning: learning more about a science/topic

• Community: meeting like-minded people

MACHINE LEARNING

• Good candidate for networked science

• Highly complex data, code, workflows, yet most work published in papers (graphs, pseudocode)

• Experiments are not shared online: impossible to build on prior work, start each time from scratch

• Low generalisability: studies contradict

• Low reproducibility: code, experiment details missing

• Place to share data in fine detail, and organise it to work more effectively, be more visible, collaborate, tackle hard problems

• Links to data available anywhere online, integrated in popular machine learning environments (WEKA, R, MOA, RapidMiner)

• Website to find data, code, results; discuss, compare, visualise

Data Tasks Flows Runs Studies
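
A minimal sketch of how these five concepts could link together, assuming a run records which flow was applied to which task, and a task points at a dataset. All class and field names here are illustrative assumptions, not OpenML's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    dataset_id: int
    name: str


@dataclass
class Task:
    task_id: int
    dataset_id: int      # the data the task is defined on
    task_type: str       # e.g. "classification" (illustrative value)


@dataclass
class Flow:
    flow_id: int
    name: str            # an algorithm or workflow


@dataclass
class Run:
    run_id: int
    task_id: int         # which task was solved
    flow_id: int         # which flow solved it
    uploader: str        # the user who shared the run
    score: float


@dataclass
class Study:
    name: str
    run_ids: list = field(default_factory=list)


def runs_on_dataset(runs, tasks, dataset_id):
    """Follow the run -> task -> dataset links to collect runs for one dataset."""
    task_ids = {t.task_id for t in tasks if t.dataset_id == dataset_id}
    return [r for r in runs if r.task_id in task_ids]


# Toy records, made up for illustration only.
tasks = [Task(1, dataset_id=10, task_type="classification"),
         Task(2, dataset_id=11, task_type="classification")]
runs = [Run(100, task_id=1, flow_id=7, uploader="alice", score=0.9),
        Run(101, task_id=2, flow_id=7, uploader="bob", score=0.8)]
print([r.run_id for r in runs_on_dataset(runs, tasks, 10)])
```

Because every run carries the identifiers of its task and flow, questions such as "which runs exist for this dataset?" reduce to following links rather than re-running experiments.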

Demo

DATA

FLOWS

TASKS

RUNS

RUNS: DATASETS

RUNS: FLOWS

UNEXPECTED

Plugins

WEKA PLUGIN

MOA PLUGIN

R PLUGIN

RAPIDMINER

1. OPERATOR TO DOWNLOAD TASK (TASK TYPE SPECIFIC)

2. SUBWORKFLOW THAT SOLVES THE TASK, GENERATES RESULTS

3. OPERATOR FOR UPLOADING RESULTS
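
The three steps above can be sketched as a download, solve, upload loop. This is only an illustration under stated assumptions: `SERVER`, `download_task`, `solve_task` and `upload_results` are hypothetical stand-ins written for this example, not RapidMiner operators or OpenML API calls, and the "solver" is a trivial majority-class predictor.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the server; toy data made up for illustration.
SERVER = {
    "tasks": {
        1: {
            "task_type": "classification",
            "train": [((5.1, 3.5), "a"), ((4.9, 3.0), "a"), ((6.2, 2.9), "b")],
            "test": [((5.0, 3.4), "a"), ((6.0, 3.0), "b")],
        }
    },
    "runs": [],
}


def download_task(task_id):
    """Step 1: fetch the task description, including its data splits."""
    return SERVER["tasks"][task_id]


def solve_task(task):
    """Step 2: solve the task; here, always predict the majority training label."""
    labels = [label for _, label in task["train"]]
    majority = Counter(labels).most_common(1)[0][0]
    return [majority for _ in task["test"]]


def upload_results(task_id, flow_name, predictions):
    """Step 3: share the predictions; the stand-in server scores them itself."""
    truth = [label for _, label in SERVER["tasks"][task_id]["test"]]
    accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(truth)
    run = {"task_id": task_id, "flow": flow_name, "accuracy": accuracy}
    SERVER["runs"].append(run)
    return run


task = download_task(1)
run = upload_results(1, "majority_class", solve_task(task))
print(run)
```

Scoring on the server side rather than trusting a client-reported number is one way to keep results from different tools comparable.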

OPENML CONNECT

• Library for Java

• Package for R

• In progress: Module for Python

• In progress: Command-line tools

FOR SCIENCE

DESIGNED SERENDIPITY

• ‘Impossible’ questions become possible by reusing prior experiments

• Answer routine questions in minutes

• Mine all collected results for patterns: meta-learning

• Browse all data for unexpected results

• Reuse code, data in novel ways
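
As a rough illustration of mining collected results for patterns, the sketch below finds the best flow per dataset and then tallies those wins by dataset size. The run records, flow names, dataset sizes and the size threshold are all made up for this example.

```python
from collections import defaultdict

# (dataset, flow, score) triples; toy values, not real results.
RUNS = [
    ("d1", "flow_a", 0.80), ("d1", "flow_b", 0.85),
    ("d2", "flow_a", 0.90), ("d2", "flow_b", 0.70),
    ("d3", "flow_a", 0.60), ("d3", "flow_b", 0.75),
]
# A simple dataset property to relate performance to (assumed values).
N_INSTANCES = {"d1": 100, "d2": 5000, "d3": 200}


def best_flow_per_dataset(runs):
    """Return {dataset: name of the highest-scoring flow}."""
    best = {}
    for dataset, flow, score in runs:
        if dataset not in best or score > best[dataset][1]:
            best[dataset] = (flow, score)
    return {dataset: flow for dataset, (flow, _) in best.items()}


def wins_by_size(best, sizes, threshold=1000):
    """Count how often each flow wins on small versus large datasets."""
    table = defaultdict(lambda: defaultdict(int))
    for dataset, flow in best.items():
        bucket = "small" if sizes[dataset] < threshold else "large"
        table[bucket][flow] += 1
    return {bucket: dict(counts) for bucket, counts in table.items()}


best = best_flow_per_dataset(RUNS)
print(wins_by_size(best, N_INSTANCES))
```

With many shared runs, the same grouping could be repeated over other dataset properties to learn when each technique tends to work.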

DYNAMIC DIVISION OF LABOR

• Scientists can focus attention on important problems by adding data, and collaborate with the community

• Large collaborations: OpenML organizes all results to follow progress

• Benchmark studies: only run algorithms you know well, reuse all other results

• Students, citizen scientists can contribute data, runs through plugins
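
The benchmark idea (run only what you know well, reuse the rest) can be sketched as a planning step that splits the needed task and flow pairs into results already shared and runs still to do. The function name, identifiers and scores below are hypothetical.

```python
def plan_benchmark(tasks, flows, shared_results):
    """Split the (task, flow) grid into reusable results and runs still needed."""
    needed = [(task, flow) for task in tasks for flow in flows]
    reused = {pair: shared_results[pair] for pair in needed if pair in shared_results}
    to_run = [pair for pair in needed if pair not in shared_results]
    return reused, to_run


# Results others already shared (toy values, made up for illustration).
shared = {("task_1", "flow_a"): 0.81, ("task_2", "flow_a"): 0.77}

reused, to_run = plan_benchmark(["task_1", "task_2"], ["flow_a", "my_flow"], shared)
print(to_run)
```

Only the pairs involving `my_flow` remain to be run; the baseline results come from the shared pool.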

EXAMPLE: META-QSAR PROJECT

• Large amounts of QSAR data available

• Not known which machine learning techniques are best

• OpenML used to try many algorithms and learn when to use which techniques

• Applications in fighting malaria

BEYOND JOURNALS

• Enriches research output, linked to papers

• Freely accessible

• Organized online

• Low threshold for students

• Continuously updated

• Immensely detailed

• Reproducible

• Stimulates online discussion

• Diminishes publication bias

SCALABILITY

• Easy to make small contributions: add data, code, run experiments using plugins, leave comments

• Split up complex studies: OpenML tasks

• Rich, structured data: all data, flows, runs, users linked. Keyword search, filters, SQL endpoint

• Data easily filtered: easy to focus on your interests

• Enforce scientific standards: task types, verifiability, server-side evaluations, clear attribution, honor code
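
To illustrate what linked, structured results make possible through an SQL endpoint, here is a sketch against a local SQLite database. The table and column names are assumptions invented for this example and do not reflect OpenML's real schema; the rows are toy values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task (task_id INTEGER PRIMARY KEY, task_type TEXT);
    CREATE TABLE flow (flow_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE run  (run_id INTEGER PRIMARY KEY, task_id INTEGER,
                       flow_id INTEGER, score REAL);
""")
conn.executemany("INSERT INTO task VALUES (?, ?)",
                 [(1, "classification"), (2, "classification"), (3, "regression")])
conn.executemany("INSERT INTO flow VALUES (?, ?)",
                 [(1, "flow_a"), (2, "flow_b")])
conn.executemany("INSERT INTO run VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 0.5), (2, 2, 1, 0.75), (3, 1, 2, 0.875), (4, 3, 2, 0.25)])


def mean_score_by_flow(conn, task_type):
    """Average score per flow, restricted to one task type, best first."""
    query = """
        SELECT f.name, AVG(r.score) AS mean_score
        FROM run AS r
        JOIN flow AS f ON f.flow_id = r.flow_id
        JOIN task AS t ON t.task_id = r.task_id
        WHERE t.task_type = ?
        GROUP BY f.name
        ORDER BY mean_score DESC
    """
    return conn.execute(query, (task_type,)).fetchall()


rows = mean_score_by_flow(conn, "classification")
print(rows)
```

The filter on `task_type` is what keeps the comparison meaningful: scores are only averaged over runs that solved the same kind of task.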

FOR SCIENTISTS

MORE TIME

• OpenML assists in most routinizable work:

• Find code and data online

• Set up, run & organize experiments

• Relate to state-of-the-art (benchmarks)

• Annotate code and data

• Full log of your research

• Keep control of your data, code, experiments

• Follow experiments on the go (mobile devices)

MORE KNOWLEDGE

• Your results linked to everybody else’s

• Larger, more general studies

• Answer more questions

• Mine all combined results

• Find unexpected results

• Interact with others on global scale, get help

• Collaborate with scientists from other fields

MORE CREDIT

• Citation

• OpenML attributes data, flows, runs, tells others how to cite it

• Easier for others to find

• Altmetrics: track how often your work is reused

• Productivity: contribute efficiently to many studies

• Visibility: collaborate, climb leaderboards, self-publish (tweet)

• Funding: convincing way to make data open

• No publication bias: unexpected results

FUTURE WORK

• OpenML studies: online representation of paper: data, code, runs, discussions,…

• Social layer: control visibility: public, friends, private

• Collaborative leaderboards: all top-3 contributors

• Discussion forum for unexpected results

• More data types, tasks

SPREAD THE WORD, WORK OPENLY

#OpenML
