Networked science in machine learning. OpenML: a collaborative online tool for sharing machine learning algorithms, data sets, experiments.
NETWORKED SCIENCE AND MACHINE LEARNING
JOAQUIN VANSCHOREN (TU/E), 2014
#OpenML
1610
GALILEO GALILEI DISCOVERS SATURN’S RINGS
‘SMAISMRMILMEPOETALEUMIBUNENUGTTAUIRAS’
How do you convince scientists to share their discoveries?
17TH CENTURY
JOURNAL SYSTEM
REPUTATION-BASED ECONOMY
NETWORKED SCIENCE TODAY
• Online scholarly tools
• Share data, code impossible to print in journals
• Collect, organise, analyse all data
• Collaborate in real time with hundreds of scientists
SCALING UP COLLABORATION
• Large-scale collaborations change the way we make discoveries
• Massively collaborative science
• Open data: mapping and mining
• Citizen science
DESIGNED SERENDIPITY
• Many scientists have complementary expertise
• Right expertise at the right time
• Ideas spark new ideas, questions get answered, data and tools reused in unexpected ways
• ‘Happy accidents’ common in large collaborations
DYNAMIC DISTRIBUTION OF LABOR
• Scientists have complementary skills: generate ideas, experiment, analyse, interpret
• Right skills, resources, time at the right time
• Dramatically speeds up progress
• What is impossibly hard for one scientist is routine for another
SCALING UP COLLABORATION
• Online tools: contribute any amount at any time
• Encourage small contributions
• Subtasks that can be attacked independently
• Rich, structured information commons
• Architecture of attention
• Honor code
How do you convince scientists to share their ideas, data, code?
MASSIVELY COLLABORATIVE SCIENCE
POLYMATHS
POLYMATH PROJECTS
• Designed serendipity
• Broadcast question hoping that many minds may find a solution
• “find myself having thoughts I would not have had without some chance remark of another contributor”
• Dynamic division of labor
• Throwing out ideas, criticising, testing ideas, synthesising, reformulating, coordinating,…
WHY SHARE IDEAS?
• Authorship: contributions clearly visible, self-reporting publication
• Visibility: earn respect from notable peers
• Scalability: over many projects, concentrate on where you have special insight and advantage
• Interaction: share ideas early (before others), ideas are quickly developed, corrected
OPEN DATA
SDSS
SLOAN DIGITAL SKY SURVEY
• Designed serendipity
• Broadcast data, believing that many minds will ask unanticipated questions
• More data than a single person can comprehend: the challenge is asking the right questions
• Dynamic division of labor
• Collect data, ask questions, mine the data
WHY SHARE DATA?
• Fame: releasing the data yields more citations: people more likely to build on it
• Funding: sharing data increases value of research to community as a whole, increasing chances of continued funding
CITIZEN SCIENCE
GALAXY ZOO
• Designed serendipity
• Unexpected observations reported on forum.
• Accidental discovery of new classes of objects: green pea galaxies, passive red spirals, Hanny’s Voorwerp
• Dynamic division of labor
• Huge task subdivided in many small tasks which can be easily learned
WHY VOLUNTEER?
• Discovery: being the first to see a galaxy
• Progress: understanding universe, beating cancer,…
• Fun: gamification
• Learning: learning more about a science/topic
• Community: meeting like-minded people
MACHINE LEARNING
• Good candidate for networked science
• Highly complex data, code, workflows, yet most work published in papers (graphs, pseudocode)
• Experiments are not shared online: impossible to build on prior work, start each time from scratch
• Low generalisability: studies contradict each other
• Low reproducibility: code, experiment details missing
• OpenML: a place to share data in fine detail, and organise it to work more effectively, be more visible, collaborate, tackle hard problems
• Links to data available anywhere online, integrated in popular machine learning environments (WEKA, R, MOA, RapidMiner)
• Website to find data, code, results; discuss, compare, visualise
Data Tasks Flows Runs Studies
Demo
DATA
FLOWS
TASKS
RUNS
RUNS: DATA SETS
RUNS: FLOWS
UNEXPECTED
Plugins
WEKA PLUGIN
MOA PLUGIN
R PLUGIN
RAPIDMINER
1. OPERATOR TO DOWNLOAD TASK (TASK TYPE SPECIFIC)
2. SUBWORKFLOW THAT SOLVES THE TASK, GENERATES RESULTS
3. OPERATOR FOR UPLOADING RESULTS
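The three steps above (download a task, solve it, upload the results) can be sketched in plain Python. This is only an illustration under stated assumptions: `Task`, `ToyServer`, `download_task`, `upload_run`, and `majority_class_flow` are hypothetical names invented here, the data is made up, and nothing below is the OpenML or RapidMiner API.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical, simplified task: data plus fixed train/test row indices."""
    task_id: int
    features: list    # one feature vector per row
    labels: list      # one class label per row
    train_rows: list
    test_rows: list


@dataclass
class ToyServer:
    """In-memory stand-in for a results server (an assumption, not the OpenML API)."""
    tasks: dict
    runs: list = field(default_factory=list)

    def download_task(self, task_id):
        return self.tasks[task_id]

    def upload_run(self, task_id, flow_name, predictions):
        # Score on the server side so every uploaded run is evaluated the same way.
        task = self.tasks[task_id]
        truth = [task.labels[i] for i in task.test_rows]
        correct = sum(p == t for p, t in zip(predictions, truth))
        run = {"task_id": task_id, "flow": flow_name,
               "accuracy": correct / len(truth)}
        self.runs.append(run)
        return run


def majority_class_flow(task):
    """Trivial 'subworkflow': predict the most common training label for every test row."""
    train_labels = [task.labels[i] for i in task.train_rows]
    majority = max(set(train_labels), key=train_labels.count)
    return [majority for _ in task.test_rows]


# Made-up toy task, for illustration only.
server = ToyServer(tasks={1: Task(task_id=1,
                                  features=[[0], [1], [2], [3], [4], [5]],
                                  labels=["a", "a", "b", "a", "a", "b"],
                                  train_rows=[0, 1, 2, 3],
                                  test_rows=[4, 5])})

task = server.download_task(1)                        # step 1: download the task
predictions = majority_class_flow(task)               # step 2: solve it, generate results
run = server.upload_run(1, "majority", predictions)   # step 3: upload the results
print(run)
```

Because the task fixes the train/test split and the server computes the score, runs uploaded by different people for the same task stay directly comparable.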
OPENML CONNECT
• Library for Java
• Package for R
• In progress: Module for Python
• In progress: Command-line tools
FOR SCIENCE
DESIGNED SERENDIPITY
• ‘Impossible’ questions become possible by reusing prior experiments
• Answer routine questions in minutes
• Mine all collected results for patterns: meta-learning
• Browse all data for unexpected results
• Reuse code, data in novel ways
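As a rough illustration of mining collected results for patterns, the sketch below counts which flow wins on small versus large data sets. The run records are made up for this example and are not real OpenML results; the field names, the `best_flow_per_dataset` and `wins_by_size` helpers, and the 1000-row threshold are all assumptions.

```python
from collections import defaultdict

# Made-up run records for illustration only; not real OpenML results.
runs = [
    {"dataset": "d1", "n_rows": 100,   "flow": "tree", "accuracy": 0.80},
    {"dataset": "d1", "n_rows": 100,   "flow": "knn",  "accuracy": 0.85},
    {"dataset": "d2", "n_rows": 200,   "flow": "tree", "accuracy": 0.70},
    {"dataset": "d2", "n_rows": 200,   "flow": "knn",  "accuracy": 0.75},
    {"dataset": "d3", "n_rows": 50000, "flow": "tree", "accuracy": 0.90},
    {"dataset": "d3", "n_rows": 50000, "flow": "knn",  "accuracy": 0.82},
]


def best_flow_per_dataset(runs):
    """Return {dataset: (flow, accuracy, n_rows)} for the top-scoring run on each data set."""
    best = {}
    for r in runs:
        current = best.get(r["dataset"])
        if current is None or r["accuracy"] > current[1]:
            best[r["dataset"]] = (r["flow"], r["accuracy"], r["n_rows"])
    return best


def wins_by_size(runs, threshold=1000):
    """Count wins per flow, split into 'small' and 'large' data sets by row count."""
    counts = defaultdict(lambda: defaultdict(int))
    for flow, _, n_rows in best_flow_per_dataset(runs).values():
        bucket = "small" if n_rows < threshold else "large"
        counts[bucket][flow] += 1
    return {bucket: dict(flows) for bucket, flows in counts.items()}


print(wins_by_size(runs))
```

With many shared runs, the same kind of grouping could be done over any data set property, which is the sort of question that is hard to answer from a single study.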
DYNAMIC DIVISION OF LABOR
• Scientists can draw attention to important problems by adding data, then collaborate with the community
• Large collaborations: OpenML organizes all results to follow progress
• Benchmark studies: only run algorithms you know well, reuse all other results
• Students, citizen scientists can contribute data, runs through plugins
EXAMPLE: META-QSAR PROJECT
• Large amounts of QSAR data available
• Not known which machine learning techniques are best
• OpenML used to try many algorithms and learn when to use which techniques
• Applications in fighting malaria
BEYOND JOURNALS
• Enriches research output, linked to papers
• Freely accessible
• Organized online
• Low threshold for students
• Continuously updated
• Immensely detailed
• Reproducible
• Stimulates online discussion
• Diminishes publication bias
SCALABILITY
• Easy to make small contributions: add data, code, run experiments using plugins, leave comments
• Split up complex studies: OpenML tasks
• Rich, structured data: all data, flows, runs, users linked. Keyword search, filters, SQL endpoint
• Data easily filtered: easy to focus on your interests
• Enforce scientific standards: task types, verifiability, server-side evaluations, clear attribution, honor code
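To make the idea of linked data, tasks, flows, and runs that can be filtered with SQL more concrete, here is a minimal sketch using an in-memory SQLite database. The table and column names are hypothetical and the rows are made up; this is not the actual OpenML schema or its SQL endpoint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: every run links to a task and a flow, every task to a data set.
conn.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE task    (id INTEGER PRIMARY KEY,
                          dataset_id INTEGER REFERENCES dataset(id),
                          task_type TEXT);
    CREATE TABLE flow    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE run     (id INTEGER PRIMARY KEY,
                          task_id INTEGER REFERENCES task(id),
                          flow_id INTEGER REFERENCES flow(id),
                          uploader TEXT,
                          accuracy REAL);
""")

# Made-up rows for illustration only.
conn.executemany("INSERT INTO dataset VALUES (?, ?)", [(1, "toy-a"), (2, "toy-b")])
conn.executemany("INSERT INTO task VALUES (?, ?, ?)",
                 [(1, 1, "classification"), (2, 2, "classification")])
conn.executemany("INSERT INTO flow VALUES (?, ?)", [(1, "tree"), (2, "knn")])
conn.executemany("INSERT INTO run VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, "alice", 0.8),
                  (2, 1, 2, "bob", 0.9),
                  (3, 2, 1, "alice", 0.7)])

# Filter: all runs on one data set, best first, with flow and uploader attached.
rows = conn.execute("""
    SELECT d.name, f.name, r.uploader, r.accuracy
    FROM run r
    JOIN task t    ON t.id = r.task_id
    JOIN dataset d ON d.id = t.dataset_id
    JOIN flow f    ON f.id = r.flow_id
    WHERE d.name = ?
    ORDER BY r.accuracy DESC
""", ("toy-a",)).fetchall()

for row in rows:
    print(row)
```

Keeping the uploader on each run is what makes clear attribution possible, and the joins are what let a single query move from a data set to every flow that has been tried on it.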
FOR SCIENTISTS
MORE TIME
• OpenML assists in most routinizable work:
• Find code and data online
• Set up, run & organize experiments
• Relate to state-of-the-art (benchmarks)
• Annotate code and data
• Full log of your research
• Keep control of your data, code, experiments
• Follow experiments on the go (mobile devices)
MORE KNOWLEDGE
• Your results linked to everybody else’s
• Larger, more general studies
• Answer more questions
• Mine all combined results
• Find unexpected results
• Interact with others on global scale, get help
• Collaborate with scientists from other fields
MORE CREDIT
• Citation
• OpenML attributes data, flows, runs, tells others how to cite it
• Easier for others to find
• Altmetrics: track how often your work is reused
• Productivity: contribute efficiently to many studies
• Visibility: collaborate, climb leaderboards, self-publish (tweet)
• Funding: convincing way to make data open
• No publication bias: unexpected results
FUTURE WORK
• OpenML studies: online representation of paper: data, code, runs, discussions,…
• Social layer: control visibility: public, friends, private
• Collaborative leaderboards: all top-3 contributors
• Discussion forum for unexpected results
• More data types, tasks
SPREAD THE WORD, WORK OPENLY
#OpenML