Data Ethics in Data Science Education
(plus: Science Data, Responsibly)
Bill Howe, University of Washington
05/03/2023 2
Plan
• context: eScience Institute (1 min)
• context: Data Science MOOC (3 min)
• Vignette on Teaching Data Ethics (5 min)
• Science Data, Responsibly (6 min)
  – Automated Curation
  – Viziometrics
Data, Responsibly @ Dagstuhl
• People
  – Research Staff (~4 100% Data Scientists, ~4 50% Research Scientists)
  – Postdocs (~12 at steady state)
  – Faculty (~9 Exec Committee, ~20 Steering Committee, ~100 Affiliates)
  – Administrative Staff (Program Managers, Finance, Admin)
• Programs
  – Short and long-term research, education programs (undergrad/masters/PhD), software, research consulting
  – Leadership on all things data science around campus
• Funding
  – $700k / yr permanent appropriation from the state of WA
  – $32.8M for 5 years jointly with NYU and UC Berkeley from the Gordon and Betty Moore Foundation and the Alfred P Sloan Foundation to build a “Data Science Environment”
  – $9M for 5 years from the Washington Research Foundation
  – $500k / yr from the Provost for half-lines for recruiting in relevant fields
Data Science Education
Audiences
• Students: CS/Informatics (undergrads, grads); Non-Major (undergrads, grads)
• Non-Students: professionals, researchers
Programs
• (2011) Data Science Certificate
• (2013) Data Science MOOC
• (2013) NSF IGERT Big Data PhD
• (2013) New CS Courses
• (2016) Data Science Masters
• (2015) Data Sci. for Social Good
Data Ethics being incorporated in all programs
Introduction to Data Science MOOC on Coursera
• Session 1, Spring 2013: 119,504 students
• Session 2, Summer 2014: 121,215 students
Participation numbers
• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9000 (typical for a MOOC)
• “Passed”: 7022
• Forum threads: 4661
• Forum posts: 22,900
Fairly consistent with Coursera data across “hard” courses
Define success however you want
  – Many love it in parts, start late, don’t turn in homework, etc.
  – Learning rather than watching television
Syllabus
• Data Science Landscape (~1 week)
• Data Manipulation at Scale
  – Relational Databases (~1 week)
  – MapReduce (~1 week)
  – NoSQL (~1 week)
• Analytics
  – Statistics Topics (~1 week)
  – Machine Learning Topics (~2 weeks)
• Visualization (~1 week)
• Graph Analytics (~1 week)
2015: MOOC Recast as a 4-course “Specialization”
• Data Manipulation at Scale: Databases, Systems, Algorithms
• Practical Predictive Analytics
  – Stats (resampling methods, multiple hypothesis testing, more)
  – ML (rules/trees/forests, ensembles/boosting/bagging, SVMs, GD, eval…)
• Communicating Data Science: Visualization, ethics and privacy
• Capstone
VIGNETTE ON TEACHING DATA ETHICS
Alcohol Study, Barrow Alaska, 1979
Native leaders and city officials, worried about drinking and associated violence in their community, invited a group of sociology researchers to assess the problem and work with them to devise solutions.
Methods
• 10% representative sample (N=88) of everyone over the age of 15 using a 1972 demographic survey
• Interviewed on attitudes and values about use of alcohol
• Obtained psychological histories including drinking behavior
• Given the Michigan Alcoholism Screening Test (Seltzer, 1971)
• Asked to draw a picture of a person
  – Used to determine cultural identity
Results announced unilaterally and publicly
At the conclusion of the study, the researchers wrote a report entitled “The Inupiat, Economics and Alcohol on the Alaskan North Slope,” which was released simultaneously to the press and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled “Alcohol Plagues Eskimos.”
The results of the Barrow Alcohol Study in Alaska were revealed in the context of a press conference that was held far from the Native village, and without the presence, much less the knowledge or consent, of any community member who might have been able to present any context concerning the socioeconomic conditions of the village. Study results suggested that nearly all adults in the community were alcoholics. In addition to the shame felt by community members, the town’s Standard and Poor bond rating suffered as a result, which in turn decreased the tribe’s ability to secure funding for much needed projects.
Backlash
Methodological Problems
“The authors once again met with the Barrow Technical Advisory Group, who stated their concern that only Natives were studied, and that outsiders in town had not been included.”
“The estimates of the frequency of intoxication based on association with the probability of being detained were termed "ludicrous, both logically and statistically.””
Edward F. Foulks, M.D., Misalliances In The Barrow Alcohol Study
Ethical Problems
• Participants were not in control of their data nor the context in which they were presented.
• Easy to demonstrate specific, significant harms:
  – Social: Stigmatization
  – Financial: Bond rating lowered
• Important: Nothing to do with individual privacy
  – No PII revealed at any point, to anyone
  – No violations of best practices in data handling
  – But even those who did not participate in the study incurred harm
Two Topics
• Social Component: Codes of Conduct
• Technical Component: Managing Sensitive Data
Ethical principles vs. ethical rules
• In the Barrow example, ethical rules were generally followed
• But ethical principles were violated: The researchers appear to have placed their own interests ahead of those of the research subjects, the client, and society
Principles: Codes of Conduct
• American Statistical Association
  – http://www.amstat.org/committees/ethics/
• Certified Analytics Professional
  – https://www.certifiedanalytics.org/ethics.php
• Data Science Association
  – http://www.datascienceassn.org/code-of-conduct.html
SCIENCE DATA, RESPONSIBLY
Science is a complete mess
• Reproducibility
  – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
  – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015)
  – Ioannidis 2005: Why Most Published Research Findings Are False
  – Reinhart & Rogoff: global economic policy based on spreadsheet fuck-ups
Science, 2015
Retractions are increasing…
Science is a complete mess
• Fraud
  – Diederik Stapel: 38 articles with fictitious data
  – Bharat Aggarwal: a huge number of images with evidence of manipulation
Bharat Aggarwal: alleged data manipulation
Science is a complete mess
• Public Trust
  – Churn: Chocolate, egg yolks, red meat, red wine, etc.
  – Climate change, vaccines
Vision: Validate scientific claims automatically
  – Check for manipulation (manipulated images, Benford’s Law)
  – Extract claims from papers
  – Check claims against the authors’ data
  – Check claims against related data sets
  – Automatic meta-analysis across the literature + public datasets
• First steps
  – Automatic curation: Validate and attach metadata to public datasets
  – Longitudinal analysis of the visual literature
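One way the Benford's Law check named in the vision could look, as a minimal sketch rather than the project's actual method: compare the observed leading-digit counts of a reported dataset against the expected frequencies log10(1 + 1/d) with a chi-square statistic. The function names, the sample data, and the 15.507 cutoff (the chi-square critical value for 8 degrees of freedom at the 5% level) are illustrative assumptions.

```python
import math
from collections import Counter


def leading_digit(x):
    """Return the first significant digit (1-9) of a nonzero finite number."""
    # In scientific notation the first significant digit is the first character.
    return int(f"{abs(x):.15e}"[0])


def benford_expected(d):
    """Benford's Law probability that the leading digit equals d."""
    return math.log10(1 + 1 / d)


def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one nonzero value")
    n = len(digits)
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * benford_expected(d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat


# Assumed cutoff: chi-square critical value, 8 degrees of freedom, 5% level.
CRITICAL_5PCT_8DOF = 15.507

# Hypothetical inputs: powers of two track Benford's Law closely,
# while the integers 100..999 have perfectly uniform leading digits.
powers_of_two = [2 ** k for k in range(1, 201)]
uniform_digits = list(range(100, 1000))

for name, data in [("powers of two", powers_of_two), ("100..999", uniform_digits)]:
    stat = benford_chi_square(data)
    verdict = "suspicious" if stat > CRITICAL_5PCT_8DOF else "consistent with Benford"
    print(f"{name}: chi-square = {stat:.2f} ({verdict})")
```

A large statistic is only a flag for closer inspection: many legitimate datasets (bounded measurements, assigned identifiers) do not follow Benford's Law at all.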
“DEEP” CURATION
Science Data, Responsibly
Microarray experiments
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the bottleneck to data sharing
Maxim Gretchkin
Hoifung Poon
color = labels supplied as metadata
clusters = 1st two PCA dimensions on the gene expression data itself
Can we use the expression data directly to curate algorithmically?
The expression data and the text labels appear to disagree
Better Tissue Type Labels
[Diagram labels: Domain knowledge (Ontology); Expression data; Free-text Metadata; 2 Deep Networks (text, expr); SVM]
Deep Curation
Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other's results.
[Diagram labels: Free-text classifier; Expression classifier]
Deep Curation: Our stuff wins, with no training data
[Chart: series are the state of the art, our reimplementation of the state of the art, and our dueling pianos NN; axis is the amount of training data used]
VIZIOMETRICS: COMPREHENDING VISUAL INFORMATION IN THE SCIENTIFIC LITERATURE
Human-Data Interaction
Step 1: Dismantling Composite Figures
Poshen Lee
ICPRAM 2015
Do high-impact papers have fewer equations, as indicated by Fawcett and Higginson? (Yes)
Poshen Lee, Jevin West
[Figure panels: high impact papers; low impact papers]
Do high-impact papers have more diagrams? (Yes)
TEACHING DATA ETHICS IN DATA SCIENCE
Lectures
• Data Science Context and Case Studies (~1 week)
• Data Management at Scale
  – Relational Databases (~1 week)
  – MapReduce (~1 week)
  – NoSQL (~1 week)
• Topics in Analytics
  – Permutation Methods, Bayesian Methods (~1 week)
  – Machine Learning Algorithms and Evaluation (~1 week)
• Visualization (~1 week)
• Graph Analytics (~1 week)
• Guest Lectures
Who took the course?
What programming language do you typically use?
Attrition, video lectures
[Chart: number of students watching videos by segment, ordered by time; vertical axis from 0 to 50,000]
Attrition, assignments
[Chart: number of students completing assignments by part; parts are 1 to 6, Database 1 to 9, MapReduce 1 to 6, Kaggle, Tableau; vertical axis from 0 to 18,000]
Who took the course?
In a directory with 1000 text files, you are asked to create a list of files that contain the word Drosophila
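For readers who want to see what an answer to this survey question might look like, here is a minimal sketch. It assumes the files are UTF-8 text with a `.txt` extension and that a plain substring match is what is wanted; the function name `files_containing` and the directory name in the usage lines are hypothetical.

```python
from pathlib import Path


def files_containing(directory, word, pattern="*.txt"):
    """Return the sorted names of files in `directory` whose text contains `word`."""
    matches = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going past undecodable bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        if word in text:
            matches.append(path.name)
    return matches


if __name__ == "__main__":
    # "papers" is a placeholder directory name.
    if Path("papers").is_dir():
        for name in files_containing("papers", "Drosophila"):
            print(name)
```

At a Unix shell, `grep -l Drosophila *` produces the same kind of list in one line, which is part of what a question like this probes.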
Who took the course?
What if you were given a billion documents spread across many computers and asked to count the occurrences of a given phrase?
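This second question is the kind of problem the MapReduce unit of the course addresses. The sketch below shows only the shape of a solution, under loud assumptions: each "machine" is simulated as an in-memory list of documents, occurrences are counted as non-overlapping substring matches, and all function names are hypothetical. A real deployment would run the map step on each machine in parallel and ship only the partial counts over the network.

```python
def map_count(document, phrase):
    """Map step: count non-overlapping occurrences of `phrase` in one document."""
    return document.count(phrase)


def count_partition(documents, phrase):
    """Work done locally on one machine: sum the counts over its documents."""
    return sum(map_count(doc, phrase) for doc in documents)


def count_phrase(partitions, phrase):
    """Reduce step: add up the partial counts reported by every machine."""
    partial_counts = [count_partition(docs, phrase) for docs in partitions]
    return sum(partial_counts)


# Hypothetical data: three "machines", one of which holds no documents.
partitions = [
    ["big data is big", "data science"],
    ["big data"],
    [],
]
total = count_phrase(partitions, "big data")
print(total)
```

Because addition is associative and commutative, the partial counts can be combined in any order, which is what lets the work spread across many computers.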
“I left the company I co-founded in 2005 to do data analytics with Wibidata, with whom I was introduced as a result of their guest lecture in your course.”