Computational Social Science Data Gathering Analyzing content Conclusion
Building the Open Computational Social Science Toolbox
Wouter van Atteveldt
June 2019
CSS: What? Why? How?
CSS & Societal Resilience
• What, why, how?
• Building the Toolchain:
  • Data gathering: scraping & tracking
  • Analysis: Text and beyond
• Open Science and Research Transparency
Computational Communication Research
Welcoming your submissions!
Computational Social Science
• Our life is increasingly online
• Leaving ’digital traces’
• Which can be analysed to study social behaviour
(Lazer et al., 2009, Science)
Why now?
• Explosive increase in available data, tools, processing
  • (and much "big data" is communicative)
• Potential radical boost to study of communication
• But has numerous problems, challenges, pitfalls
(Van Atteveldt & Peng, 2018)
CSS: challenges & pitfalls
• Accessibility of data
• Representativeness/validity of ’found’ data
• Validity of computational methods
• Ethical conduct
• Skills & Infrastructure
Elements of the ’microscope’
• Data: How do we get the ’digital traces’?
• Analysis: From (textual) traces to data
• Open Science: Resilient Science?
Goals and challenges
Why do we need digital trace data?
Fragmentation of information
• Minimal mass media effects?
• TV as last homogeneous medium, demographically challenged?
Specific effects of online communication
• E.g. fear of online filter bubbles / polarisation
Ideal goal
• Message consumption/production data
• Full text and metadata
• Of representative sample of population
  • and/or fully connected subsample(s)
• Linked with attitude/behaviour measures
Getting text: Technical challenges
• News: can be scraped, retrieved via Nexis etc
• Paywalls make it more difficult
• Social media companies try to block scraping
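As a rough illustration of what "scraping" news involves once a page has been retrieved, the sketch below pulls headline text out of an HTML string using only the Python standard library. The sample page and the assumption that headlines sit in <h1> elements are invented for the example; real news sites differ, and retrieval, paywalls and terms of use are not covered.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text inside every <h1> element (assumed headline tag)."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open headline
        if self._in_h1:
            self.headlines[-1] += data


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headlines]


# Hypothetical page content, standing in for a downloaded article
SAMPLE_PAGE = "<html><body><h1> Interest rates hit zero </h1><p>Body text.</p></body></html>"
headlines = extract_headlines(SAMPLE_PAGE)
```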
"We are in the post-API age"
(Deen Freelon, PolComm, forthcoming)
Getting text: Legal challenges
• Copyright/database law can block scraping, blocks sharing
• Contract law can block scraping of restricted content
• Hacking laws might make scraping actually illegal
• Laws are uncertain and vary over jurisdictions/time
• Many researchers are anarchists, many institutions cautious
(IANAL!)
Gathering digital trace data
Desktop browsing:
• Desktop plugin (e.g. ASCoR personalized communication)
• History donation (e.g. Web Historian; Menchen-Trevino)
Mobile app use
• App can access (e.g. MobileDNA)
Mobile phone logs
• App can access (e.g. Kobayashi & Boase)
Mobile news browsing
• problem: how to get it :-)
CCS.Amsterdam: Mobile news tracking
Mobile news viewing: Challenges
We want to know what people see on their mobile, but
• Most mobile browsers don’t allow plugins
• HTTPS/encrypted app communication makes proxy/MITM difficult
  • esp. combined with certificate pinning
• Most apps have in-app browsing; proprietary protocols
Mobile news viewing: possibilities
1 Browser sync + desktop plug-in / application
  • Can build on browser plugins
2 GDPR requests by user
  • No facility to make request on behalf of user
  • Instructions needed for each app
  • (need something akin to FSD)
CCS.Amsterdam: Tracking the filter bubble
• NWO-Joint Escience Data Science (JEDS) programme
• Goal: develop mobile tracking, analyse effects of mobile news on attitudes
• Team:
  • Social science: me, Damian Trilling (UvA), Judith Moller (UvA), Felicia Locherbach (VU)
  • Engineering: Antske Fokkens (VU), Laura Hollink (CWI), Jisk Attema (NLeSC), Laurens Bogaardt (NLeSC)
  • Law/normative theory: Natali Helberger (UvA)
(See Van Atteveldt et al., ICA 2019; and ICA postconf)
The need for content analysis
• Trace data often partially unstructured/symbolic (text, speech, image, video)
• Need to convert ’text to data’
  • Measure relevant quantities
  • In a valid and scalable way
• Focus on text
  • But really cool stuff is happening with image analysis, see e.g. ICA pre-conf & panel!
What are the relevant quantities?
Depends on RQ, but often see message as (collection of) statement(s):
• source
• topic and/or target
• tone/sentiment
This can yield measurement per message, or per statement
• Per message is easier, but many texts contain multiple statements
• Construct semantic network from text
  • (Core Sentence approach; political claims analysis; NET)
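One way to picture the "statement" view of a message is as a small record of source, target and tone, which can then be aggregated into a network. The sketch below is only an illustration under that assumption: the `Statement` class, the tone scale from -1 to +1 and the example statements are all hypothetical and not taken from the approaches named above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    source: str  # who is speaking or acting
    target: str  # topic or actor the statement is about
    tone: float  # assumed scale: -1 (negative) to +1 (positive)


def build_network(statements):
    """Return the average tone for every (source, target) edge."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s in statements:
        edge = (s.source, s.target)
        sums[edge] += s.tone
        counts[edge] += 1
    return {edge: sums[edge] / counts[edge] for edge in sums}


# Invented statements, as if extracted from a single news article
statements = [
    Statement("Party A", "economy", 1.0),
    Statement("Party A", "economy", 0.0),
    Statement("Party B", "Party A", -1.0),
]
network = build_network(statements)
```

A per-message measure would collapse these three statements into one score, whereas the network keeps the two edges apart.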
What techniques do we need?
• Identifying actors: (easiest)
  • Dictionaries, Named Entity Recognition, Coreference resolution
• Identifying issues/topics: (doable)
  • "Automatic text classification"
  • Dictionaries, (Structural) Topic modeling, Supervised machine learning
• Identifying tone/sentiment: (hard!)
  • "Sentiment analysis"/"Opinion extraction"
  • Dictionaries, Supervised machine learning
• From text to statements
  • "Semantic Role Labeling" (sort of)
  • Syntactic analysis, Supervised machine learning
CCS.Amsterdam: Sentiment Analysis
CCS.Amsterdam: Text analysis
• Multiple projects on sentiment analysis, syntactic analysis, deep learning, crowd coding, topic modeling, etc
• Members (i.a.): Damian Trilling (UvA), Anne Kroon (UvA), Kasper Welbers (VU), Antske Fokkens (VU)
The importance of sentiment analysis
• Many theories in (political) communication connected to tone
  • Issue positions
  • Negative campaigning
  • Conflict news
  • Reviews, reputation, etc
• Tone is notoriously hard to define & measure (automatically)
  • Ambiguous
  • Creative
Test case: Dutch economic news
• Is news positive or negative about economy?
• Interesting for retrospective voting, framing, news bias
• Should be ‘best-case’ scenario for automatic analysis
  • Relatively unambiguous
  • Relatively factual
• RQ: Can we automatically measure the tone of economic news?
How do off-the-shelf dictionaries do?
• Mark Boukes et al., ICA 2018
• Compare undergrad coders with existing dictionaries
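To make concrete what a dictionary approach does, here is a minimal sketch: count positive and negative words and take the normalised difference. The two word lists are hypothetical toy lists, not any of the off-the-shelf dictionaries evaluated in the study, and the scoring formula is one common choice among several.

```python
import re

# Hypothetical toy word lists, for illustration only
POSITIVE = {"growth", "recovery", "profit", "rises"}
NEGATIVE = {"crisis", "loss", "unemployment", "falls"}


def dictionary_tone(text):
    """Return (pos - neg) / (pos + neg), or 0.0 when no dictionary word occurs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)


# "Unemployment falls" is good news, but both words are in the negative list
misleading = dictionary_tone("Unemployment falls as growth returns")
```

With these toy lists, the headline about falling unemployment scores as negative, which hints at why word counting can struggle even on relatively factual economic news.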
What else can we try?
Compare (triple-coded) gold standard with:
• Undergrads
• Dictionaries
• Crowd coding
• (translation + dictionaries)
• Machine learning
Gold standard
• Selected ~300 headlines from ICR sample
• Coded independently by me, Mark, Mariken van der Velden (α=.78)
• All differences resolved except some remaining disagreements
  • (E.g. “Interest rates hit zero”, “Greece will be fine for a couple more weeks”, “Aging population puts brake on house prices”)
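The α reported above is presumably Krippendorff's alpha. As a sketch of how such a coefficient is computed for nominal labels, the function below builds the coincidence matrix and compares observed with expected disagreement. It assumes nominal (unordered) categories, which may differ from the metric used in the study, and the toy data are invented.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one sequence of labels per coded item (one label per coder,
    None for a missing code).
    """
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # items with fewer than two codes carry no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed * (n - 1) / expected


# Invented example: four headlines, two coders each
toy_units = [("pos", "pos"), ("pos", "neg"), ("neg", "neg"), ("neg", "neg")]
alpha = krippendorff_alpha_nominal(toy_units)
```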
Crowd coding
• Crowd coding promising solution for sentiment coding
  • Decision is simple / “intuitive”
  • More cheap coders > Fewer better coders
• Method:
  • Same n~300 sentences
  • Each sentence coded by ~5 coders
  • ($.02 per sentence per coder, <$50 total)
  • Simple instructions, single question
  • Use gold questions to filter coders
[note: current results based on subset of data]
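The method above can be sketched as two steps: drop coders who fail the gold questions, then take a majority vote per sentence. The accuracy threshold of 0.8, the coder and item names and the judgments are assumptions made up for this example; the slides do not state how coders were filtered or how votes were combined. The last lines check the cost arithmetic in integer cents.

```python
from collections import Counter, defaultdict


def trusted_coders(gold_answers, judgments, min_accuracy=0.8):
    """Coders whose accuracy on the gold questions reaches min_accuracy (assumed threshold)."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for coder, item, label in judgments:
        if item in gold_answers:
            seen[coder] += 1
            correct[coder] += label == gold_answers[item]
    return {coder for coder in seen if correct[coder] / seen[coder] >= min_accuracy}


def majority_labels(judgments, coders):
    """Most frequent label per item among the given coders (ties: first label seen wins)."""
    votes = defaultdict(Counter)
    for coder, item, label in judgments:
        if coder in coders:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


# Invented judgments: (coder, item, label); "g1" is a gold question
gold_answers = {"g1": "pos"}
judgments = [
    ("c1", "g1", "pos"), ("c2", "g1", "pos"), ("c3", "g1", "neg"),
    ("c1", "h1", "neg"), ("c2", "h1", "neg"), ("c3", "h1", "pos"),
    ("c1", "h2", "pos"), ("c2", "h2", "pos"), ("c3", "h2", "neg"),
]
coders = trusted_coders(gold_answers, judgments)
labels = majority_labels(judgments, coders)

# Cost check: ~300 sentences x ~5 coders x 2 cents = 3000 cents, i.e. $30 (< $50)
total_cents = 300 * 5 * 2
```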
Machine learning
• Train on 6,203 manually coded headlines
• Test on gold sample
• Compare:
  • ’Traditional’ SVM on document-lemma matrix
  • ’Deep learning’ Convolutional Neural Network with word embeddings
Problems with classical machine learning
• Data scarcity
  • Never enough coded data available
  • More parameters than cases
  • Words not in training material have unknown ’meaning’
• Simplistic representation
  • Bag of words
    • "it wasn’t bad, it was actually quite good"
    • "it wasn’t good, it was actually quite bad"
• Richer features increase data scarcity (and require domain/NLP knowledge)
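The two example sentences can be checked directly: under a bag-of-words representation they are indistinguishable, because only word counts are kept. The tokeniser below is a simplistic assumption (lowercase letters and apostrophes).

```python
import re
from collections import Counter


def bag_of_words(text):
    """Word counts only; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


sentence_a = "it wasn't bad, it was actually quite good"
sentence_b = "it wasn't good, it was actually quite bad"

# Opposite meanings, identical representation
same_representation = bag_of_words(sentence_a) == bag_of_words(sentence_b)
```

Any classifier trained on these counts must give both sentences the same score, whatever the training data.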
Solution part 1: Word embeddings
• Bag of words treats each word as unique
• Similar words can be treated similarly
• Treat each word as (relatively small) vector of scores, so:
  • unseen word can be interpolated
  • fewer parameters need to be trained
• Embedding vectors based on (very large) uncoded text collection
  • Amsterdam Embedding Model trained on >10M news articles (Kroon et al, ICA 2019)
  • Trained to maximize prediction of each word based on context window
Note: essentially dimensionality reduction similar to factor analysis, topic modeling, latent semantic indexing (but more effective due to training on more data and using explicit contexts)
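The idea that similar words get similar vectors can be illustrated with cosine similarity. The three-dimensional vectors below are made up for the example; real embeddings such as the model mentioned above have far more dimensions and are learned from text, not written by hand.

```python
import math

# Made-up 3-dimensional vectors, for illustration only
EMBEDDINGS = {
    "growth": [0.9, 0.1, 0.3],
    "recovery": [0.8, 0.2, 0.4],
    "crisis": [-0.7, 0.6, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def nearest(word, candidates):
    """Candidate whose vector is most similar to the vector of `word`."""
    return max(candidates, key=lambda c: cosine(EMBEDDINGS[word], EMBEDDINGS[c]))


# A word absent from the coded training data still has a vector,
# so a model can treat it like its nearest known neighbour
closest_to_recovery = nearest("recovery", ["growth", "crisis"])
```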
Solution part 2: "Deep Learning"
• Deep learning builds richer features as part of training
• Possibility to use context of words
• See: Yoav Goldberg, Neural Network Methods for Natural Language Processing; Anne Kroon et al, 2019 ICA on Dutch embeddings.
• This study: Convolutional Neural Network
  • Method originating from image analysis
  • N-grams of word representations are concatenated, pooled per unit
  • Pooled output is then used as input for regular learning
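The concatenate-and-pool step can be sketched for a single convolutional filter in plain Python. The word vectors and filter weights below are hand-picked for illustration; in an actual network the weights are learned, there are many filters, and the ReLU activation is an assumption of this sketch. The text must contain at least n words.

```python
def ngram_windows(vectors, n):
    """Concatenate the vectors of every run of n consecutive words."""
    return [sum(vectors[i:i + n], []) for i in range(len(vectors) - n + 1)]


def conv_feature(vectors, weights, bias=0.0):
    """One filter: ReLU of the weighted sum per window, then max pooling over windows."""
    n = len(weights) // len(vectors[0])  # n-gram size implied by the filter width
    activations = [
        max(0.0, sum(w * x for w, x in zip(weights, window)) + bias)
        for window in ngram_windows(vectors, n)
    ]
    return max(activations)


# Three words with made-up 2-dimensional embeddings
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hand-picked bigram filter: first dimension of word 1, second dimension of word 2
bigram_filter = [1.0, 0.0, 0.0, 1.0]

windows = ngram_windows(word_vectors, 2)
pooled = conv_feature(word_vectors, bigram_filter)
```

The pooled value is one feature for the whole headline; stacking such features across filters gives the input for the final, regular classification layer.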
Convolutional Neural Network
Results
• How do all methods compare to gold standard?
• How do methods correlate with each other?
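Both questions come down to simple comparisons of label vectors. The sketch below computes accuracy against a gold standard and the Pearson correlation between two methods. All numbers are invented toy values, not results from the study, and the coding of tone as -1/0/+1 is an assumption.

```python
import math


def accuracy(gold, predicted):
    """Share of items where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented toy labels (-1 negative, 0 neutral, +1 positive), NOT study results
gold = [1, -1, 0, 1, -1]
method_a = [0, -1, 0, -1, 0]
method_b = [1, -1, 0, 1, 0]

accuracy_a = accuracy(gold, method_a)
accuracy_b = accuracy(gold, method_b)
agreement_between_methods = pearson(method_a, method_b)
```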
Open Science: Sharing & Transparency
What is open science?
• Research Transparency:
  • data access
  • transparent design
  • analytical transparency
• Openness boosts:
  • Reproducibility
  • Robustness
  • Replicability
  • Generalizability
(e.g. Bowman & Keene, 2018; Klein et al., 2018; Munafò et al., 2017; Nosek et al., 2015)
Why open CSS?
Opportunity:
• Data and tools are digital
• Culture of open source, sharing
• Possibility for reproducible research
Need:
• Big data can be abused easily
• Skills and tools can be scarce
• Strong need for openness: share, inspect, improve
Open CSS: Challenges
Data sharing challenges:
• Fear of being scooped
• Proprietary data, copyright, legal uncertainty
• Not enough incentives
Code/Tool sharing challenges:
• Fear of being caught making mistakes
• Effort required to turn research code into software
• Not enough incentives
(Van Atteveldt et al., 2019, IJoC; Van Atteveldt et al., 2019, CCR)
What’s next? Towards an open CSS
• Building the toolchain, doing the research
• Focus on validity, usability, re-usability:
  • Sharing the tools
  • Sharing the data
  • Sharing the results
  • Sharing the skills
• This effort needs to be collaborative!
• See https://github.com/ccs-amsterdam/
Conclusion
• Computational Social Science offers great promise to study societal resilience
• Need to overcome specific challenges
  • Acquiring data
  • Building tools
  • Growing skills (& institutional incentives)
• Requires Transparent, Open, Collaborative research to thrive
Some links
• https://vanatteveldt.com (this talk, publications)
• https://ccs.amsterdam (project descriptions)
• https://github.com/ccs-amsterdam (code)
• https://computationalcommunication.org (submit & first issue!)