1 | 15-09-2016
Big data: abundance and unease
Kees Aarts, SWR conference, 16-17 September 2016
Contents › KNAW exploratory committee › What is big data? › Big data and research methodology › Areas of tension › The future
Big Data Committee. Established September 2015. The committee has two tasks:
• carry out a broad exploration of the effects of 'big data' on scientific research, with the emphasis on fields of science that work with persons
• prepare a KNAW advisory report on a few selected topics.
Approach › Discussion meetings with focus groups:
§ Researchers working with big data § Computer science specialists § (still to come) Early-career researchers working with big data
Dutcher (2014)
What Is Big Data? - Blog https://datascience.berkeley.edu/what-is-big-data/
A vaguely delineated concept › Big data: what is 'big'? The three v's (volume, velocity, variety)
› Related but distinct terms
§ Data science § E-science (e-humanities) § Computational social science § Data-driven research § Open access, open data, open science
Volume, velocity › Camera footage, GPS data, social media (Twitter; Hosch-Dayican et al. 2014), web search behavior
subsample of the text documents. The hand-coded subsample is then used as a training set to classify and code the rest of the documents automatically in the second step. This "supervised learning approach" to classification has several advantages over automated methods based on the use of dictionaries. First of all, the need for a clear coding scheme urges researchers to develop clear definitions of concepts to be measured and studied. Second, supervised learning methods are easier to validate, with clear statistics that summarize model performance. Third, the probability of misclassification of text that does not contain straightforward language, such as tweets with sarcastic contents, can be reduced due to the use of human coders (see Grimmer & Stewart, 2013; Hopkins & King, 2010).
For taking the first step, a random subsample corresponding to approximately 1% of all the tweets was drawn from the corpus. Four coders were appointed to manually code this subsample independent of each other using a coding scheme that was developed by the authors. All coded variables were then tested for intercoder reliability using Krippendorff's α, the result of which showed a high level of agreement implying that the coding scheme is dependable.
The hand-coded data were then used as a training set to code the rest of the tweets. In order to classify the text, we implemented a naive Bayes classifier (in PHP). To improve the performance of the classifier, we used unigrams as well as bigrams. We also removed common Dutch stop words and used word stemming in order to deal with only the stems of the words. With this classifier, we first classified the tweets on whether they were related to politics. For this, two sets of 388 tweets were used for training purposes. One set consisting of politically related tweets and the other consisting of tweets not related to politics. We used 10-fold cross-validation to arrive at the values
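A minimal sketch of the pipeline the excerpt describes — stop-word removal, unigram plus bigram features, multinomial naive Bayes with Laplace smoothing. The stop-word list and the tweets below are invented placeholders; the authors' own implementation (in PHP, with stemming and far larger training sets) is not reproduced here:

```python
import math
from collections import Counter

DUTCH_STOP_WORDS = {"de", "het", "een", "en", "van", "ik"}  # illustrative subset

def features(text):
    """Tokenize, drop stop words, return unigrams plus bigrams."""
    tokens = [t for t in text.lower().split() if t not in DUTCH_STOP_WORDS]
    return tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]

class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.priors = Counter(labels)                     # class frequencies
        self.counts = {c: Counter() for c in self.priors}  # feature counts per class
        for text, label in zip(texts, labels):
            self.counts[label].update(features(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, text):
        def log_post(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.priors[c] / sum(self.priors.values()))
            for f in features(text):
                lp += math.log((self.counts[c][f] + 1) / total)
            return lp
        return max(self.priors, key=log_post)

# Toy training set (hypothetical tweets), labelled political / other
train = [
    ("stem morgen op onze partij", "political"),
    ("debat over verkiezingen vanavond", "political"),
    ("lekker weer vandaag", "other"),
    ("nieuwe film gezien gisteren", "other"),
]
clf = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
print(clf.predict("morgen verkiezingen en debat"))
```

In the actual study the classifier is first trained on the politics/non-politics split and evaluated with 10-fold cross-validation; the sketch only shows the core scoring step.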
[Figure 1. An overview of the nested structure of the variables: tweets on the Dutch elections 2012 are split into "no electoral campaigning" versus "electoral campaigning"; electoral campaigning is further split by type of campaigning into persuasive campaigning and negative campaigning.]
Table 2. The Precision, Recall, and Accuracy of the Classifier for Predicting if a Tweet Is Related to Dutch Parliamentary Elections (Correct to Two Decimal Places).

Type of Tweet            | Precision | Recall | Accuracy | F Measure
Related to elections     | 0.93      | 0.73   | 0.84     | 0.82
Not related to elections | 0.78      | 0.94   |          | 0.85
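The relation between the columns of Table 2 can be made explicit by recomputing the metrics from a binary confusion matrix. The counts below are hypothetical, chosen only so the resulting values land close to the table's first row:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Hypothetical counts: 73 of 100 election tweets recovered,
# 94 of 100 non-election tweets correctly rejected
p, r, a, f = metrics(tp=73, fp=6, fn=27, tn=94)
print(f"precision={p:.2f} recall={r:.2f} accuracy={a:.3f} f1={f:.2f}")
```

The asymmetry in the table (high precision but lower recall for election tweets, and vice versa for non-election tweets) falls out of the same counts read from the other class's perspective.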
(Source: Social Science Computer Review)
Variety › System of social statistical datasets (CBS, Statistics Netherlands): virtual census (Bakker et al. 2014)
B.F.M. Bakker et al., The System of social statistical datasets:

Fig. 2. Conceptual model of the SSD register system. [Rectangles: object types; lines: relations between object types; PIN: person identification number; HIN: household identification number; AIN: address identification number; OIN: organization identification number; the indication x:y denotes the type of relation.]
sence of coordination. Moreover, data sharing among organizational units entails increased interdependency as well as the potential for unwanted output overlap. Therefore, being able to monitor the production schedules of other units is of paramount importance. In short, coordination is essential to simplify the combined use of data, to increase consistency between statistical registers, avoid duplicated work, ensure the appropriate application and interpretation of data, and for planning and control. Four types of coordination are distinguished: organizational, technical, content-related and output-related. These will be examined consecutively below.

Organizational coordination
SN's Division of Socioeconomic and Spatial Statistics consists of a number of organizational units. Each unit is responsible for the production of statistical output pertaining to a specific domain, e.g. employment, social security, demography. These units carry out register processing and store the resulting statistical registers in the central data library of the SSD. They are the formal owners of these registers, which means they are accountable for the timely processing as well as the quality of the registers. Several supporting tasks are performed by two central organizational units: one is responsible for assigning linkage keys to statistical registers. To that end, it maintains the CLFP and develops and applies matching algorithms. The other central organizational unit carries out a broad range of activities aimed at the integrity of the SSD and the efficient use of its contents. For instance, it performs micro-integration of different statistical registers, develops and maintains software tools and provides courses on the principles of the SSD. Lastly, two consultation bodies are worth mentioning. First, representatives of all organizational units participate in a consultative body which aims to coordinate the contents and technical aspects of the SSD. Second, a steering committee oversees current and future aspects of the SSD and takes action in the case of conflicts of interest.

Technical coordination
Standardization is the most prominent aspect of technical coordination within the SSD. File formats, data formats of linkage keys, naming conventions, metadata, IT infrastructure and planning tools are all standardized. Technical coordination also aims to prevent redundancy (the same variable in different statistical registers) and ambiguity (same variable under different names). In addition, a key feature of the SSD is an unambiguous link between data and metadata (Fig. 3). Meta-information and its structure is important for the proper processing and understanding of statistical data e.g. [17,44]. The transition to register-based statistics has broadened the demands on metadata as it entails a stronger dependence on external factors such as legislation underlying the administrative registers, variable definitions and data collection methods employed by the register keeper [11,36,51]. The metadata of the SSD are stored in a central metadata repository. Statistical registers are connected one-to-one with their corresponding metadata files, on the basis of the register name. Similarly, variables are related to their metadata on the basis of the variable name.

Content-related coordination
Several processes are directed at the coordination of content. Firstly, when either new statistical registers or modifications of existing registers are developed, the specifications are sent to all organizational units to enable stakeholders to contribute comments that represent their interests. Secondly, a central production schedule is kept within the SSD framework. Organizational units make their own timetables using a standardized planning tool. These timetables are automatically incorporated into a central schedule which can be consulted by all the units. Thirdly, if a historical register is updated frequently in order to produce timely statistics, coordinated versions are identified which are to be used for all statistics with less strict timelines. For instance, the demographic register, which is de-
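The register linkage that Fig. 2 and the coordination discussion describe can be illustrated with a toy micro-integration step: registers keyed by a shared person identification number (PIN) are merged into one record per person. The register names, fields, and values below are invented for illustration and are not the SSD's actual schema:

```python
# Toy statistical registers keyed by person identification number (PIN)
jobs = {101: {"employer": "A"}, 102: {"employer": "B"}}
benefits = {102: {"benefit": "unemployment"}, 103: {"benefit": "pension"}}

def micro_integrate(*registers):
    """Join registers on their shared key (a stand-in for SN's matching step)."""
    merged = {}
    for reg in registers:
        for pin, record in reg.items():
            merged.setdefault(pin, {}).update(record)
    return merged

combined = micro_integrate(jobs, benefits)
# PIN 102 now carries variables from both registers; 101 and 103 from one each
```

The standardized linkage keys and naming conventions discussed above are exactly what make this join well-defined across registers owned by different organizational units.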
Paradigm shift (Hey et al. 2009)
'Data' is a misleading term › Data are never given but always constructed (observation theory, data theory)
§ Someone chooses what is observed; that choice has consequences for validity and reliability
§ One observation can be interpreted as quite different data
› This is often forgotten where big data are concerned
Tests lose their meaning › The conventions for statistical tests were developed from a minimalist, experimental perspective (how large must n be to approximate a distribution? At that n, what is an acceptable Type I error?)
› With a large n, by these conventions virtually every relationship becomes significant
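A quick simulation makes the point concrete: with a substantively negligible true correlation (about 0.03), a large n makes the estimate overwhelmingly "significant". The sketch below is illustrative only, using the Fisher z transform for an approximate two-sided p value:

```python
import math
import random

random.seed(42)
n = 100_000
x = [random.gauss(0, 1) for _ in range(n)]
# y depends on x only trivially: true correlation is about 0.03
y = [0.03 * xi + random.gauss(0, 1) for xi in x]

# Pearson correlation coefficient, computed by hand
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)

# Fisher z transform gives an approximate two-sided p value
z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
p = math.erfc(abs(z) / math.sqrt(2))
print(f"r = {r:.3f}, p = {p:.2e}")  # tiny effect, 'significant' anyway
```

With n = 100,000 the p value lands many orders of magnitude below any conventional threshold, even though an effect this small would rarely matter substantively.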
Validity becomes problematic › External validity: to what extent are the data/relationships generalizable?
› Internal validity: to what extent can a correlation be interpreted causally?
Verification and replication › Data should satisfy the FAIR principles:
§ findable § accessible § interoperable § re-usable
Ownership (Einav & Levin 2014)
Economics in the age of big data
Liran Einav and Jonathan Levin
Science, 7 November 2014, Vol. 346, Issue 6210 (Review Summary)

BACKGROUND: Economic science has evolved over several decades toward greater emphasis on empirical work. The data revolution of the past decade is likely to have a further and profound effect on economic research. Increasingly, economists make use of newly available large-scale administrative data or private sector data that often are obtained through collaborations with private firms, giving rise to new opportunities and challenges.

[Figure: The rising use of non-publicly available data in economic research. Here we show the percentage of papers published in the American Economic Review (AER) that obtained an exemption from the AER's data availability policy, as a share of all papers published by the AER that relied on any form of data (excluding simulations and laboratory experiments). Notes and comments, as well as AER Papers and Proceedings issues, are not included in the analysis. We obtained a record of exemptions directly from the AER administrative staff and coded each exemption manually to reflect public sector versus private data. Our check of nonexempt papers suggests that the AER records may possibly understate the percentage of papers that actually obtained exemptions. The asterisk indicates that data run from when the AER started collecting these data (December 2005 issue) to the September 2014 issue. To make full use of the data, we define year 2006 to cover October 2005 through September 2006, year 2007 to cover October 2006 through September 2007, and so on. Y-axis: share of all published papers with data; x-axis: publication year (2006-2014); series: no exemption, exemption (private data), exemption (administrative data).]

ADVANCES: These new data are affecting economic research along several dimensions. Many fields have shifted from a reliance on relatively small-sample government surveys to administrative data with
universal or near-universal population coverage. This shift is transformative, as it allows researchers to rigorously examine variation in wages, health, productivity, education, and other measures across different subpopulations; construct consistent long-run statistical indices; generate new quasi-experimental research designs; and track diverse outcomes from natural and controlled experiments.

Perhaps even more notable is the expansion of private sector data on economic activity. These data, sometimes available from public sources but other times obtained through data-sharing agreements with private firms, can help to create more granular and real-time measurement of aggregate economic statistics. The data also offer researchers a look inside the "black box" of firms and markets by providing meaningful statistics on economic behavior such as search and information gathering, communication, decision-making, and microlevel transactions. Collaborations with data-oriented firms also create new opportunities to conduct and evaluate randomized experiments.

Economic theory plays an important role in the analysis of large data sets with complex structure. It can be difficult to organize and study this type of data (or even to decide which variables to construct) without a simplifying conceptual framework, which is where economic models become useful. Better data also allow for sharper tests of existing models and tests of theories that had previously been difficult to assess.
OUTLOOK: The advent of big data is already allowing for better measurement of economic effects and outcomes and is enabling novel research designs across a range of topics. Over time, these data are likely to affect the types of questions economists pose, by allowing for more focus on population variation and the analysis of a broader range of economic activities and interactions. We also expect economists to increasingly adopt the large-data statistical methods that have been developed in neighboring fields and that often may complement traditional econometric techniques.

These data opportunities also raise some important challenges. Perhaps the primary one is developing methods for researchers to access and explore data in ways that respect privacy and confidentiality concerns. This is a major issue in working with both government administrative data and private sector firms. Other challenges include developing the appropriate data management and programming capabilities, as well as designing creative and scalable approaches to summarize, describe, and analyze large-scale and relatively unstructured data sets. These challenges notwithstanding, the next few decades are likely to be a very exciting time for economic research.
1 Department of Economics, Stanford University, Stanford, CA 94305, USA. 2 National Bureau of Economic Research, 1050 Massachusetts Avenue, Cambridge, MA 02138, USA. *Corresponding author. E-mail: [email protected]
Cite this article as L. Einav, J. Levin, Science 346, 1243089 (2014); DOI: 10.1126/science.1243089
Read the full article at http://dx.doi.org/10.1126/science.1243089
AOL searcher No. 4417749 “My goodness, it’s my whole personal life…I had no idea somebody was looking over my shoulder.”
Protection of the individual › People are generally far too little aware of the integrated knowledge that is available about their person and their behavior
› Disclaimers are not understood
Infrastructure needed! › Data infrastructure:
§ for the quality of measurements § for methodological and statistical expertise § for maximal generalizability § to make the FAIR principles operational § to regulate ownership § to protect privacy
Two steps taken
NDSW: a data platform for the human and social sciences. Umbrella proposal for the new national roadmap. Start: 27 October.
M3: part of the KNAW Agenda for Large-Scale Scientific Infrastructure. Integrates biology, medicine, genetics, and informatics.