Big Data activities at SURS
Statistical Office of the Republic of Slovenia
DIME/ITDG meeting, February 2016

Aim of official statistics
Support data users:
- Government
- Politicians and legislators
- Markets
- The public
- The media
- International community

Data Sources
- Surveys
- Administrative sources
- Big Data

Big Data
Possible usage:
- New statistics
- New (or combined) sources for existing statistics
- Validation (benchmarking) of data and statistics
- Different modes of data collection
- Flash statistics
- Faster release of statistics
AAPOR Report on Big Data (2015): Surveys and Big Data are complementary data sources, not competing data sources.

Current activities at SURS
- Analysis of different types of Big Data and the possibilities for using them in regular statistical production (mobile positioning data, scanned price data, web scraping, etc.)
- IT infrastructure
- Partnership with stakeholders (data owners, academia, etc.)
- Active participation in international task forces (Eurostat BD Task Force, UNECE BD Task Force) and projects (ESSNET grant pilots)

Statistical model and new sources
- Scanned and scraped data on prices and job vacancies
- New type of statistics based on mobile positioning data

Comparison between job vacancies statistics from survey and scraped data

Web scraping system for identifying job advertisements

Process of creating the collection tool
- Spider: takes a company website and finds all webpages (sub-links) on that website that relate to employment.
- Downloader: downloads the content of the saved URL links (there are problems with PDF files and HTTPS).
- Splitter: splits the content of a given URL into separate documents.
- Determinator: detects job vacancy (JV) ads in the documents produced by Splitter.
- Classifier: classifies the detected JV ads, for example by occupation, deadline, address, region.
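
As an illustration only, three of the stages above (Spider, Splitter, Determinator) could be sketched in Python roughly as follows. This is not the SURS implementation: the function names, the URL hints, and the keyword lists are all hypothetical, and the Determinator here uses a simple whitelist/blacklist of phrases. Downloader and Classifier are left out, so the functions work on HTML and text that have already been fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical hints and keyword lists; the lists actually used are not given in the slides.
EMPLOYMENT_HINTS = ("job", "career", "vacanc", "employment")
WHITELIST = ("job vacancy", "we are hiring", "apply by")
BLACKLIST = ("no vacancies", "no open positions")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(base_url, html):
    """Return links on the same site whose URL looks employment-related."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(base_url).netloc
    found = []
    for href in parser.links:
        url = urljoin(base_url, href)  # resolve relative links
        same_site = urlparse(url).netloc == site
        looks_relevant = any(hint in url.lower() for hint in EMPLOYMENT_HINTS)
        if same_site and looks_relevant and url not in found:
            found.append(url)
    return found


def splitter(text):
    """Split page text into candidate documents on blank lines."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def determinator(document):
    """Flag a document as a JV ad: no blacklisted phrase, at least one whitelisted phrase."""
    lowered = document.lower()
    if any(phrase in lowered for phrase in BLACKLIST):
        return False
    return any(phrase in lowered for phrase in WHITELIST)
```

A usage example under the same assumptions: `spider("https://firm.example/", page_html)` yields the candidate URLs, and `[d for d in splitter(page_text) if determinator(d)]` keeps only the documents flagged as JV ads. Matching on plain substrings keeps the sketch short; a production tool would need language-specific keyword lists and more robust text extraction.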
Process of creating the collection tool
Two different approaches to detecting JV ads are currently being carried out:
- Use of a "decision tree" on the content of downloaded URLs
- Use of a list of common keywords and phrases (whitelist and blacklist of words) to detect JV ads in the content of downloaded URLs

Job Ads Statistics - initial results

Mobile positioning and statistical derivatives
Mobile operators:
- 4 mobile network operators
- 3 service providers
- 3 re-sellers
- The first 4 are primary data providers
- All network operators and service providers could be (or are) important!

Mobile data
- For investigation purposes, SURS had access to data from the second-largest mobile operator in Slovenia
- Data from April to October 2014 (1 billion records)
- Three variables:
  - Anonymized IMEI
  - Time of event (outgoing call, outgoing SMS, connecting to the internet using a mobile phone)
  - Coordinates of antennas

Daytime
Density of people in Ljubljana during the day
Density of people in Ljubljana during the night

BD activities in 2016 (1)
- In February, a set of workshops will be organized with the subject-matter statisticians.
- Goal: brainstorm ideas and prepare business cases for the use of BD in different domains of statistical production

BD activities in 2016 (2)
Deepen cooperation with Slovenian universities. Goals:
- Education of colleagues
- Use of data mining (and collection) tools developed by Slovenian faculties, or cooperation in projects

BD activities in 2016 (3)
- Active part in the ESSNET BD project (one of the WP leaders)
- Organization of the Eurostat Big Data Workshop in Slovenia, and contribution to the ethical review and ethical guidelines, which are to be prepared this year
- Continuation of ongoing work in local projects (job vacancies data from enterprise websites and CEMODE)

Open questions
- Access to data (legal issues, partnership, etc.)
- Big data are used for different purposes (different definitions)
- There is no control over the collection process
- Data could change or even disappear
- Public perception
- IT and methodological skills
- IT infrastructure
- Quality of data