Analytics in Official Statistics: From
Adaptive Survey Design to the U.S. 2020
Census
Michael T. Thieme
Assistant Director for Systems and Contracts
Decennial Census Programs
U.S. Census Bureau
Disclaimer
The thoughts and opinions in this presentation are those of the presenter and not necessarily those of the U.S. Census Bureau.
▪Survey costs are rising
▪Confidence in government is declining
▪And with it, the public's willingness to participate in surveys
▪Current methods for producing official statistics are unsustainable
Where we are: How did we get here? And how do we keep going?
One Barrier to Where We Want to Go
Source: Fostering Interoperability in Official Statistics: Common Statistical Production Architecture (UNECE, 2013)
Accidental Architecture
This is what Accidental Architecture looks like at Census:
The Result?
▪Higher system costs: development, operations, and maintenance
▪Nearly nonexistent interoperability
▪Less data accessibility, discoverability, and usability
▪Much more difficult to use data analytics and adaptive survey design approaches
Part of the Answer: Adaptive Survey Design
Survey Data Collection Platform as a Service
A new approach at the U.S. Census Bureau
[Diagram: Survey Data Collection Platform components]
▪Concurrent Analysis and Estimation System
▪Unified Tracking System (Paradata Repository)
▪Centralized Operational Analysis and Control (Multimode Operational Control System)
▪Frame and Sample Systems: CaRDS (ACS and Decennial), MAF/TIGER (Decennial), Business Register (ECON), BR, MADB, StEPS II (ECON)
▪Integrated Field Operation Control Systems
▪Time & Attendance Systems
▪Response Processing Systems: Centurion/ISR, iCADE, ATAC, CQA/IVR, COMET, CLMS, Enumeration
Modest Beginnings
▪National Survey of College Graduates
▪Developed R-Indicator Model
▪Ran experiments
▪Built confidence
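The R-indicator mentioned above measures how representative a survey's response is. In one common formulation (Schouten et al.), it is one minus twice the standard deviation of the estimated response propensities: balanced response gives R near 1. The sketch below is illustrative only; the propensity values are invented and this is not the Bureau's actual model:

```python
import math

def r_indicator(propensities):
    """Representativeness indicator: R = 1 - 2 * S(rho), where S(rho)
    is the sample standard deviation of the estimated response
    propensities. R = 1 means perfectly balanced response;
    lower values signal a less representative respondent pool."""
    n = len(propensities)
    mean = sum(propensities) / n
    var = sum((p - mean) ** 2 for p in propensities) / (n - 1)
    return 1 - 2 * math.sqrt(var)

# Hypothetical estimated propensities for five sample cases
rhos = [0.6, 0.55, 0.65, 0.5, 0.7]
print(round(r_indicator(rhos), 3))  # → 0.842
```

In adaptive survey design, a falling R-indicator during data collection is a signal to redirect effort toward underrepresented groups rather than simply chasing the overall response rate.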
Modest Beginnings
▪Census Tests: 2014, 2015, 2016
▪Administrative record modeling
▪Optimization of field work
▪Changed the way we do censuses
The U.S. 2020 Census
Using Analytics to:
▪Optimize the 2020 Census paid advertising campaign
▪Identify vacant housing units
▪Optimize the number of enumeration attempts
▪Identify the best time to knock on doors
▪Optimize field worker efficiency
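Two of the optimizations above can be sketched in miniature: picking the best time to knock from historical contact rates, and capping enumeration attempts once a target cumulative contact probability is reached. The paradata numbers, area names, and the independence assumption here are all invented for illustration; they are not the Bureau's actual models:

```python
# Hypothetical paradata: observed contact rates by day-part for two areas
contact_rates = {
    "block_group_A": {"morning": 0.18, "afternoon": 0.25, "evening": 0.41},
    "block_group_B": {"morning": 0.33, "afternoon": 0.21, "evening": 0.28},
}

def best_window(area, rates=contact_rates):
    """Best time to knock: the day-part with the highest observed contact rate."""
    return max(rates[area], key=rates[area].get)

def attempts_needed(rate, target=0.9, cap=6):
    """Enumeration attempts needed to reach a target cumulative contact
    probability, treating attempts as independent; capped at `cap`."""
    for a in range(1, cap + 1):
        if 1 - (1 - rate) ** a >= target:
            return a
    return cap

print(best_window("block_group_A"))  # → evening
print(attempts_needed(0.41))         # → 5
```

The point of the sketch: with even crude paradata, a scheduler can stop sending enumerators back after the marginal attempt adds little contact probability, which is where the field-efficiency savings come from.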
CAES 2020 Production Environment
1. Hortonworks Hadoop for storage and in-database processing
2. SAS 9.4 for non-distributed processing
3. SAS Viya for distributed processing
The Journey to CAES 2020
3 years of testing non-distributed versus distributed processing along 3 dimensions:
1. Performance: How fast can we go?
2. Accuracy: When we go fast, do we come up with the same result?
3. Cost: What does it take to achieve better performance and the same level of precision?

Pilot        Technology                         Performance  Accuracy  Cost
2015 Pilot   SAS LASR In-Memory
2016 Pilot   SAS In-Database (via MapReduce)
2018 Pilot   SAS Viya In-Memory                 ?            ?         ?
2018 Pilot in Detail
Business Goal: speed up the Decennial Administrative Records process
1. Performance: Non-Distributed Model Processing Time: 38 HOURS; Distributed Model Processing Time: 2 HOURS
2. Accuracy: Non-distributed and distributed RESULTS MATCHED
3. Cost: Roughly 4 HOURS required to convert and validate each model; preserved existing code structure and Math-Stat way of working
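The 38-hour to 2-hour improvement above is a 19x speedup, but it only counts if the accuracy check passes. A minimal sketch of that kind of check is below: compare scored predictions from the two runs pairwise within a tolerance. The score values and the 1e-6 tolerance are assumptions for illustration, not the pilot's actual validation procedure:

```python
import math

def scores_match(non_dist, dist, tol=1e-6):
    """Check that scored predictions from the non-distributed (SAS 9.4)
    run and the distributed (Viya) run agree pairwise within `tol`."""
    return (len(non_dist) == len(dist)
            and all(math.isclose(a, b, abs_tol=tol)
                    for a, b in zip(non_dist, dist)))

# Hypothetical scored predictions from the two runs
legacy = [0.8123456, 0.1234567, 0.5555555]
viya   = [0.8123457, 0.1234567, 0.5555555]
print(scores_match(legacy, viya))  # → True
```

An absolute tolerance matters here because distributed execution can reorder floating-point operations, so bitwise-identical output is not a realistic bar; "matched" means agreement within a pre-agreed precision.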
APPENDIX SLIDES
Moved slides from previous draft to back of presentation
Performance of AdRec Modeling Programs
Performance of Long-Running Occupied Model Program
Accuracy of Scored Predictions from Occupied Model
Cost of Converting 9.4 LOGISTIC to Viya LOGSELECT
CAES 2020 Production Environment (Recommended)

[Diagram: CAES cluster on a high-speed local network; communication between tiers, no data moved. Business users, developers, and admins connect through SAS desktop clients. Legend: blue = SAS servers and services (Viya servers bare metal recommended); green = Hortonworks servers and services.]

Hortonworks Hadoop cluster:
▪Master Node 1 (20 CPU cores, 256 GB memory, 12x 2 TB disk storage): NameNode 1, Resource Manager 2, Journal Keeper, Zookeeper, SAS Embedded Process
▪Master Node 2 (20 CPU cores, 256 GB memory, 12x 2 TB disk storage): Resource Manager 1, Hive Metastore 2, HiveServer 2, WebHCat 2, Journal Keeper, SAS Embedded Process
▪Master Node 3 (20 CPU cores, 256 GB memory, 12x 2 TB disk storage): NameNode, History Server, Timeline Server, Journal Keeper, Zookeeper, SAS Embedded Process
▪Master Node 4 (20 CPU cores, 256 GB memory, 12x 2 TB disk storage): Hive Metastore 1, HiveServer 1, WebHCat 1, Zookeeper, SAS Embedded Process
▪Worker Nodes 1-9: DataNode, NodeManager, Open Source R, SAS Embedded Process (typical worker: 28 CPU cores, 384 GB memory, 16 TB disk storage); one worker (20 CPU cores, 256 GB memory, 16 TB disk storage) also hosts Knox Gateway, HDP Clients, RStudio

Virtual machines (each 8 CPU vCores, 32 GB memory, 1 TB vDisk storage):
▪Virtual Machine 1: SAS 9.4 Metadata Server, SAS 9.4 Compute Server
▪Virtual Machine 2: MySQL Database Server
▪Virtual Machine 3: Ambari Server, Ranger Audit Server, Ranger Policy Server, Zeppelin, HST Server, Activity Analyzer

SAS servers (each 28 CPU cores, 384 GB memory, 16 TB disk storage):
▪SAS Viya Controller Node: SAS Visual Analytics, SAS Visual Statistics, SAS Visual Data Mining and Machine Learning (all Viya enabled)
▪SAS Viya Worker Nodes 1-4
▪SAS Viya Microservice Node
▪SAS Mid-Tier Server: SAS Metadata Server, SAS Web Server, SAS Web Application Server, SAS Web Clients, SAS Environment Manager, SAS Data Loader for Hadoop, SAS Scoring Accelerator for Hadoop, SAS Compute Server