Analytics in Official Statistics: From
Adaptive Survey Design to the U.S. 2020
Census
Michael T. Thieme
Assistant Director for Systems and Contracts
Decennial Census Programs
U.S. Census Bureau
Disclaimer
The thoughts and opinions in this presentation are those of the presenter and not necessarily those of the U.S. Census Bureau.
▪Survey costs are rising
▪Confidence in government is declining
▪And with it, the public's willingness to participate in surveys
▪Current methods for producing official statistics are unsustainable
Where we are: How did we get here? And how do we keep going?
One Barrier to Where We Want to Go
Source: Fostering Interoperability in Official Statistics: Common Statistical Production Architecture (UNECE, 2013)
Accidental Architecture
This is what Accidental Architecture looks like at Census:
The Result?
▪Higher system costs: development, operations, and maintenance
▪Nearly nonexistent interoperability
▪Less data accessibility, discoverability, and usability
▪Much more difficult to use data analytics and adaptive survey design approaches
Part of the Answer: Adaptive Survey Design
Survey Data Collection Platform as a Service
A new approach at the U.S. Census Bureau
[Diagram: Survey Data Collection Platform components]
▪Concurrent Analysis and Estimation System
▪Unified Tracking System (Paradata Repository)
▪Centralized Operational Analysis and Control (Multimode Operational Control System)
▪Frame and Sample Systems: CaRDS (ACS and Decennial), MAF/TIGER (Decennial), Business Register (ECON), BR, MADB, StEPS II (ECON)
▪Integrated Field Operation Control Systems
▪Time & Attendance Systems
▪Response Processing Systems: Centurion/ISR, iCADE, ATAC, CQA/IVR, COMET, CLMS, Enumeration
Modest Beginnings
▪National Survey of College Graduates
▪Developed R-Indicator Model
▪Ran experiments
▪Built confidence
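The R-indicator mentioned above measures how representative a survey's response is. In one common formulation (Schouten et al.), it is one minus twice the standard deviation of the estimated response propensities: balanced response gives R near 1. The sketch below is illustrative only; the propensity values are invented and this is not the Bureau's actual model:

```python
import math

def r_indicator(propensities):
    """Representativeness indicator: R = 1 - 2 * S(rho), where S(rho)
    is the sample standard deviation of the estimated response
    propensities. R = 1 means perfectly balanced response;
    lower values signal a less representative respondent pool."""
    n = len(propensities)
    mean = sum(propensities) / n
    var = sum((p - mean) ** 2 for p in propensities) / (n - 1)
    return 1 - 2 * math.sqrt(var)

# Hypothetical estimated propensities for five sample cases
rhos = [0.6, 0.55, 0.65, 0.5, 0.7]
print(round(r_indicator(rhos), 3))  # → 0.842
```

In adaptive survey design, a falling R-indicator during data collection is a signal to redirect effort toward underrepresented groups rather than simply chasing the overall response rate.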
Modest Beginnings
▪Census Tests: 2014, 2015, 2016
▪Administrative record modeling
▪Optimization of field work
▪Changed the way we do censuses
The U.S. 2020 Census
Using Analytics to:
▪Optimize the 2020 Census paid advertising campaign
▪Identify vacant housing units
▪Optimize the number of enumeration attempts
▪Identify the best time to knock on doors
▪Optimize field worker efficiency
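Two of the optimizations above can be sketched in miniature: picking the best time to knock from historical contact rates, and capping enumeration attempts once a target cumulative contact probability is reached. The paradata numbers, area names, and the independence assumption here are all invented for illustration; they are not the Bureau's actual models:

```python
# Hypothetical paradata: observed contact rates by day-part for two areas
contact_rates = {
    "block_group_A": {"morning": 0.18, "afternoon": 0.25, "evening": 0.41},
    "block_group_B": {"morning": 0.33, "afternoon": 0.21, "evening": 0.28},
}

def best_window(area, rates=contact_rates):
    """Best time to knock: the day-part with the highest observed contact rate."""
    return max(rates[area], key=rates[area].get)

def attempts_needed(rate, target=0.9, cap=6):
    """Enumeration attempts needed to reach a target cumulative contact
    probability, treating attempts as independent; capped at `cap`."""
    for a in range(1, cap + 1):
        if 1 - (1 - rate) ** a >= target:
            return a
    return cap

print(best_window("block_group_A"))  # → evening
print(attempts_needed(0.41))         # → 5
```

The point of the sketch: with even crude paradata, a scheduler can stop sending enumerators back after the marginal attempt adds little contact probability, which is where the field-efficiency savings come from.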
CAES 2020 Production Environment
1. Hortonworks Hadoop for storage and in-database processing
2. SAS 9.4 for non-distributed processing
3. SAS Viya for distributed processing
The Journey to CAES 2020
3 years of testing non-distributed versus distributed processing along 3 dimensions:
1. Performance: How fast can we go?
2. Accuracy: When we go fast, do we come up with the same result?
3. Cost: What does it take to achieve better performance and the same level of precision?

Pilot        Technology                         Performance  Accuracy  Cost
2015 Pilot   SAS LASR In-Memory
2016 Pilot   SAS In-Database (via MapReduce)
2018 Pilot   SAS Viya In-Memory                 ?            ?         ?
2018 Pilot in Detail
Business Goal: speed up the Decennial Administrative Records process
1. Performance: Non-Distributed Model Processing Time: 38 HOURS; Distributed Model Processing Time: 2 HOURS
2. Accuracy: Non-distributed and distributed RESULTS MATCHED
3. Cost: Roughly 4 HOURS required to convert and validate each model; preserved existing code structure and Math-Stat way of working
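The 38-hour to 2-hour improvement above is a 19x speedup, but it only counts if the accuracy check passes. A minimal sketch of that kind of check is below: compare scored predictions from the two runs pairwise within a tolerance. The score values and the 1e-6 tolerance are assumptions for illustration, not the pilot's actual validation procedure:

```python
import math

def scores_match(non_dist, dist, tol=1e-6):
    """Check that scored predictions from the non-distributed (SAS 9.4)
    run and the distributed (Viya) run agree pairwise within `tol`."""
    return (len(non_dist) == len(dist)
            and all(math.isclose(a, b, abs_tol=tol)
                    for a, b in zip(non_dist, dist)))

# Hypothetical scored predictions from the two runs
legacy = [0.8123456, 0.1234567, 0.5555555]
viya   = [0.8123457, 0.1234567, 0.5555555]
print(scores_match(legacy, viya))  # → True
```

An absolute tolerance matters here because distributed execution can reorder floating-point operations, so bitwise-identical output is not a realistic bar; "matched" means agreement within a pre-agreed precision.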
APPENDIX SLIDES
Moved slides from previous draft to back of presentation
Performance of AdRec Modeling Programs
Performance of Long-Running Occupied Model Program
Accuracy of Scored Predictions from Occupied Model
Cost of Converting 9.4 LOGISTIC to Viya LOGSELECT
CAES 2020 Production Environment (Recommended)

[Diagram: CAES cluster on a high-speed local network; communication between tiers, no data moved. Business users, developers, and admins connect through SAS desktop clients. Legend: blue = SAS servers and services (Viya servers bare metal recommended); green = Hortonworks servers and services.]

Hortonworks Hadoop cluster:
▪Master Node 1 (20 CPU cores, 256 GB memory, 12x 2 TB disk storage): NameNode 1, Resource Manager 2, Journal Keeper, Zookeeper, SAS Embedded Process
▪Master Node 2 (20 CPU cores, 256 GB memory, 12x 2 TB disk storage): Resource Manager 1, Hive Metastore 2, HiveServer 2, WebHCat 2, Journal Keeper, SAS Embedded Process
▪Master Node 3 (20 CPU cores, 256 GB memory, 12x 2 TB disk storage): NameNode, History Server, Timeline Server, Journal Keeper, Zookeeper, SAS Embedded Process
▪Master Node 4 (20 CPU cores, 256 GB memory, 12x 2 TB disk storage): Hive Metastore 1, HiveServer 1, WebHCat 1, Zookeeper, SAS Embedded Process
▪Worker Nodes 1-9: DataNode, NodeManager, Open Source R, SAS Embedded Process (typical worker: 28 CPU cores, 384 GB memory, 16 TB disk storage); one worker (20 CPU cores, 256 GB memory, 16 TB disk storage) also hosts Knox Gateway, HDP Clients, RStudio

Virtual machines (each 8 CPU vCores, 32 GB memory, 1 TB vDisk storage):
▪Virtual Machine 1: SAS 9.4 Metadata Server, SAS 9.4 Compute Server
▪Virtual Machine 2: MySQL Database Server
▪Virtual Machine 3: Ambari Server, Ranger Audit Server, Ranger Policy Server, Zeppelin, HST Server, Activity Analyzer

SAS servers (each 28 CPU cores, 384 GB memory, 16 TB disk storage):
▪SAS Viya Controller Node: SAS Visual Analytics, SAS Visual Statistics, SAS Visual Data Mining and Machine Learning (all Viya enabled)
▪SAS Viya Worker Nodes 1-4
▪SAS Viya Microservice Node
▪SAS Mid-Tier Server: SAS Metadata Server, SAS Web Server, SAS Web Application Server, SAS Web Clients, SAS Environment Manager, SAS Data Loader for Hadoop, SAS Scoring Accelerator for Hadoop, SAS Compute Server