Big Data at Aadhaar
Dr. Pramod K Varma, [email protected], Twitter: @pramodkvarma
Regunath Balasubramaian, [email protected], Twitter: @RegunathB

Aadhaar at 5th_elephant_v3

DESCRIPTION

Slides used in the talk at the Fifth Elephant Big Data conference by Dr. Pramod Varma and Regunath B

Page 1: Aadhaar at 5th_elephant_v3

Big Data at Aadhaar

Dr. Pramod K Varma, [email protected], Twitter: @pramodkvarma

Regunath Balasubramaian, [email protected], Twitter: @RegunathB

Page 2: Aadhaar at 5th_elephant_v3

Aadhaar at a Glance

Page 3: Aadhaar at 5th_elephant_v3

India
• 1.2 billion residents
  – 640,000 villages, ~60% live under $2/day
  – ~75% literacy, <3% pay income tax, <20% banking
  – ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 bn on direct subsidies
  – Residents have no standard identity document
  – Most programs are plagued with ghost and multiple identities, causing leakage of 30-40%

Page 4: Aadhaar at 5th_elephant_v3

Vision

• Create a common “national identity” for every “resident”
  – Biometric backed identity to eliminate duplicates
  – “Verifiable online identity” for portability
• Applications ecosystem using open APIs
  – Aadhaar enabled bank account and payment platform
  – Aadhaar enabled electronic, paperless KYC

Page 5: Aadhaar at 5th_elephant_v3

Aadhaar System
• Enrolment
  – One time in a person’s lifetime
  – Minimal demographics
  – Multi-modal biometrics (Fingerprints, Iris)
  – 12-digit unique Aadhaar number assigned
• Authentication
  – Verify “you are who you claim to be”
  – Open API based
  – Multi-device, multi-factor, multi-modal

Page 6: Aadhaar at 5th_elephant_v3

Architecture Principles
• Design for scale
  – Every component needs to scale to large volumes
  – Millions of transactions and billions of records
  – Accommodate failure and design for recovery
• Open architecture
  – Use of open standards to ensure interoperability
  – Allow the ecosystem to build libraries to standard APIs
  – Use of open-source technologies wherever prudent
• Security
  – End to end security of resident data
  – Use of open source
  – Data privacy handling (API and data anonymization)

Page 7: Aadhaar at 5th_elephant_v3

Designed for Scale
• Horizontal scalability for all components
  – “Open Scale-out” is the key
  – Distributed computing on commodity hardware
  – Distributed data store and data partitioning
  – Horizontal scaling of the “data store” is a must!
  – Use of the right data store for the right purpose
• No single point of bottleneck for scaling
• Asynchronous processing throughout the system
  – Allows loose coupling of the various components
  – Allows independent component level scaling

Page 8: Aadhaar at 5th_elephant_v3

Enrolment Volume

• 600 to 800 million UIDs in 4 years
  – 1 million a day
  – 200+ trillion matches every day!!!
• ~5MB per resident
  – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!)
  – About 30 TB I/O every day
  – Replication and backup across DCs of about 5+ TB of incremental data every day
  – Lifecycle updates and new enrolments will continue forever
• Additional process data
  – Several million events on average moving through async channels (some persistent and some transient)
  – Needing complete update and insert guarantees across data stores
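The daily figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes that every new enrolment is compared against every record already enrolled, and that the gallery holds about 200 million records; both are assumptions made for illustration, not numbers stated in the talk. The function names are hypothetical.

```python
# Back-of-envelope check of the enrolment figures quoted on the slide.
ENROLMENTS_PER_DAY = 1_000_000        # from the slide: 1 million a day
PACKET_SIZE_MB = 5                    # from the slide: ~5 MB per resident
ASSUMED_GALLERY_SIZE = 200_000_000    # assumption: records already enrolled


def dedupe_matches_per_day(new_per_day, gallery_size):
    """Comparisons needed if each new enrolment is matched against the whole gallery."""
    return new_per_day * gallery_size


def incremental_tb_per_day(new_per_day, packet_mb):
    """New raw data per day in decimal terabytes (1 TB = 1,000,000 MB)."""
    return new_per_day * packet_mb / 1_000_000


matches = dedupe_matches_per_day(ENROLMENTS_PER_DAY, ASSUMED_GALLERY_SIZE)
print(f"matches per day: {matches:.1e}")
print(f"incremental TB per day: {incremental_tb_per_day(ENROLMENTS_PER_DAY, PACKET_SIZE_MB)}")
```

Under these assumptions the comparison count comes to 200 trillion a day and the incremental data to 5 TB a day, which is in line with the "200+ trillion matches" and "5+ TB of incremental data" quoted above.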

Page 9: Aadhaar at 5th_elephant_v3

Authentication Volume
• 100+ million authentications per day (10 hrs)
  – Possible high variance between peak and average
  – Sub-second response
  – Guaranteed audits
• Multi-DC architecture
  – All changes need to be propagated from enrolment data stores to all authentication sites
• Authentication request is about 4 KB
  – 100 million authentications a day
  – 1 billion audit records in 10 days (30+ billion a year)
  – 4 TB encrypted audit logs in 10 days
  – Audit write must be guaranteed
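The authentication numbers are consistent with each other, as a short calculation shows. The sketch assumes one audit record per authentication and an audit record of roughly the same size as the 4 KB request; both are assumptions made for illustration. The function names are hypothetical.

```python
# Back-of-envelope check of the authentication figures quoted on the slide.
AUTHS_PER_DAY = 100_000_000   # from the slide
WINDOW_HOURS = 10             # from the slide: the daily load arrives in ~10 hours
RECORD_KB = 4                 # assumption: audit record is about the request size


def average_tps(per_day, hours):
    """Average requests per second if the daily load is spread evenly over the window."""
    return per_day / (hours * 3600)


def audit_records(per_day, days):
    """Audit records written, assuming one record per authentication."""
    return per_day * days


def audit_volume_tb(records, record_kb):
    """Audit volume in decimal terabytes (1 TB = 1,000,000,000 KB)."""
    return records * record_kb / 1_000_000_000


print(f"average load: {average_tps(AUTHS_PER_DAY, WINDOW_HOURS):.0f} requests/sec")
print(f"records in 10 days: {audit_records(AUTHS_PER_DAY, 10):,}")
print(f"records in a year: {audit_records(AUTHS_PER_DAY, 365):,}")
print(f"audit TB in 10 days: {audit_volume_tb(audit_records(AUTHS_PER_DAY, 10), RECORD_KB)}")
```

The average works out to a little under 2,800 requests per second. Since the slide warns of high variance between peak and average, a real capacity plan would size for a multiple of that figure.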

Page 10: Aadhaar at 5th_elephant_v3

Open APIs

• Aadhaar Services
  – Core Authentication API and supporting Best Finger Detection, OTP Request APIs
  – New services being built on top
• Aadhaar Open Standards for Plug-n-play
  – Biometric Device API
  – Biometric SDK API
  – Biometric Identification System API
  – Transliteration API for Indian Languages

Page 11: Aadhaar at 5th_elephant_v3

Implementation

Page 12: Aadhaar at 5th_elephant_v3

Patterns & Technologies
• Principles
  – POJO based application implementation
  – Light-weight, custom application container
  – Http gateway for APIs
• Compute Patterns
  – Data Locality
  – Distribute compute (within an OS process and across)
• Compute Architectures
  – SEDA – Staged Event Driven Architecture
  – Master-Worker(s) Compute Grid
• Data Access types
  – High throughput streaming: bio-dedupe, analytics
  – High volume, moderate latency: workflow, UID records
  – High volume, low latency: auth, demo-dedupe, search – eAadhaar, KYC
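The SEDA pattern named above can be illustrated with a minimal sketch: each stage owns an input queue and a small pool of worker threads, and stages hand events to one another only through those queues. This is a toy illustration in Python, not the Aadhaar implementation (which the slides describe as POJO based); the `Stage` class, the `validate` handler and the packet fields are all hypothetical.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker thread to exit


class Stage:
    """One stage: an inbox queue drained by a small pool of worker threads."""

    def __init__(self, handler, next_stage=None, workers=2):
        self.handler = handler
        self.next_stage = next_stage
        self.inbox = queue.Queue()
        self.threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(workers)
        ]

    def start(self):
        for thread in self.threads:
            thread.start()

    def submit(self, event):
        self.inbox.put(event)

    def _run(self):
        while True:
            event = self.inbox.get()
            if event is _STOP:
                return
            result = self.handler(event)
            if self.next_stage is not None:
                self.next_stage.submit(result)

    def stop(self):
        # One sentinel per worker; events queued ahead of them are drained first.
        for _ in self.threads:
            self.inbox.put(_STOP)
        for thread in self.threads:
            thread.join()


def validate(packet):
    # Hypothetical check: a packet is valid if it carries a non-empty name.
    return {**packet, "valid": bool(packet.get("name"))}


def run_pipeline(packets):
    stored = []
    store = Stage(stored.append)               # final stage: collect results
    check = Stage(validate, next_stage=store)  # first stage: validate, then forward
    store.start()
    check.start()
    for packet in packets:
        check.submit(packet)
    check.stop()  # every packet has been validated and forwarded
    store.stop()  # every forwarded packet has been stored
    return stored
```

Because stages share nothing but queues, the worker count of each stage can be tuned on its own, which is the loose coupling and independent scaling the deck asks for. Results may arrive in a different order from the input, since several workers drain each queue.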

Page 13: Aadhaar at 5th_elephant_v3

Aadhaar Data Stores (Data consistency challenges..)

• Mongo cluster (all enrolment records/documents – demographics + photo)
  – Shard 1, Shard 2, Shard 3, Shard 4, Shard 5
  – Low latency indexed read (documents per sec), high latency random search (seconds per read)
• MySQL (all UID generated records – demographics only, track & trace, enrolment status)
  – UID master (sharded), Enrolment DB
  – Low latency indexed read (milli-seconds per read), high latency random search (seconds per read)
• Solr cluster (all enrolment records/documents – selected demographics only)
  – Shard 0, Shard 2, Shard 6, Shard 9, Shard a, Shard d, Shard f
  – Low latency indexed read (documents per sec), low latency random search (documents per sec)
• HDFS (all raw packets)
  – Data Node 1, Data Node 10, Data Node .., Data Node 20
  – High read throughput (MB per sec), high latency read (seconds per read)
• HBase (all enrolment biometric templates)
  – Region Ser. 1, Region Ser. 10, Region Ser. .., Region Ser. 20
  – High read throughput (MB per sec), low-to-medium latency read (milli-seconds per read)
• NFS (all archived raw packets)
  – LUN 1, LUN 2, LUN 3, LUN 4
  – Moderate read throughput, high latency read (seconds per read)
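The Solr shards in this diagram carry hexadecimal labels (0, 2, 6, 9, a, d, f). The slides do not say how records are assigned to shards. One common scheme that yields such labels, shown here purely as an assumption, is to hash the record key and route on the leading hex digit of the digest. The function names and the sample keys are hypothetical.

```python
import hashlib


def shard_for(key: str) -> str:
    """Return a shard label '0'..'f' from the first hex digit of the key's SHA-256 digest."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[0]


def partition(keys):
    """Group keys by shard label; every key lands in exactly one of up to 16 shards."""
    shards = {}
    for key in keys:
        shards.setdefault(shard_for(key), []).append(key)
    return shards


sample = [f"enrolment-{n}" for n in range(1000)]
for label, members in sorted(partition(sample).items()):
    print(label, len(members))
```

Routing on a hash prefix keeps the mapping stateless: any node can compute the shard for a key without consulting a directory, and a uniform hash spreads records roughly evenly across shards.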

Page 14: Aadhaar at 5th_elephant_v3

Aadhaar Architecture

• Work distribution using SEDA & Messaging

• Ability to scale within JVM and across

• Recovery through check-pointing

• Sync Http based Auth gateway

• Protocol Buffers & XML payloads

• Sharded clusters

• Near Real-time data delivery to warehouse

• Nightly data-sets used to build dashboards, data marts and reports

• Real-time monitoring using Events
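"Recovery through check-pointing" in the list above can be sketched as follows: a worker records how far it has got after each item, and on restart resumes from the recorded position instead of from the beginning. The file format, the `next_index` field and the function name are assumptions made for this illustration; the slides do not describe the actual mechanism.

```python
import json
import os


def process_with_checkpoint(items, handler, checkpoint_path):
    """Process items in order, persisting the next index to resume from after each one."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for index in range(start, len(items)):
        handler(items[index])
        # Write to a temporary file and rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)
```

A crash after `handler` returns but before the checkpoint is written causes that one item to be processed again on restart, so this gives at-least-once processing and the handler should be idempotent.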

Page 15: Aadhaar at 5th_elephant_v3

Deployment Monitoring

Page 16: Aadhaar at 5th_elephant_v3

Learnings
• Make everything API based
• Everything fails (hardware, software, network, storage)
  – System must recover, retry transactions, and more or less self-heal
• Security and privacy should not be an afterthought
• Scalability does not come from one product
• Open scale-out is the only way you should go
  – Heterogeneous, multi-vendor, commodity compute, growing in a linear fashion. Nothing else can adapt!
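The "recover, retry transactions" learning can be made concrete with a small retry helper. Exponential backoff is a common choice for this and is used here as an assumption; the talk does not name a specific retry policy. The `retry` function and its parameters are hypothetical.

```python
import time


def retry(operation, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n and try again, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying is only safe when the operation is idempotent or the receiving side de-duplicates, which fits the deck's emphasis on guaranteed audit writes and update and insert guarantees across data stores.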

Page 17: Aadhaar at 5th_elephant_v3

Thank You!

Dr. Pramod K Varma, [email protected], Twitter: @pramodkvarma

Regunath Balasubramaian, [email protected], Twitter: @RegunathB