
Page 1: Datafying Bitcoin

Tariq B. Ahmad
https://github.com/tariq786/datafying_bitcoin

Page 2: Motivation

● Bitcoin is a virtual peer-to-peer cryptocurrency.

● All Bitcoin transactions are publicly available (who sent, who received, and how much), but they are only pseudo-anonymous.

● This public record is called the blockchain, a distributed ledger. Its current size is around 70 GB of binary data, and it has grown every day since 2009.

Page 3: Blockchain Size

[Chart: growth of the blockchain's size over time]

Page 4: Bitcoin Transaction Types

● One-to-one: a single input pays a single output.
● Many-to-many: several inputs fund several outputs in one transaction (see the sketch below).
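The deck shows these shapes only as diagrams; here is an illustrative stand-in (not from the slides) for what a many-to-many transaction looks like in Bitcoin Core's decoded JSON. The vin/vout field names are Bitcoin Core's; every txid, address, and value below is invented.

    # Shape follows Bitcoin Core's decoded-transaction JSON (vin/vout);
    # all txids, addresses, and values are made up for illustration.
    many_to_many_tx = {
        "txid": "f00d...",                      # placeholder, not a real txid
        "vin": [                                # two inputs, each spending an output
            {"txid": "aaaa...", "vout": 0},     # of some earlier transaction
            {"txid": "bbbb...", "vout": 1},
        ],
        "vout": [                               # three outputs, values in BTC
            {"value": 0.50, "n": 0, "scriptPubKey": {"addresses": ["1Alice..."]}},
            {"value": 0.30, "n": 1, "scriptPubKey": {"addresses": ["1Bob..."]}},
            {"value": 0.19, "n": 2, "scriptPubKey": {"addresses": ["1Carol..."]}},
        ],
    }
    # The fee is implicit: the inputs' value (stored in the spent
    # transactions, not here) minus the 0.99 BTC of outputs.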

Page 5: Block

A block contains Bitcoin transactions. There are almost 400,000 blocks today. The blockchain links all of these blocks together, much like a doubly linked list.
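A sketch (not from the deck) of why the chain behaves like a doubly linked list through the RPC interface: Bitcoin Core's getblock JSON carries both previousblockhash and nextblockhash, so a script can walk in either direction. This assumes a local full node and a python-bitcoinrpc AuthServiceProxy client; the credentials are placeholders.

    from bitcoinrpc.authproxy import AuthServiceProxy  # pip install python-bitcoinrpc

    # Placeholder credentials; these must match rpcuser/rpcpassword in bitcoin.conf.
    rpc = AuthServiceProxy("http://rpcuser:rpcpassword@127.0.0.1:8332")

    def walk_forward(start_hash, n):
        """Follow nextblockhash links for n blocks, like next-pointers in a list."""
        block_hash = start_hash
        for _ in range(n):
            block = rpc.getblock(block_hash)         # verbose JSON for one block
            print(block["height"], block_hash, len(block["tx"]), "transactions")
            block_hash = block.get("nextblockhash")  # absent only at the chain tip
            if block_hash is None:
                break

    walk_forward(rpc.getblockhash(0), 5)             # start at the genesis block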

Page 6: Datafying Bitcoins

Data● Historical Data

○ Almost 400,000 blocks (new bitcoins)○ More than 104 Million transactions so far

● Live Data○ 2 transaction per second○ Propagate through Peer to Peer

6

69 GB (2009-2016)

Page 7: Query

The target query: the evolution of the bitcoin transaction fee per block.
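The slides do not spell out the fee formula, so for reference: a transaction's fee is the value of the inputs it spends minus the value of its outputs, and the per-block fee sums this over every non-coinbase transaction. A minimal sketch, reusing the assumed AuthServiceProxy client from the block-walking example and assuming the node runs with txindex=1 (otherwise getrawtransaction cannot look up arbitrary historical transactions):

    def block_fee(rpc, block_hash):
        """Total fee paid in one block: sum over its non-coinbase
        transactions of (value of spent inputs - value of new outputs)."""
        block = rpc.getblock(block_hash)
        fee = 0.0
        for txid in block["tx"][1:]:             # tx[0] is the coinbase; no real inputs
            tx = rpc.getrawtransaction(txid, 1)  # 1 => decoded JSON, not raw hex
            in_value = 0.0
            for vin in tx["vin"]:
                # Input values live in the transactions being spent,
                # so each input costs one extra RPC lookup.
                prev = rpc.getrawtransaction(vin["txid"], 1)
                in_value += float(prev["vout"][vin["vout"]]["value"])
            out_value = sum(float(v["value"]) for v in tx["vout"])
            fee += in_value - out_value
        return fee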

Page 8: Working with Data

● Run a full node on AWS, which stores the entire blockchain ledger on AWS.
● Query the blockchain via JSON-RPC from Python.
● Two RPC calls per block (~200,000 relevant blocks, 6.5 GB of text storage).
  ○ Average time per RPC call = 1.45 s, a huge performance bottleneck. The workaround is to reduce this to one RPC call per block by storing all blocks in JSON format on disk/HDFS (sketched below).

[Diagram: the app issues two JSON-RPC calls against the Bitcoin node per block: (1) a get-block call that returns the block JSON, then (2) a get-transaction call that returns each transaction's JSON.]
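A sketch of the workaround described above, under the same client assumption as the earlier examples: fetch each block's JSON once and append it to a line-delimited file that later Spark batch jobs read from local disk or HDFS, instead of paying the RPC round-trip cost again. The file name and one-JSON-per-line layout are my choices, not necessarily the project's.

    import json

    def dump_blocks(rpc, start_height, end_height, out_path):
        """One pass over the chain: fetch each block's JSON once and append
        it to a line-delimited file, so later Spark jobs reread the file
        rather than paying the ~1.45 s per-call RPC cost again."""
        with open(out_path, "w") as out:
            for height in range(start_height, end_height):
                block = rpc.getblock(rpc.getblockhash(height))
                # default=str handles the Decimal values python-bitcoinrpc returns
                out.write(json.dumps(block, default=str) + "\n")

    # e.g. dump_blocks(rpc, 0, 200000, "blocks.jsonl"), then: hdfs dfs -put blocks.jsonl /data/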

Page 9: Data Pipeline

[Diagram: Bitcoin node → ingestion → file system (local disk) → batch processing → database → visualization, with a parallel stream-processing path fed through a netcat relay.]
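The deck shows the pipeline only as a diagram. As an illustration of the streaming leg, here is a minimal PySpark Streaming sketch, under the assumption that the netcat relay forwards one transaction JSON per line to a TCP port (host and port are placeholders):

    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="datafying_bitcoin_stream")
    ssc = StreamingContext(sc, 10)                 # 10-second micro-batches

    # Assumes the netcat relay writes one transaction JSON per line to port 9999.
    lines = ssc.socketTextStream("localhost", 9999)
    txs = lines.map(json.loads)

    # Per micro-batch: how many transactions arrived and how much BTC they moved.
    out_values = txs.map(lambda tx: sum(float(v["value"]) for v in tx["vout"]))
    out_values.count().pprint()
    out_values.reduce(lambda a, b: a + b).pprint()

    ssc.start()
    ssc.awaitTermination()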

Page 10: Accomplishments and Challenges

● The complex query (bitcoin transaction fee evolution) works end to end.
● Working with a sea of JSONs (two JSONs per block) in Apache Spark is complex; it takes time to scale the results.
● Ideally, the three modes (batch, streaming, and API) are compared for throughput, latency, and cost.
● Public APIs have rate limits. After a lot of searching, I found the Toshi API (https://toshi.io), which has no rate limit.

Page 11: Comparison

Mode           | # of processed blocks | Time (minutes) | Storage
RPC Batch      | 186,846               | 162            | Local file system
RPC Batch      | 186,846               | 69             | HDFS
RPC Streaming  | 187,990               | 177            | -
API Streaming  | 187,990               | 222            | -
API Batch      | 187,990               | 3.1            | HDFS

Storing data on HDFS pays off: Spark processing takes only 3.1 minutes in API mode and 69 minutes in RPC mode (62 of those minutes are RPC call overhead for the get-transaction call).

Page 12: Visualization

[Chart: evolution of the bitcoin transaction fee per block]

Page 13: Zooming In to Check Discontinuity

[Chart: zoomed-in view of the same series around an apparent discontinuity]

Page 14: About Me

● PhD in Computer Engineering: parallel computing & computer security
● In love with Linux
● Likes disruptive technology

Thank you + Q&A