
Page 1: Datafying Bitcoin

Tariq B. Ahmad
https://github.com/tariq786/datafying_bitcoin

Page 2: Motivation

● Bitcoin is a virtual peer-to-peer cryptocurrency.

● All Bitcoin transactions are publicly available (who sent, who received, and how much), but they are only pseudo-anonymous.

● This public record is called the blockchain, a distributed ledger. Its current size is around 70 GB of binary data, and it has grown every day since 2009.

Page 3: Blockchain Size

[Chart: growth of the blockchain's size over time]

Page 4: Bitcoin Transaction Types

● One-to-one: a single input pays a single output.
● Many-to-many: several inputs fund several outputs in one transaction (see the sketch below).
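The deck shows these shapes only as diagrams; here is an illustrative stand-in (not from the slides) for what a many-to-many transaction looks like in Bitcoin Core's decoded JSON. The vin/vout field names are Bitcoin Core's; every txid, address, and value below is invented.

    # Shape follows Bitcoin Core's decoded-transaction JSON (vin/vout);
    # all txids, addresses, and values are made up for illustration.
    many_to_many_tx = {
        "txid": "f00d...",                      # placeholder, not a real txid
        "vin": [                                # two inputs, each spending an output
            {"txid": "aaaa...", "vout": 0},     # of some earlier transaction
            {"txid": "bbbb...", "vout": 1},
        ],
        "vout": [                               # three outputs, values in BTC
            {"value": 0.50, "n": 0, "scriptPubKey": {"addresses": ["1Alice..."]}},
            {"value": 0.30, "n": 1, "scriptPubKey": {"addresses": ["1Bob..."]}},
            {"value": 0.19, "n": 2, "scriptPubKey": {"addresses": ["1Carol..."]}},
        ],
    }
    # The fee is implicit: the inputs' value (stored in the spent
    # transactions, not here) minus the 0.99 BTC of outputs.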

Page 5: Block

A block contains Bitcoin transactions. There are almost 400,000 blocks today. The blockchain links all of these blocks together, much like a doubly linked list.
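A sketch (not from the deck) of why the chain behaves like a doubly linked list through the RPC interface: Bitcoin Core's getblock JSON carries both previousblockhash and nextblockhash, so a script can walk in either direction. This assumes a local full node and a python-bitcoinrpc AuthServiceProxy client; the credentials are placeholders.

    from bitcoinrpc.authproxy import AuthServiceProxy  # pip install python-bitcoinrpc

    # Placeholder credentials; these must match rpcuser/rpcpassword in bitcoin.conf.
    rpc = AuthServiceProxy("http://rpcuser:rpcpassword@127.0.0.1:8332")

    def walk_forward(start_hash, n):
        """Follow nextblockhash links for n blocks, like next-pointers in a list."""
        block_hash = start_hash
        for _ in range(n):
            block = rpc.getblock(block_hash)         # verbose JSON for one block
            print(block["height"], block_hash, len(block["tx"]), "transactions")
            block_hash = block.get("nextblockhash")  # absent only at the chain tip
            if block_hash is None:
                break

    walk_forward(rpc.getblockhash(0), 5)             # start at the genesis block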

Page 6: Datafying Bitcoins

Data● Historical Data

○ Almost 400,000 blocks (new bitcoins)○ More than 104 Million transactions so far

● Live Data○ 2 transaction per second○ Propagate through Peer to Peer

6

69 GB (2009-2016)

Page 7: Query

The target query: the evolution of the bitcoin transaction fee per block.
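The slides do not spell out the fee formula, so for reference: a transaction's fee is the value of the inputs it spends minus the value of its outputs, and the per-block fee sums this over every non-coinbase transaction. A minimal sketch, reusing the assumed AuthServiceProxy client from the block-walking example and assuming the node runs with txindex=1 (otherwise getrawtransaction cannot look up arbitrary historical transactions):

    def block_fee(rpc, block_hash):
        """Total fee paid in one block: sum over its non-coinbase
        transactions of (value of spent inputs - value of new outputs)."""
        block = rpc.getblock(block_hash)
        fee = 0.0
        for txid in block["tx"][1:]:             # tx[0] is the coinbase; no real inputs
            tx = rpc.getrawtransaction(txid, 1)  # 1 => decoded JSON, not raw hex
            in_value = 0.0
            for vin in tx["vin"]:
                # Input values live in the transactions being spent,
                # so each input costs one extra RPC lookup.
                prev = rpc.getrawtransaction(vin["txid"], 1)
                in_value += float(prev["vout"][vin["vout"]]["value"])
            out_value = sum(float(v["value"]) for v in tx["vout"])
            fee += in_value - out_value
        return fee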

Page 8: Working with Data

● Run a full node on AWS, which stores the entire blockchain ledger on AWS.
● Query the blockchain via JSON-RPC from Python.
● Two RPC calls per block (~200,000 relevant blocks, 6.5 GB of text storage).
  ○ Average time per RPC call = 1.45 s, a huge performance bottleneck. The workaround is to reduce this to one RPC call per block by storing all blocks in JSON format on disk/HDFS (sketched below).

[Diagram: the app issues two JSON-RPC calls against the Bitcoin node per block: (1) a get-block call that returns the block JSON, then (2) a get-transaction call that returns each transaction's JSON.]
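A sketch of the workaround described above, under the same client assumption as the earlier examples: fetch each block's JSON once and append it to a line-delimited file that later Spark batch jobs read from local disk or HDFS, instead of paying the RPC round-trip cost again. The file name and one-JSON-per-line layout are my choices, not necessarily the project's.

    import json

    def dump_blocks(rpc, start_height, end_height, out_path):
        """One pass over the chain: fetch each block's JSON once and append
        it to a line-delimited file, so later Spark jobs reread the file
        rather than paying the ~1.45 s per-call RPC cost again."""
        with open(out_path, "w") as out:
            for height in range(start_height, end_height):
                block = rpc.getblock(rpc.getblockhash(height))
                # default=str handles the Decimal values python-bitcoinrpc returns
                out.write(json.dumps(block, default=str) + "\n")

    # e.g. dump_blocks(rpc, 0, 200000, "blocks.jsonl"), then: hdfs dfs -put blocks.jsonl /data/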

Page 9: Data Pipeline

[Diagram: Bitcoin node → ingestion → file system (local disk) → batch processing → database → visualization, with a parallel stream-processing path fed through a netcat relay.]
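The deck shows the pipeline only as a diagram. As an illustration of the streaming leg, here is a minimal PySpark Streaming sketch, under the assumption that the netcat relay forwards one transaction JSON per line to a TCP port (host and port are placeholders):

    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="datafying_bitcoin_stream")
    ssc = StreamingContext(sc, 10)                 # 10-second micro-batches

    # Assumes the netcat relay writes one transaction JSON per line to port 9999.
    lines = ssc.socketTextStream("localhost", 9999)
    txs = lines.map(json.loads)

    # Per micro-batch: how many transactions arrived and how much BTC they moved.
    out_values = txs.map(lambda tx: sum(float(v["value"]) for v in tx["vout"]))
    out_values.count().pprint()
    out_values.reduce(lambda a, b: a + b).pprint()

    ssc.start()
    ssc.awaitTermination()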

Page 10: Accomplishments and Challenges

● The complex query (bitcoin transaction fee evolution) works end to end.
● Working with a sea of JSONs (two JSONs per block) in Apache Spark is complex; it takes time to scale the results.
● Ideally, the three modes (batch, streaming, and API) are compared for throughput, latency, and cost.
● Public APIs have rate limits. After a lot of searching, I found the Toshi API (https://toshi.io), which has no rate limit.

Page 11: Comparison

Mode           | # of processed blocks | Time (minutes) | Storage
RPC Batch      | 186,846               | 162            | Local file system
RPC Batch      | 186,846               | 69             | HDFS
RPC Streaming  | 187,990               | 177            | -
API Streaming  | 187,990               | 222            | -
API Batch      | 187,990               | 3.1            | HDFS

Storing data on HDFS pays off: Spark processing takes only 3.1 minutes in API mode and 69 minutes in RPC mode (62 of those minutes are RPC call overhead for the get-transaction call).

Page 12: Visualization

[Chart: evolution of the bitcoin transaction fee per block]

Page 13: Zooming In to Check Discontinuity

[Chart: zoomed-in view of the same series around an apparent discontinuity]

Page 14: About Me

● PhD in Computer Engineering: parallel computing & computer security
● In love with Linux
● Likes disruptive technology

Thank you + Q&A