Mayur Thakur, Managing Director, Goldman Sachs, at MLconf NYC 2017

Surveillance Platform for Bank Compliance

Mayur Thakur, Goldman Sachs

Stakes Are High

• “Banks pay out £166bn over six years: a history of banking misdeeds and fines” – The Guardian
• “Banks 'pay 60%' of profits in fines and customer payments” – BBC News
• “Deutsche Bank to Pay $2.5 Billion to Settle Libor Investigation” – The Wall Street Journal
• “$1.2 Billion Fine for Hedge Fund SAC Capital in Insider Case” – NY Times

Key Technical Challenges

Diverse data sets and formats (SQL, flat files, proprietary, etc.)

Size of data, updated frequently

• ~1B* pieces of text per year
• ~1B edges in a graph
• 100s of millions of trading events in a day

Data from the past can change (e.g., manual trade correction)

• Causes a cascade of changes

Surveillance decisions need to be debuggable

• Why was trade X on Oct 25, 2015 not flagged?

Not real time; often need time guarantees (say, T+1)

* All numbers are “orders of magnitude”
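One way to reconcile two of the challenges above (past data can change, yet decisions must stay debuggable) is to keep every version of a record together with the time it was recorded, and to query the data "as of" a given moment. The sketch below is only an illustration under that assumption, not the platform described in the talk; `VersionedStore`, `put`, and `as_of` are hypothetical names, and the quantities are made up.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


class VersionedStore:
    """Append-only store: a correction adds a new version, nothing is overwritten."""

    def __init__(self):
        # key -> list of (recorded_at, value), kept sorted by recorded_at
        self._versions = defaultdict(list)

    def put(self, key, value, recorded_at):
        versions = self._versions[key]
        versions.append((recorded_at, value))
        versions.sort(key=lambda v: v[0])

    def as_of(self, key, when):
        """Return the value that was visible at time `when`, or None."""
        versions = self._versions.get(key, [])
        times = [t for t, _ in versions]
        i = bisect_right(times, when)
        return versions[i - 1][1] if i else None


store = VersionedStore()
store.put("trade-X", {"qty": 100}, datetime(2015, 10, 25, 16, 0))
# A manual correction made later does not erase what was known before.
store.put("trade-X", {"qty": 1000}, datetime(2015, 10, 27, 9, 30))

# What a T+1 surveillance run saw, versus what the data says today.
seen_by_t_plus_1_run = store.as_of("trade-X", datetime(2015, 10, 26, 6, 0))
seen_today = store.as_of("trade-X", datetime(2017, 3, 24))
```

Replaying a surveillance against `as_of` values is one way to answer "why was trade X not flagged?" using the data the original run actually saw.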

Surveillance Architecture

[Diagram: data sources (SQL 1 … SQL n, Flatfile 1 … Flatfile m, Prop 1 … Prop k, HDFS 1 … HDFS q); Flattened 1 … Flattened n; Preprocessing pipeline 1 … Preprocessing pipeline n; Surv. 1 …; Alerts …; Bookkeeping]

Spoofing Illustration

A Real World Spoofing Case

Navinder Singh Sarao was accused of spoofing

…and even contributing to the flash crash of 2010

Sarao pled guilty to spoofing in Nov 2016

He allegedly made $40M in illegal profit over years.

Review of Regulatory Cases

– Analyzed six regulatory enforcement cases related to spoofing
– Identified common factors indicative of spoofing behavior
  • Creating a false impression of demand by placing spoof orders on the opposite side to trigger a price movement (“order imbalance”)
  • Cancellation of spoof orders within a short time after pivot execution (“time to cancel post execution”)

Case                  Order Imbalance     Time to Cancel Post Execution
                      ( > 2.5 times )     ( < 1 sec )
Sarao / Flash Crash   ✓                   ✓
Hold Brothers         ✓                   ✓
Coscia / Panther      ✓                   ✓
Visionary Trading     NA                  ✓
Swift                 ✓                   5 secs
3 Red                 ✓                   ✓
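The two factors in the table can be read as a simple rule check. The sketch below is a minimal illustration, not the surveillance logic from the talk. It assumes that order imbalance is the ratio of the opposite-side quantity to the pivot-side quantity, and that both factors must hold together. Only the thresholds (2.5 times, 1 second) come from the table; `PivotActivity` and the function names are hypothetical.

```python
from dataclasses import dataclass

IMBALANCE_THRESHOLD = 2.5      # "> 2.5 times" from the table
CANCEL_WINDOW_SECONDS = 1.0    # "< 1 sec" from the table


@dataclass
class PivotActivity:
    pivot_side_qty: float        # quantity on the side of the executed (pivot) order
    opposite_side_qty: float     # quantity of orders placed on the opposite side
    pivot_exec_time: float       # seconds since start of day
    opposite_cancel_time: float  # when the opposite-side orders were cancelled


def order_imbalance(a: PivotActivity) -> float:
    """Assumed definition: opposite-side quantity relative to pivot-side quantity."""
    if a.pivot_side_qty <= 0:
        raise ValueError("pivot_side_qty must be positive")
    return a.opposite_side_qty / a.pivot_side_qty


def time_to_cancel(a: PivotActivity) -> float:
    """Seconds between the pivot execution and the opposite-side cancellation."""
    return a.opposite_cancel_time - a.pivot_exec_time


def spoofing_factors(a: PivotActivity) -> dict:
    return {
        "order_imbalance": order_imbalance(a) > IMBALANCE_THRESHOLD,
        "time_to_cancel": 0 <= time_to_cancel(a) < CANCEL_WINDOW_SECONDS,
    }


def is_candidate(a: PivotActivity) -> bool:
    # Assumption: both factors must be present for an alert candidate.
    return all(spoofing_factors(a).values())


# Made-up inputs: a fast cancel after a 4x imbalance, and a slow (5 s) cancel.
fast_cancel = PivotActivity(100, 400, 34200.0, 34200.4)
slow_cancel = PivotActivity(100, 400, 34200.0, 34205.0)
```

Returning the per-factor booleans rather than a single flag keeps the decision explainable, which matters when a reviewer asks why a trade was or was not flagged.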

Transactions Data Pipeline

Spoofing implementation has 2 parts: data preprocessing and surveillance logic

Data preprocessing pipeline is reused for multiple surveillances

~100M orders, 1B market data points, 100K products, multiple order management systems

[Diagram: inputs (Order 1 … Orders n, Exec. 1 … Exec m, Market 1 … Market k, Account, Product); Flattened Order, Flattened Exec, Flattened Market; Order Processing Pipeline, Exec. Processing Pipeline, Mkt. Processing Pipeline; Related Transactions; surveillances (Spoofing, Front running, Surv. n …); Alerts …]

Related Transactions Table

One row of the related transactions table contains information about one pivot execution and all the activity around the time of that execution.

[Figure: Related Transactions table with columns Pivot Exec, Orders, Execs/Cancels, MktData; example values shown: 216.8, 216.9, …]
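A row of this kind could be assembled by collecting, for each pivot execution, the events in the same product that fall inside a time window around it. The sketch below is an assumption-laden illustration, not the pipeline from the talk. The `Event` type, the `related_transactions_row` function, and the 5-second window are all hypothetical, and the sample events are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str       # "order", "exec", "cancel", or "mkt"
    time: float     # seconds since start of day
    product: str
    payload: dict = field(default_factory=dict)


def related_transactions_row(pivot, events, window_seconds=5.0):
    """Collect same-product activity within +/- window_seconds of the pivot."""
    nearby = [
        e for e in events
        if e is not pivot
        and e.product == pivot.product
        and abs(e.time - pivot.time) <= window_seconds
    ]
    return {
        "pivot_exec": pivot,
        "orders": [e for e in nearby if e.kind == "order"],
        "execs_cancels": [e for e in nearby if e.kind in ("exec", "cancel")],
        "mkt_data": [e for e in nearby if e.kind == "mkt"],
    }


pivot = Event("exec", 100.0, "ABC", {"side": "buy", "qty": 100})
events = [
    Event("order", 99.0, "ABC", {"side": "sell", "qty": 400}),
    pivot,
    Event("cancel", 100.4, "ABC"),
    Event("mkt", 100.1, "ABC", {"price": 50.25}),
    Event("mkt", 100.2, "XYZ", {"price": 10.0}),   # other product, excluded
    Event("order", 200.0, "ABC"),                  # outside the window, excluded
]
row = related_transactions_row(pivot, events)
```

Building the row once lets several surveillances reuse it, in line with the reuse of the preprocessing pipeline described earlier in the deck.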

Search Problem

Given a semi-structured corpus of about 1B documents in a Hadoop cluster, design a search engine over YARN that is fast and satisfies the investigative needs of a variety of users.

Unique Challenges

Cannot move data outside of an already existing Hadoop cluster

Support deep scoring algorithms specifically for GS-specific signals (colloquial language, trades, etc.)

Unstructured and structured signals

Search Workflow

[Diagram: Web Client; Search Master; Ranker; Fast Index Servers; Slow Index Servers; HBase; YARN containers; HDFS]

• Implemented as YARN apps
• Auth enabled
• Slow index servers can scale as much as HBase

# indexed documents > 1 billion
# indexed tokens > 500 billion
Current index size runs to several TBs (memory and disk)
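The slide names the components but not how they interact, so the sketch below is a guess at one plausible flow rather than the actual design: a search master fans a query out to a fast and a slow index, merges the candidates, and hands them to a ranker. Everything here is hypothetical, including `InvertedIndex`, `rank`, `search`, and the term-count scoring. Both indexes are plain in-memory dictionaries standing in for the real index servers and HBase.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index; stands in for either index server type."""

    def __init__(self, docs):
        self.docs = dict(docs)               # doc_id -> text
        self.postings = defaultdict(set)     # token -> {doc_id}
        for doc_id, text in self.docs.items():
            for token in text.lower().split():
                self.postings[token].add(doc_id)

    def lookup(self, query):
        tokens = query.lower().split()
        if not tokens:
            return set()
        # Documents that contain every query token.
        return set.intersection(*(self.postings.get(t, set()) for t in tokens))


def rank(query, doc_ids, docs):
    """Hypothetical ranker: score is the total count of query tokens in a doc."""
    tokens = query.lower().split()

    def score(doc_id):
        words = docs[doc_id].lower().split()
        return sum(words.count(t) for t in tokens)

    # Highest score first; ties broken by doc id for a stable order.
    return sorted(doc_ids, key=lambda d: (-score(d), d))


def search(query, fast_index, slow_index, top_k=10):
    """Hypothetical search master: fan out, merge candidates, then rank."""
    candidates = fast_index.lookup(query) | slow_index.lookup(query)
    docs = {**slow_index.docs, **fast_index.docs}
    return rank(query, candidates, docs)[:top_k]


fast = InvertedIndex({"m1": "sell the order now", "m2": "lunch order"})
slow = InvertedIndex({"m3": "cancel order cancel order", "m4": "quarterly report"})
results = search("order", fast, slow)
```

In a design like this, the scoring function is the natural place to plug in domain-specific signals such as the ones listed under "Unique Challenges".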