mlconf
Surveillance Platform for Bank Compliance
Mayur Thakur, Goldman Sachs
Stakes Are High
• “Banks pay out £166bn over six years: a history of banking misdeeds and fines” – The Guardian
• “Banks 'pay 60%' of profits in fines and customer payments” – BBC News
• “Deutsche Bank to Pay $2.5 Billion to Settle Libor Investigation” – The Wall Street Journal
• “$1.2 Billion Fine for Hedge Fund SAC Capital in Insider Case” – NY Times
Key Technical Challenges
Diverse data sets and formats (SQL, flat files, proprietary, etc.)
Size of data, updated frequently
• ~1B* pieces of text per year
• ~1B edges in a graph
• 100s of millions of trading events in a day
Data from past can change (e.g., manual trade correction)
• Causes a cascade of changes
Surveillance decisions need to be debuggable
• Why was trade X on Oct 25, 2015 not flagged?
Not real time; often need time guarantees (say, T+1)
* All numbers are orders of magnitude
Surveillance Architecture
[Architecture diagram. Labels: data sources SQL 1 … SQL n, Flatfile 1 … Flatfile m, Prop 1 … Prop k, HDFS 1 … HDFS q; Flattened 1 … Flattened n; Preprocessing pipeline 1 … Preprocessing pipeline n; Surv. 1 …; Alerts …; Bookkeeping]
Spoofing Illustration
A Real World Spoofing Case
Navinder Singh Sarao was accused of spoofing
…and even contributing to the flash crash of 2010
Sarao pled guilty to spoofing in Nov 2016
He allegedly made $40M in illegal profit over years.
Review of Regulatory Cases
– Analyzed six regulatory enforcement cases related to spoofing
– Identified common factors indicative of spoofing behavior
• Creating a false impression of demand by placing spoof orders on the opposite side to trigger a price movement (“order imbalance”)
• Cancellation of spoof orders within a short time after the pivot execution (“time to cancel post execution”)
Factors by case (✓ = factor present at the stated threshold)

Case                   Order Imbalance     Time to Cancel Post Execution
                       ( > 2.5 times )     ( < 1 sec )
Sarao / Flash Crash    ✓                   ✓
Hold Brothers          ✓                   ✓
Coscia / Panther       ✓                   ✓
Visionary Trading      NA                  ✓
Swift                  ✓                   5 secs
3 Red                  ✓                   ✓
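The two factors in the table can be sketched as a simple check around one pivot execution. This is only an illustration under stated assumptions, not the surveillance logic from the talk: the field names, the `PivotWindow` and `spoofing_factors` names, and the definition of order imbalance as opposite-side quantity divided by same-side quantity are all hypothetical. Only the two thresholds (2.5 times, 1 second) come from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the slide; everything else is a hypothetical sketch.
IMBALANCE_THRESHOLD = 2.5    # opposite-side qty more than 2.5x same-side qty
CANCEL_WINDOW_SECONDS = 1.0  # spoof orders cancelled within 1 sec of the pivot


@dataclass
class PivotWindow:
    """Activity around one pivot execution (field names are assumptions)."""
    pivot_exec_time: float              # seconds
    same_side_qty: float                # open quantity on the pivot's side
    opposite_side_qty: float            # open quantity on the opposite side
    opposite_cancel_times: List[float]  # cancel times of opposite-side orders


def spoofing_factors(w: PivotWindow) -> dict:
    """Compute both factors and whether each crosses its threshold."""
    if w.same_side_qty > 0:
        imbalance = w.opposite_side_qty / w.same_side_qty
    else:
        # No same-side quantity: treat any opposite-side quantity as unbounded.
        imbalance = float("inf") if w.opposite_side_qty > 0 else 0.0

    # Delay from the pivot execution to the earliest cancel at or after it.
    delays = [t - w.pivot_exec_time
              for t in w.opposite_cancel_times
              if t >= w.pivot_exec_time]
    time_to_cancel: Optional[float] = min(delays) if delays else None

    return {
        "order_imbalance": imbalance,
        "imbalance_flag": imbalance > IMBALANCE_THRESHOLD,
        "time_to_cancel": time_to_cancel,
        "cancel_flag": (time_to_cancel is not None
                        and time_to_cancel < CANCEL_WINDOW_SECONDS),
    }
```

Returning the raw factor values alongside the flags keeps a decision explainable after the fact, in the spirit of the "debuggable" requirement listed among the technical challenges.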
Transactions Data Pipeline
Spoofing implementation has 2 parts: data preprocessing and surveillance logic
Data preprocessing pipeline is reused for multiple surveillances
~100M orders, ~1B market data points, ~100K products, multiple order management systems
[Pipeline diagram. Labels: inputs Order 1 … Orders n, Exec. 1 … Exec m, Market 1 … Market k, Account, Product; Flattened Order, Flattened Exec, Flattened Market; Order Processing Pipeline, Exec. Processing Pipeline, Mkt. Processing Pipeline; Related Transactions; surveillances Spoofing, Front running, Surv. n …; Alerts]
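The reuse of one preprocessing pipeline by several surveillances can be sketched as follows. This is a toy illustration, not the platform's code: the record fields, `preprocess`, `run_surveillances`, and the two placeholder checks (`large_orders`, `sell_orders`) are hypothetical and stand in for real surveillances such as spoofing.

```python
from typing import Callable, Dict, List

Row = dict  # one preprocessed record; the real schema is not given in the slides


def preprocess(raw_orders: List[dict]) -> List[Row]:
    """Hypothetical shared preprocessing: normalise fields once."""
    return [
        {"order_id": o["id"], "qty": float(o["qty"]), "side": o["side"].upper()}
        for o in raw_orders
    ]


def large_orders(rows: List[Row]) -> List[str]:
    """Placeholder surveillance: flag orders of 1000 units or more."""
    return [r["order_id"] for r in rows if r["qty"] >= 1000]


def sell_orders(rows: List[Row]) -> List[str]:
    """Placeholder surveillance: flag every sell order."""
    return [r["order_id"] for r in rows if r["side"] == "SELL"]


def run_surveillances(
    raw_orders: List[dict],
    surveillances: Dict[str, Callable[[List[Row]], List[str]]],
) -> Dict[str, List[str]]:
    """Preprocess once, then hand the same rows to every surveillance."""
    rows = preprocess(raw_orders)
    return {name: check(rows) for name, check in surveillances.items()}
```

Because every surveillance reads the same preprocessed rows, the expensive normalisation step runs once per batch rather than once per surveillance.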
Related Transactions Table
[Table illustration: Related Transactions, with columns Pivot Exec, Orders, Execs/Cancels, MktData; sample values 216.8, 216.9, …]
One row of the related transactions table contains information about one pivot execution and all the activity around the time of that execution.
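One way to picture a row of the related transactions table is as a record keyed by the pivot execution, holding the nearby orders, executions/cancels, and market data. The sketch below is an assumption-laden illustration: the class and field names, the event kinds, and the fixed time window used to decide what counts as "around the time" are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Event:
    kind: str     # "order", "exec" or "cancel" (labels are assumptions)
    time: float   # seconds
    price: float


@dataclass
class RelatedTransactionsRow:
    """One pivot execution plus the activity around it."""
    pivot_exec: Event
    orders: List[Event] = field(default_factory=list)
    execs_cancels: List[Event] = field(default_factory=list)
    mkt_data: List[float] = field(default_factory=list)


def build_row(pivot: Event, events: List[Event],
              ticks: List[Tuple[float, float]],
              window: float) -> RelatedTransactionsRow:
    """Collect everything within `window` seconds of the pivot execution."""
    def near(t: float) -> bool:
        return abs(t - pivot.time) <= window

    row = RelatedTransactionsRow(pivot_exec=pivot)
    for e in events:
        if not near(e.time):
            continue
        if e.kind == "order":
            row.orders.append(e)
        else:
            row.execs_cancels.append(e)
    # ticks are (time, price) pairs; keep only prices inside the window.
    row.mkt_data = [price for t, price in ticks if near(t)]
    return row
```

Keeping all activity for a pivot in a single row means a surveillance can evaluate one execution without further joins across order, execution, and market data tables.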
Search Problem
Given a semi-structured corpus of about 1B documents in a Hadoop cluster, design a search engine over YARN that is fast and satisfies the investigative needs of a variety of users.
Unique Challenges
Cannot move data outside of an already existing Hadoop cluster
Support deep scoring algorithms for GS-specific signals (colloquial language, trades, etc.)
Unstructured and structured signals
Search Workflow
[Workflow diagram. Labels: Web Client, Search Master, Ranker, Fast Index Servers, Slow Index Servers, HBase, YARN containers, HDFS]
• Implemented as YARN apps
• Auth enabled
• Slow Index Servers can scale as much as HBase
# indexed documents > 1 billion
# indexed tokens > 500 billion
Current index size runs to several TB (memory and disk)