Renaissance Technologies Presentation
-
Insight Data Science
Kuhan Wang
October 21st, 2015
1 / 20
Introduction
Insight Data Science: developed a machine learning pipeline in a consulting project.
PhD Particle Physics, McGill University, researcher on the Large Hadron Collider.
Led the search for microscopic black holes and exotic gravity states in the ATLAS Collaboration.
2 / 20
Consulting Scenario
Company X wishes to maximize user engagement through optimal placement of advertisements on content URLs.
Ad Type: Tourism
Keyword: Cuba
Keyword: Package Tour
Keyword: Airplane
Ad Type X
Keyword 1
Keyword 2
Keyword 3
...
Keyword N
Example: Tourism ads not ideal on investment content URL.
3 / 20
A Pipeline to Analyze Textual Features
Developed and implemented a pipeline to analyze the importance of textual features on content URLs relative to engagement.
Scrape URL
Process Text
Model Features
Extract Keywords
Update Keywords
Collect Data, Reiterate
Begin
4 / 20
User Engagement Data
[Figure: Summary of Engagement Data. Counts vs. occurrences for Page Loaded, Ad Viewed, and Ad Clicked.]
5 / 20
Modeling
Attempted linear regression.
Classify engagement as yes/no.
[Figure: Logistic Classification Model. Probability [%] (0 to 1) vs. word count (0 to 10), for Ad Clicked and Ad Not Clicked.]
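A minimal sketch of the yes/no classification step, assuming a single feature (the count of one keyword on a page) and a click label. The toy data, the function names, and the plain gradient-descent fit are illustrative assumptions, not the delivered pipeline.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(counts, clicked, lr=0.1, epochs=5000):
    """Fit P(click | word count) = sigmoid(w * count + b) by gradient descent
    on the mean logistic loss."""
    w, b = 0.0, 0.0
    n = len(counts)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(counts, clicked):
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: keyword occurrences per page and whether the ad was clicked.
counts = [0, 0, 1, 1, 2, 3, 4, 5, 6, 8, 9, 10]
clicked = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

w, b = fit_logistic(counts, clicked)
p_low = sigmoid(w * 0 + b)    # predicted click probability at word count 0
p_high = sigmoid(w * 10 + b)  # predicted click probability at word count 10
```

Thresholding the predicted probability at 0.5 then gives the clicked / not clicked decision.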
6 / 20
Validation
Randomly split data into training/test sets.
- Distribution of validation scores (shown for 50/50 split).
[Figure: Distribution of Precision vs Recall for 50.0% Test/Train, Ad Type 1. Axes: precision (0.55 to 0.85) and recall (0.4 to 0.85); color scale: number of MC toys (0 to 80); marker: ⟨Precision, Recall⟩.]
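The repeated random splitting can be sketched as follows: each "toy" reshuffles the data, trains on one part, and records precision and recall on the other, which yields a distribution of scores. The threshold classifier, the synthetic data, and all names are stand-in assumptions chosen to keep the example self-contained.

```python
import random

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = engaged)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_threshold(xs, ys):
    """Stand-in model: the word-count cut with the best training accuracy."""
    candidates = sorted(set(xs))
    return max(candidates,
               key=lambda c: sum((x >= c) == bool(y) for x, y in zip(xs, ys)))

def run_toys(xs, ys, n_toys=200, test_fraction=0.5, seed=0):
    """Repeat a random train/test split and collect (precision, recall) pairs."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = int(len(xs) * test_fraction)
    scores = []
    for _ in range(n_toys):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        cut = best_threshold([xs[i] for i in train], [ys[i] for i in train])
        pred = [1 if xs[i] >= cut else 0 for i in test]
        scores.append(precision_recall([ys[i] for i in test], pred))
    return scores

# Synthetic data (assumption): click probability grows with the word count.
data_rng = random.Random(1)
xs = [data_rng.randint(0, 10) for _ in range(400)]
ys = [1 if data_rng.random() < x / 10 else 0 for x in xs]

scores = run_toys(xs, ys)
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
```

Histogramming the collected pairs gives a two-dimensional distribution of precision against recall, and changing `test_fraction` gives the same study for a different split.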
7 / 20
Deliverables
Extracted keywords:
Rank  Ad Type 1  Ad Type 2       Ad Type 3    Ad Type 4
1     debt       coordinator     mortgage     gold
2     gift       administrative  home         0
3     profit     minimum         procurement  stock
4     check      minimum wage    loan         fund
5     balance    reports         trustee      event
Python pipeline delivered to the company for implementation.
Project details: http://kuhanw.zohosites.com/.
8 / 20
9 / 20
Dissertation Project
Particle Colliders recreate conditions in the early universe.
Searched for signatures of microscopic gravity at the Large Hadron Collider.
10 / 20
The Large Hadron Collider
27 km ring, most powerful particle accelerator built to date.
- 13 TeV collisions.
ATLAS: a giant particle detector.
Produced black holes leave debris due to evaporation inside detector.
2008 JINST 3 S08001
Figure 2.1: Schematic layout of the LHC (Beam 1: clockwise, Beam 2: anticlockwise).
systems. The insertion at Point 4 contains two RF systems: one independent system for each LHC beam. The straight section at Point 6 contains the beam dump insertion, where the two beams are vertically extracted from the machine using a combination of horizontally deflecting fast-pulsed ('kicker') magnets and vertically-deflecting double steel septum magnets. Each beam features an independent abort system. The LHC lattice has evolved over several versions. A summary of the different LHC lattice versions up to version 6.4 is given in ref. [20].
The arcs of LHC lattice version 6.4 are made of 23 regular arc cells. The arc cells are 106.9 m long and are made out of two 53.45 m long half cells, each of which contains one 5.355 m long cold mass (6.63 m long cryostat), a short straight section (SSS) assembly, and three 14.3 m long dipole magnets. The LHC arc cell has been optimized for a maximum integrated dipole field along the arc with a minimum number of magnet interconnections and with the smallest possible beam envelopes. Figure 2.2 shows a schematic layout of one LHC half-cell.
11 / 20
Data Processing
Developed complete analysis pipeline in C++.
- Processed ∼10 TB of LHC data using distributed computing methods.
Raw Data From Detector
Processed Data with Objects
Analysis Data Structure
Histogram Data for Final Fitting
~ TB
~ 100 GB
~ GB
~ TB
~ MB
12 / 20
Technical Analysis
~Energy of event
Black Hole Signals
Background Prediction
Quantify compatibility with likelihood model.
$$L(n_s \mid \mu, b, \theta) = P(n_s \mid s, \mu, b, \theta) \times \prod_i N_{\mathrm{syst}}(\theta^{0}, \theta, \sigma_{\theta})_i \tag{1}$$
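A numerical sketch of a likelihood with the structure of Eq. (1), under stated assumptions: P is taken to be a Poisson probability for the observed count, each N_syst term a Gaussian constraint on a nuisance parameter θ with nominal value θ0 and width σθ, and each θ is assumed to shift the background as a fractional change. None of these choices is confirmed by the slide; the function names are hypothetical.

```python
import math

def poisson_pmf(n, lam):
    """Poisson probability of observing n events given expectation lam > 0."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def gaussian_pdf(x, mean, sigma):
    """Normal density, used here as the constraint term for one systematic."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(n_obs, s, mu, b, thetas, theta0s, sigmas):
    """Poisson term times one Gaussian constraint per nuisance parameter.

    Assumption: each theta is a fractional shift of the background b,
    so the expected count is mu * s + b * (1 + sum(thetas)).
    """
    expected = mu * s + b * (1.0 + sum(thetas))
    value = poisson_pmf(n_obs, expected)
    for theta, theta0, sigma in zip(thetas, theta0s, sigmas):
        value *= gaussian_pdf(theta, theta0, sigma)
    return value

# Hypothetical numbers: 12 observed events, nominal signal 5, background 10,
# and one systematic with a 10% width.
n_obs, s, b = 12, 5.0, 10.0
scan = {mu: likelihood(n_obs, s, mu, b, [0.0], [0.0], [0.1])
        for mu in (0.0, 0.4, 1.0)}
```

Scanning the signal strength μ, as in the last lines, is the basic operation behind quantifying how compatible the data are with a signal hypothesis.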
13 / 20
Results
Placed leading constraints on models of microscopic gravity physics.
Models of n extra
dimensions
Planck mass of theory
95% CL Exclusion Contours
Black Hole Mass
Model Type
Public results: JHEP 07 (2015) 032, arXiv:1503.08988 [hep-ex]
14 / 20
ATLAS Detector
15 / 20
Thank you for your time.
16 / 20
Large Extra Spatial Dimensions
The size and number of extra spatial dimensions suppress the observed gravitational strength.
Observed gravity is weaker than intrinsic gravity within the bulk.
17 / 20
Backup
[Figure: Relative Number of Documents [%] (log scale, 10^-4 to 1) vs. Feature Frequency/Documents (0 to 1), Ad Type 1.]
18 / 20
FeatureRank
Kuhan Wang¹
¹ Insight Data Science
October 2, 2015
Abstract
FeatureRank is a software tool for extracting correlations between text ngram features and user engagement, thereby optimizing the placement of financial widgets on URL articles.
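To make "text ngram features" concrete, here is a small sketch that counts unigrams and bigrams in a piece of text. The tokenization rule and the function names are assumptions for illustration, not the tool's actual feature extraction.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, max_n=2):
    """Count every ngram of length 1..max_n in the text."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

counts = ngram_counts("Minimum wage rises as minimum wage debate continues")
```

Counts of this kind, one vector per URL, are the sort of feature that can then be related to engagement labels.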
1 Directory Structure
• /
processing.py
Pre-processing to parse relevant information from engagement csv files.
crawl.py
A simple web crawler that pulls the title and <p> tag text from URLs.
FeatureRank.py
Driver file to execute main functions.
feature_extraction_model.py
The core program that contains the machine learning algorithms.
post_processing.py
Post processing to produce evaluation metrics and ngram rankings.
web_text_data_set_1_2.json
A file containing the sorted JSON dictionaries of each URL; this is the input to FeatureRank.
read_json.py
A script converting the JSON file into a format that can be read into the model learning functions.
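As an illustration of the crawling step described for crawl.py, the sketch below pulls the title and the <p> tag text out of an HTML string using only the Python standard library. The class and function names are hypothetical, and fetching the page (for example with `urllib.request.urlopen`) is left out so the example runs offline.

```python
from html.parser import HTMLParser

class TitleParagraphParser(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self._current = None  # "title", "p", or None when outside both
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "p":
            self._current = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        # Text inside inline tags such as <b> within a <p> is kept as well.
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data

def extract_text(html):
    """Return (title, list of non-empty paragraph strings) for an HTML document."""
    parser = TitleParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), [p.strip() for p in parser.paragraphs if p.strip()]

page = ("<html><head><title>Gold Funds</title></head><body><h1>Skip</h1>"
        "<p>Gold <b>prices</b> rose.</p><p>Debt fell.</p></body></html>")
title, paragraphs = extract_text(page)
```

The extracted title and paragraphs could then be stored per URL, for example as one JSON dictionary per page.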
19 / 20
[Figure: Distribution of Precision vs Recall for 0.33% Test/Train, Ad Type 1. Axes: precision (0.55 to 0.85) and recall (0.4 to 0.85); color scale: number of MC toys (0 to 25); marker: ⟨Precision, Recall⟩.]
20 / 20