Renaissance Technologies Presentation - Insight Data Science, Kuhan Wang, October 21st, 2015

Renaissance v2


Page 1: Renaissance v2

Renaissance Technologies Presentation - Insight Data Science

Kuhan Wang

October 21st, 2015

1 / 20

Page 2: Renaissance v2

Introduction

Insight Data Science: developed a machine learning pipeline in a consulting project.

PhD in Particle Physics, McGill University; researcher on the Large Hadron Collider.

Led the search for microscopic black holes and exotic gravity states in the ATLAS Collaboration.

2 / 20

Page 3: Renaissance v2

Consulting Scenario

Company X wishes to maximize user engagement through optimal placement of advertisements on content URLs.

Ad Type: Tourism. Keywords: Cuba, Package Tour, Airplane.

Ad Type X. Keywords: Keyword 1, Keyword 2, Keyword 3, ..., Keyword N.

Example: Tourism ads not ideal on investment content URL.

3 / 20

Page 4: Renaissance v2

A Pipeline to Analyze Textual Features

Developed and implemented a pipeline to analyze the importance of textual features on content URLs relative to engagement.

Begin -> Scrape URL -> Process Text -> Model Features -> Extract Keywords -> Update Keywords -> Collect Data, Reiterate

4 / 20

Page 5: Renaissance v2

User Engagement Data

[Figure: Summary of Engagement Data. x-axis: Occurrences; y-axis: Counts; series: Page Loaded, Ad Viewed, Ad Clicked.]

5 / 20

Page 6: Renaissance v2

Modeling

Attempted linear regression.

Classify engagement as yes/no.

[Figure: Logistic Classification Model. x-axis: Word Count (0 to 10); y-axis: Probability [%]; series: Ad Clicked, Ad Not Clicked.]
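As an illustration of classifying engagement as yes/no from a word count, here is a minimal sketch of a one-feature logistic model fitted by gradient descent. The toy data, the learning rate, and the names `fit_logistic`, `word_counts`, and `clicked` are assumptions made for this example; the delivered pipeline may use a different implementation.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(click | x) = sigmoid(w * x + b) by batch gradient descent
    on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: keyword count on a page, and whether the ad was clicked.
word_counts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
clicked = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

w, b = fit_logistic(word_counts, clicked)
p_low = sigmoid(w * 0 + b)    # predicted click probability at word count 0
p_high = sigmoid(w * 10 + b)  # predicted click probability at word count 10
print(f"w={w:.3f}, b={b:.3f}, p(0)={p_low:.2f}, p(10)={p_high:.2f}")
```

A positive fitted weight `w` means the predicted click probability rises with the word count, which is the relationship the figure above is plotting.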

6 / 20

Page 7: Renaissance v2

Validation

Randomly split data into training/test sets.

- Distribution of validation scores (shown for 50/50 split).

[Figure: Distribution of Precision vs Recall for 50.0% Test/Train, Ad Type 1. x-axis: Precision; y-axis: Recall; color scale: Number of MC Toys; marker: ⟨Precision, Recall⟩.]
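The validation procedure (many random train/test splits, each yielding one precision/recall pair) can be sketched as follows. This is an assumption-laden illustration: the synthetic data, the threshold "model" standing in for the real classifier, and the names `split_scores` and `fit_threshold` are all hypothetical.

```python
import random

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels; 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def fit_threshold(xs, ys):
    """Stand-in model: the count threshold with the best training accuracy."""
    def accuracy(t):
        return sum((x >= t) == bool(y) for x, y in zip(xs, ys))
    return max(sorted(set(xs)), key=accuracy)

def split_scores(data, test_fraction=0.5, n_splits=200, seed=0):
    """Repeat random train/test splits; return one (precision, recall) per split."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        threshold = fit_threshold([x for x, _ in train], [y for _, y in train])
        preds = [1 if x >= threshold else 0 for x, _ in test]
        scores.append(precision_recall([y for _, y in test], preds))
    return scores

# Hypothetical synthetic data: click probability grows with the word count.
gen = random.Random(1)
data = []
for _ in range(400):
    x = gen.randint(0, 10)
    data.append((x, 1 if gen.random() < x / 10 else 0))

scores = split_scores(data)
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
print(f"mean precision={mean_precision:.2f}, mean recall={mean_recall:.2f}")
```

Histogramming `scores` in two dimensions would give a plot of the same kind as the figure above, with the mean pair as the marker.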

7 / 20

Page 8: Renaissance v2

Deliverables

Extracted keywords:

Rank  Ad Type 1  Ad Type 2       Ad Type 3    Ad Type 4
1     debt       coordinator     mortgage     gold
2     gift       administrative  home         0
3     profit     minimum         procurement  stock
4     check      minimum wage    loan         fund
5     balance    reports         trustee      event

Pipeline in Python delivered to the company for implementation.

Project details: http://kuhanw.zohosites.com/.

8 / 20

Page 9: Renaissance v2

9 / 20

Page 10: Renaissance v2

Dissertation Project

Particle Colliders recreate conditions in the early universe.

Searched for signatures of microscopic gravity at the Large Hadron Collider.

10 / 20

Page 11: Renaissance v2

The Large Hadron Collider

27 km ring, most powerful particle accelerator built to date.

- 13 TeV collisions.

ATLAS: a giant particle detector.

Produced black holes leave debris due to evaporation inside detector.

[Figure: Schematic layout of the LHC (Beam 1 clockwise, Beam 2 anticlockwise). Source: 2008 JINST 3 S08001, Figure 2.1.]

11 / 20

Page 12: Renaissance v2

Data Processing

Developed complete analysis pipeline in C++.

- Processed ∼10 TB of LHC data using distributed computing methods.

Raw Data From Detector

Processed Data with Objects

Analysis Data Structure

Histogram Data for Final Fitting

~ TB

~ 100 GB

~ GB

~ TB

~ MB

12 / 20

Page 13: Renaissance v2

Technical Analysis

[Figure: Black Hole Signals vs Background Prediction. x-axis: ~Energy of event.]

Quantify compatibility with likelihood model.

$$L(n_s \mid \mu, b, \theta) = P(n_s \mid s, \mu, b, \theta) \times \prod_i N_{\mathrm{syst}}(\theta^{0}, \theta, \sigma_{\theta})_i \tag{1}$$
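To make the structure of Eq. (1) concrete, here is a minimal numerical sketch. It assumes the usual reading of the symbols, which the slide does not spell out: P is a Poisson term for the observed count n_s given an expectation μ·s + b, and each N_syst factor is a Gaussian constraint tying a nuisance parameter θ to its nominal value θ⁰ with width σ_θ. The linear background response `b_shifts` and all numbers are hypothetical.

```python
import math

def poisson_pmf(n, lam):
    """P(n | lam) for a Poisson-distributed count, computed in log space."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def gaussian_pdf(x, mean, sigma):
    """Normal density, used here as the assumed systematic constraint term."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(n_s, mu, s, b, thetas, theta0s, sigmas, b_shifts):
    """Poisson term times a product of Gaussian constraints, as in Eq. (1).

    b_shifts[i] is the (assumed) fractional change of the background
    per unit shift of nuisance parameter i.
    """
    b_eff = b * (1.0 + sum(d * t for d, t in zip(b_shifts, thetas)))
    expected = mu * s + b_eff
    constraint = math.prod(
        gaussian_pdf(t0, t, sig) for t0, t, sig in zip(theta0s, thetas, sigmas)
    )
    return poisson_pmf(n_s, expected) * constraint

# Hypothetical numbers: 12 observed events, 10 expected background, 5 expected signal.
common = dict(n_s=12, s=5.0, b=10.0, thetas=[0.0], theta0s=[0.0],
              sigmas=[1.0], b_shifts=[0.1])
l_bkg = likelihood(mu=0.0, **common)  # background-only hypothesis
l_sig = likelihood(mu=1.0, **common)  # nominal signal-plus-background hypothesis
print(f"L(mu=0)={l_bkg:.4f}, L(mu=1)={l_sig:.4f}")
```

With these toy inputs the background-only hypothesis has the larger likelihood. In a real analysis the likelihood would be maximized or profiled over θ rather than evaluated at fixed values.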

13 / 20

Page 14: Renaissance v2

Results

Placed leading constraints on models of microscopic gravity physics.

[Figure: 95% CL Exclusion Contours. Labels: Models of n extra dimensions, Planck mass of theory, Black Hole Mass, Model Type.]

Public results: JHEP 07 (2015) 032, arXiv:1503.08988 [hep-ex]

14 / 20

Page 15: Renaissance v2

ATLAS Detector

15 / 20

Page 16: Renaissance v2

Thank you for your time.

16 / 20

Page 17: Renaissance v2

Large Extra Spatial Dimensions

The size and number of extra spatial dimensions suppress the observed gravitational strength.

Observed gravity is weaker than intrinsic gravity within the bulk.

17 / 20

Page 18: Renaissance v2

Backup

[Figure: Ad Type 1. x-axis: Feature Frequency/Documents (0 to 1); y-axis: Relative Number of Documents [%] (log scale, 10^-4 to 1).]

18 / 20

Page 19: Renaissance v2

FeatureRank

Kuhan Wang (Insight Data Science)

October 2, 2015

Abstract

FeatureRank is a software tool for extracting correlations between text ngram features and user engagement, thereby optimizing the placement of financial widgets on URL articles.
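To illustrate what relating ngram features to engagement can look like, here is a minimal sketch that extracts unigrams and bigrams and ranks them by how much more often they occur in engaged documents than in unengaged ones. The scoring rule is a simple stand-in chosen for this example, not necessarily the algorithm used in FeatureRank, and the names `ngrams`, `rank_features`, and the sample documents are hypothetical.

```python
import re
from collections import Counter

def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

def rank_features(documents):
    """Rank n-grams by document frequency in engaged docs minus unengaged docs.

    documents is a list of (text, engaged) pairs with engaged in {0, 1}.
    """
    engaged, other = Counter(), Counter()
    n_engaged = sum(1 for _, y in documents if y)
    n_other = len(documents) - n_engaged
    for text, y in documents:
        # set() so each n-gram counts at most once per document
        (engaged if y else other).update(set(ngrams(text)))
    vocab = set(engaged) | set(other)
    score = {g: engaged[g] / max(n_engaged, 1) - other[g] / max(n_other, 1)
             for g in vocab}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(vocab, key=lambda g: (-score[g], g))

# Hypothetical documents: (page text, 1 if the widget was engaged with else 0).
docs = [
    ("Pay down debt and check your balance", 1),
    ("Debt relief and mortgage profit", 1),
    ("Cheap package tour to Cuba", 0),
    ("Airplane tickets and package tour deals", 0),
]
ranking = rank_features(docs)
print(ranking[:3], ranking[-3:])
```

In this toy sample "debt" ranks first because it appears in every engaged document and in no unengaged one.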

1 Directory Structure

• /

processing.py

Pre-processing to parse relevant information from engagement csv files.

crawl.py

A simple web crawler that pulls the title and <p> tag text from URLs.

FeatureRank.py

Driver file to execute main functions.

feature_extraction_model.py

The core program that contains the machine learning algorithms.

post_processing.py

Post processing to produce evaluation metrics and ngram rankings.

web_text_data_set_1_2.json

A file containing the sorted JSON dictionaries of each URL; this is the input to FeatureRank.

read_json.py

A script converting the JSON file into a format that can be read into the model learning functions.
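As an illustration of the extraction step described for crawl.py (pulling the title and <p> tag text from a page), here is a minimal sketch using only the Python standard library. It parses an HTML string rather than fetching a URL, and the class and function names are hypothetical; the actual crawl.py may be structured differently.

```python
from html.parser import HTMLParser

class TitleAndParagraphParser(HTMLParser):
    """Collect the <title> text and the text inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            # Text inside inline tags such as <b> still lands in the paragraph.
            self.paragraphs[-1] += data

def extract_text(html):
    """Return (title, [paragraph, ...]) with surrounding whitespace stripped."""
    parser = TitleAndParagraphParser()
    parser.feed(html)
    paragraphs = [p.strip() for p in parser.paragraphs if p.strip()]
    return parser.title.strip(), paragraphs

# Hypothetical page content used in place of a live URL.
sample_html = (
    "<html><head><title>Gold funds</title></head><body>"
    "<p>Gold and <b>stock</b> funds.</p><div>menu</div><p>Loan rates.</p>"
    "</body></html>"
)
title, paragraphs = extract_text(sample_html)
print(title, paragraphs)
```

Text outside <p> elements, such as the navigation `div` in the sample, is ignored, which keeps menus and other page chrome out of the text passed to the model.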


19 / 20

Page 20: Renaissance v2

[Figure: Distribution of Precision vs Recall for 0.33% Test/Train, Ad Type 1. x-axis: Precision; y-axis: Recall; color scale: Number of MC Toys; marker: ⟨Precision, Recall⟩.]

20 / 20