Talk on the Transients Classification Pipeline (TCP) for the CDI inter-departmental workshop at UC Berkeley, 2009-09-11.

, Justin Higgins, Adam Morgan

Josh Bloom (PI)

Transients Classification Pipeline

“Object” Datastream

Classify

Database containing “sources”

• features for a source

• data epochs associated with a source

Broadcast “sources”

• interesting or transient source

• include classifications

• include features, context

Survey Y (static survey repository): SDSS Stripe 82 archived data

Survey X (real-time survey telescope): PTF / LBL subtraction pipeline

LSST (future)

SASIR (future)


SDSS Stripe 82

• A deep field from the Sloan Digital Sky Survey

• 750 million observation epochs

• ~20 million “sources” clustered from epochs

• 5 colors / filters, 4 years of observations

• We used Stripe 82 for testing and development

Palomar Transient Factory

• Palomar 48” telescope

• 100 Mpix, 7.8 sq-deg detector

• ~120 s cadence; ~200 MB per exposure; <100 GB/night

• Post subtraction: ~1M difference objects / night

• Post filtering: ~10k difference objects / night

• ~100s of transient and variable stars
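As a rough consistency check, the figures above can be multiplied out. This is only a back-of-envelope sketch: the night length (`NIGHT_HOURS`) is an assumed value, and reading ~200 MB as the size of one exposure is an interpretation of the slide, not a figure stated in the talk.

```python
# Back-of-envelope PTF data volume. NIGHT_HOURS is an assumption.
CADENCE_S = 120      # ~120 s between exposures (from the slide)
EXPOSURE_MB = 200    # ~200 MB, assumed here to be one exposure
NIGHT_HOURS = 9      # assumption: usable dark time per night


def nightly_volume_gb(cadence_s, exposure_mb, night_hours):
    """Return (number of exposures, raw data volume in GB) for one night."""
    n_exposures = int(night_hours * 3600 // cadence_s)
    return n_exposures, n_exposures * exposure_mb / 1000.0


n_exp, volume_gb = nightly_volume_gb(CADENCE_S, EXPOSURE_MB, NIGHT_HOURS)

# Fraction of difference objects that survive filtering (~10k of ~1M).
surviving_fraction = 10_000 / 1_000_000

print(n_exp, volume_gb, surviving_fraction)
```

Under these assumptions the nightly volume stays below the quoted 100 GB bound, and filtering keeps roughly one difference object in a hundred.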

[Diagram: LBL subtraction pipeline, TCP, PTF consortium; telescopes: MDM 1.3m & 2.4m, PAIRITEL 1.3m, Palomar 60”]

Next Generation Survey: LSST

Large Synoptic Survey Telescope (LSST):

• 1 Gb every 2 seconds

• light curves of 800 million sources every 3 days

• 10^6 supernovae/yr, 10^5 eclipsing systems, 10^7 asteroids...

Transients Classification Pipeline

[Diagram: “Object” Datastream, source generation, feature generation, source classification, Database, Broadcast, follow-up telescope observations]

Parallelized source correlation and classification

• Retrieve difference objects

• Each difference-object is passed to an IPython client

• Each parallel IPython client performs:

  • Source creation or correlation with existing sources

  • “Feature” generation (or re-generation) for that source

  • Classification of that source
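The per-object work described above can be sketched roughly as follows. This is not the TCP code: a standard-library thread pool stands in for the IPython clients, and the catalog `KNOWN_SOURCES`, the 1-arcsecond `MATCH_RADIUS_DEG`, the toy features, and the rule in `classify` are all hypothetical placeholders.

```python
import math
from concurrent.futures import ThreadPoolExecutor

MATCH_RADIUS_DEG = 1.0 / 3600.0  # assumed 1-arcsecond association radius

# Hypothetical catalog of known sources: id -> (ra, dec, past magnitudes)
KNOWN_SOURCES = {
    1: (150.0000, 2.0000, [18.1, 18.3, 18.2]),
    2: (150.0100, 2.0100, [17.0, 19.5, 17.1]),
}


def correlate(obj):
    """Return the id of the nearest known source within the radius, else None."""
    best_id, best_sep = None, MATCH_RADIUS_DEG
    for sid, (ra, dec, _) in KNOWN_SOURCES.items():
        # Small-angle separation; adequate for arcsecond-scale matching.
        sep = math.hypot((obj["ra"] - ra) * math.cos(math.radians(dec)),
                         obj["dec"] - dec)
        if sep <= best_sep:
            best_id, best_sep = sid, sep
    return best_id


def make_features(mags):
    """Toy features: number of epochs and peak-to-peak amplitude."""
    return {"n_epochs": len(mags), "amplitude": max(mags) - min(mags)}


def classify(features):
    """Placeholder rule standing in for a trained classifier."""
    return "variable" if features["amplitude"] > 1.0 else "unclassified"


def process(obj):
    """Work done by one client for one difference object."""
    sid = correlate(obj)
    history = KNOWN_SOURCES[sid][2] if sid is not None else []
    features = make_features(history + [obj["mag"]])
    return {"source_id": sid, "new_source": sid is None,
            "features": features, "class": classify(features)}


difference_objects = [
    {"ra": 150.00001, "dec": 2.00001, "mag": 18.2},
    {"ra": 150.01001, "dec": 2.01000, "mag": 19.4},
    {"ra": 151.00000, "dec": 3.00000, "mag": 20.0},
]

# Each difference object is handed to a worker; map() preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process, difference_objects))
```

The sketch treats the catalog as read-only so that workers never write to shared state; a real pipeline would also need to write new sources back to the database safely.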



Parallelized source correlation and classification

• Realtime TCP runs on 22 dedicated cores

• LCOGT’s 96-core Beowulf cluster

  • non run-time tasks

  • Classifier generation

• Additional resources (for future classification work):

  • Yahoo! M45 cluster

  • Amazon EC2 cluster



Warehouse of light-curves

• Need representative light-curves for all science classes

• With these we can model each science class

• We’ve built a warehouse of example light-curves

TCP-TUTOR: internal interface

DotAstro.org: public interface

“Noisifying to the Survey”

• Well-sampled light-curves

  • Can make good classifiers for well-sampled data.

  • Don’t immediately make good classifiers for noisy, sparse data.

• We need classifiers which are trained using:

  • sampling cadence of our survey

  • sparseness of our survey data

  • noise and sensitivity limitations of our instrument

• We need “Noisification” software which:

  • Resamples well-sampled light-curves

  • Outputs noisified sources which are used for generating classifiers
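A minimal sketch of the resampling idea, not the actual noisification software: a dense template light-curve is interpolated at the survey's observation times, Gaussian noise is added, and points fainter than a limiting magnitude are dropped. The function name `noisify`, the Gaussian error model, and the parameters `mag_err` and `mag_limit` are assumptions made for illustration.

```python
import math
import random


def noisify(times, mags, obs_times, mag_err, mag_limit, seed=0):
    """Resample a well-sampled light curve at survey epochs and add noise.

    times, mags: dense template light curve (times strictly ascending).
    obs_times:   epochs at which the survey would have observed the source.
    mag_err:     assumed Gaussian photometric error, in magnitudes.
    mag_limit:   assumed limiting magnitude; fainter points are dropped.
    """
    rng = random.Random(seed)
    out = []
    for t in obs_times:
        if not times[0] <= t <= times[-1]:
            continue  # survey epoch falls outside the template coverage
        # Linear interpolation between the two bracketing template points.
        i = next(k for k in range(1, len(times)) if times[k] >= t)
        frac = (t - times[i - 1]) / (times[i] - times[i - 1])
        mag = mags[i - 1] + frac * (mags[i] - mags[i - 1])
        mag += rng.gauss(0.0, mag_err)
        if mag <= mag_limit:  # a larger magnitude means a fainter source
            out.append((t, mag))
    return out


# Dense 10-day template with a 0.5-day period, observed about once a night.
template_t = [0.01 * k for k in range(1001)]
template_m = [18.0 + 0.4 * math.sin(2 * math.pi * t / 0.5) for t in template_t]
survey_epochs = [0.3 + 1.0 * k for k in range(12)]

noisy = noisify(template_t, template_m, survey_epochs,
                mag_err=0.05, mag_limit=21.0)
```

The resulting sparse, noisy points are the kind of input a survey-specific classifier would be trained on in place of the original template.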


• For PTF:

  • Code uses PTF pointing and survey observing plans

  • Occasionally PTF observes using a faster cadence:

    • 7.5 minutes between revisiting an RA, Dec

  • Faster cadence requires a separate set of noisified light-curves and classifiers.

• Other surveys:

  • Other pointing and observing plans could be used.

  • Can generate noisified light-curves for other surveys.

  • Then we can generate science classifiers for these surveys.
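One way to picture a separate set per cadence is to generate observation times from each observing plan and keep them keyed by plan name. The sketch below is hypothetical: the plan names, the numbers of nights and visits, and the 60-minute nominal revisit are invented for illustration, and only the 7.5-minute revisit time comes from the talk. Real plans would also depend on pointing.

```python
def epochs_for_plan(start_day, n_nights, visits_per_night, revisit_minutes):
    """Observation times (in days) for a simple plan: a few visits each night."""
    step = revisit_minutes / (24.0 * 60.0)
    return [start_day + night + visit * step
            for night in range(n_nights)
            for visit in range(visits_per_night)]


# Hypothetical plans; only the 7.5 minute revisit time is from the talk.
PLANS = {
    "nominal": dict(n_nights=5, visits_per_night=2, revisit_minutes=60.0),
    "fast": dict(n_nights=1, visits_per_night=20, revisit_minutes=7.5),
}

# One list of epochs per plan; each would feed its own noisified training set.
epochs_by_plan = {name: epochs_for_plan(0.0, **plan)
                  for name, plan in PLANS.items()}
```

Swapping in another survey's plan changes only the entries in `PLANS`, which is the sense in which the same approach could serve other surveys.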


Classifiers

• General Classifier

• Timeseries Classifiers

  • Weighted combination of WEKA classifiers

  • bagged Random Forest classifier using a cost-matrix

  • Each classifier trained on different cadenced noisified data

• Astronomer-crafted classifiers for specific science types

  • Microlens, Supernova

Identify:

• well sampled (periodic & nonperiodic)

• interesting sources near known galaxies

• periodic variable science class when confidence is high

Filter out:

• poorly subtracted sources

• minor planets / rocks

• cosmic rays

• detector defects
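The role of a cost-matrix can be illustrated with a small sketch. One common way to use one, shown here as an assumption rather than as what the WEKA setup in the talk does, is to pick the class with the lowest expected misclassification cost instead of the most probable class. The class names, probabilities, and costs below are invented.

```python
def min_cost_class(probs, cost):
    """Pick the class with the lowest expected misclassification cost.

    probs: {true_class: probability} as produced by a classifier.
    cost:  cost[true_class][predicted_class], with 0 on the diagonal.
    """
    expected = {
        pred: sum(p * cost[true][pred] for true, p in probs.items())
        for pred in probs
    }
    return min(expected, key=expected.get), expected


# Hypothetical example: missing a real supernova is penalized five times
# more heavily than a false alarm caused by a cosmic ray.
probs = {"supernova": 0.4, "cosmic_ray": 0.6}
cost = {
    "supernova": {"supernova": 0.0, "cosmic_ray": 5.0},
    "cosmic_ray": {"supernova": 1.0, "cosmic_ray": 0.0},
}

label, expected = min_cost_class(probs, cost)
```

Here the less probable class wins because the cost of wrongly discarding it is higher, which is the behaviour a cost-matrix is meant to encode.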

Interesting near-galaxy PTF sources

• Identified by TCP at the end of Aug ’09

• Classification triggered by latest epoch added to the source

~0.4 day period RR Lyrae using 10 epoch noisification

• Currently, the science class of a source is determined by combining the weighted probabilities generated by different classification models.

• Each machine-learned classification model is trained using “noisified” lightcurves which were generated using different parameters.
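Combining weighted probabilities from several models might look like the sketch below. It assumes a simple weighted average with renormalization; the talk does not state the actual combination rule, and the model names, weights, and probabilities are made up.

```python
def combine(model_probs, weights):
    """Weighted average of per-model class probabilities, renormalized."""
    total = {}
    for model, probs in model_probs.items():
        for cls, p in probs.items():
            total[cls] = total.get(cls, 0.0) + weights[model] * p
    norm = sum(total.values())
    return {cls: value / norm for cls, value in total.items()}


# Hypothetical models, each trained on differently noisified light-curves.
model_probs = {
    "10_epoch": {"RR Lyrae": 0.5, "Eclipsing": 0.5},
    "15_epoch": {"RR Lyrae": 0.8, "Eclipsing": 0.2},
    "20_epoch": {"RR Lyrae": 0.6, "Eclipsing": 0.4},
}
weights = {"10_epoch": 1.0, "15_epoch": 2.0, "20_epoch": 1.0}

combined = combine(model_probs, weights)
best_class = max(combined, key=combined.get)
```

How the weights are chosen is left open here; they could, for example, reflect how closely each model's training cadence matches the source's actual sampling.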

0.1 - 0.17 day period RR Lyrae using 15 epoch noisification

Clicking on a class for one of dozens of ML models...

...shows highest classification probability sources for that model::class

~0.14 day period RR Lyrae using 20 epoch noisification

Periodic variable classifiers

Overplotting of period-folded model still needs work

period-fold plotting probably failed here
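For reference, period folding maps each observation time to a phase within one cycle, so that a periodic source traces a single curve. The sketch below shows the standard calculation; the function name, the reference time `t0`, and the sample data are illustrative and not taken from the TCP plotting code.

```python
def fold_light_curve(times, mags, period, t0=0.0):
    """Return (phase, magnitude) pairs sorted by phase.

    Phase is the fractional part of (t - t0) / period, so it lies between
    0 and 1; times and period must be in the same units.
    """
    phases = [((t - t0) / period) % 1.0 for t in times]
    return sorted(zip(phases, mags))


# Illustrative data folded on a 0.4 day period.
times = [0.0, 0.3, 0.5, 1.0]
mags = [18.0, 18.4, 18.1, 18.2]

folded = fold_light_curve(times, mags, period=0.4)
```

A model evaluated on the same phase grid can then be overplotted on the folded points; a wrong period scatters the points instead of lining them up.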

Evaluating and Combining Classifiers

• Issues when using multiple classifiers:

  • How to combine classifiers when using:

    • weighted classifiers

    • tree-hierarchy of sub-classifiers

  • How to generate final classification “probabilities” when using:

    • Widely varying types of classifiers

    • Classifiers which contain sub-classifications & probabilities

• Evaluate the final combination of classifiers

  • Classify PTF09xxx user-classified sources, determine efficiencies

  • Classify noisified sources, determine efficiencies
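Determining efficiencies from a labelled test set could be sketched as below. The definitions are assumptions: "efficiency" is taken to mean the fraction of true members of a class that the classifier recovers, and "purity" the fraction of predictions for a class that are correct. The labels are invented.

```python
from collections import Counter


def efficiency_and_purity(true_labels, predicted_labels):
    """Per-class efficiency (recall) and purity (precision)."""
    n_true = Counter(true_labels)
    n_pred = Counter(predicted_labels)
    n_correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    report = {}
    for cls in sorted(set(n_true) | set(n_pred)):
        eff = n_correct[cls] / n_true[cls] if n_true[cls] else None
        pur = n_correct[cls] / n_pred[cls] if n_pred[cls] else None
        report[cls] = {"efficiency": eff, "purity": pur}
    return report


# Hypothetical user classifications versus pipeline output.
true_labels = ["SN", "SN", "RRL", "RRL", "RRL", "CR"]
predicted = ["SN", "CR", "RRL", "RRL", "SN", "CR"]

report = efficiency_and_purity(true_labels, predicted)
```

The same function applies to both test sets named above: user-classified PTF sources and noisified sources whose true class is known by construction.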

Technology
