A Data/Detector Characterization Pipeline (What is it and why we need one)
Soumya D. Mohanty, AEI
January 18, 2001
Outline of the talk
• Functions of a Pipeline
• A Walk through a candidate pipeline
• Requirements: Issues
• Proposal for a plan of work
The functions of a pipeline
• Why have one?
– Understanding a new feature or establishing confidence in a detection will require a fair amount of manual work (human intensive).
– The large data rate (main + auxiliary channels) means an automated tool that helps focus our attention is essential.
• Definition: An automated tool to point out “interesting” segments.
– Not meant for detector commissioning stage data.
– Types: data/detector characterization, data preparation or conditioning.
– The design of these two types may not be cleanly separable.
– Byproducts: routine, uninteresting information (data summaries) to support data mining tasks.
• Open issue: What is interesting?
– An automated tool requires a precise definition of the interesting features.
– Examples: change in PSD, transients, change in cross-couplings, …
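One listed example, a change in the PSD, shows how such a definition can be made precise even with a very simple statistic. The sketch below is purely illustrative, not part of the proposed design (all function names and thresholds are made up): it flags segments whose band-limited power exceeds twice the median band power over all segments.

```python
import numpy as np

def band_power(x, fs, f_lo, f_hi):
    """Mean periodogram power of x in the band [f_lo, f_hi) Hz."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / (fs * len(x))
    band = (freqs >= f_lo) & (freqs < f_hi)
    return psd[band].mean()

def psd_change_flags(data, fs, seg_len, f_lo, f_hi, ratio=2.0):
    """Flag segments whose band power exceeds `ratio` times the
    median band power over all segments."""
    n_seg = len(data) // seg_len
    powers = np.array([band_power(data[i * seg_len:(i + 1) * seg_len],
                                  fs, f_lo, f_hi) for i in range(n_seg)])
    return powers > ratio * np.median(powers)

# Simulated example: Gaussian noise whose standard deviation doubles
# (i.e., the PSD quadruples) in the last 4 of 16 segments.
rng = np.random.default_rng(0)
fs, seg_len = 1024, 1024
data = np.concatenate([rng.normal(0, 1, 12 * seg_len),
                       rng.normal(0, 2, 4 * seg_len)])
flags = psd_change_flags(data, fs, seg_len, f_lo=10, f_hi=400)
print(flags)  # only the last 4 segments are flagged
```

A real tool would need a reference PSD estimate and a statistically characterized threshold rather than a fixed ratio; the point here is only that "interesting" must be reduced to a test this explicit before it can be automated.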
Pipeline: Not just a sum of its parts
• Simple example
– Transient test characterized without studying the effect on/of line noise.
– Line removal tool characterized without studying the effect on/of transients.
– When real data is passed through the line removal tool followed by the transient test, the result will differ from the transient test followed by line removal.
• There can exist other “cross-couplings” which will affect the overall performance of a pipeline.
• Computational costs need not be a simple sum of parts.
• Pipeline design and characterization will involve more than the study of tools in isolation.
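The ordering effect described above can be demonstrated on simulated data. In this toy sketch (deliberately simplistic stand-ins, not the actual pipeline tools), a strong spectral line inflates the noise estimate used by the transient test, so a weak transient is found only when line removal runs first:

```python
import numpy as np

rng = np.random.default_rng(1)
fs, n = 1024, 4096
t = np.arange(n) / fs

# Simulated data: unit-variance noise + a strong 60 Hz "line"
# + one weak 10-sample transient.
data = rng.normal(0, 1, n) + 5 * np.sin(2 * np.pi * 60 * t)
data[2000:2010] += 8.0

def remove_line(x, f0):
    """Toy line removal: least-squares fit and subtract a sinusoid at f0."""
    basis = np.column_stack([np.sin(2 * np.pi * f0 * t),
                             np.cos(2 * np.pi * f0 * t)])
    coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return x - basis @ coef

def transient_count(x, k=5.0):
    """Toy transient test: count samples beyond k robust standard deviations."""
    med = np.median(x)
    sigma = 1.4826 * np.median(np.abs(x - med))
    return int(np.sum(np.abs(x - med) > k * sigma))

a = transient_count(remove_line(data, 60.0))  # line removal first: transient found
b = transient_count(data)                     # raw data: line hides the transient
print(a, b)
```

The two orderings give different answers on the same data, which is exactly why the pipeline must be characterized as a whole, not tool by tool.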
Analyzing pipeline performance
• Basic criteria: The pipeline should not raise too many false alarms. On the other hand, it should not miss interesting segments.
– Extremely reliable statistical characterization will be required.
• Open issue: Metrics for pipeline performance (or pipeline calibration).
– A metric must include: false alarm and detection rates, dependence on a priori modeling of the data, computational costs, …
– For data preparation pipeline: Calibrate by injecting GW signals into input.
– For data/detector characterization pipeline: ?
• Bottom Line: Lot of experience with simulated and real data is required.
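That experience can begin before the metrics are settled: the false-alarm and detection probabilities of any single test can be estimated by Monte Carlo on simulated data. A minimal sketch, using a toy threshold test and a toy injected transient (none of this is the proposed pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)

def detector(segment, threshold):
    """Toy 'interesting segment' test: flag if max |sample| exceeds threshold."""
    return np.max(np.abs(segment)) > threshold

def monte_carlo_rates(threshold, n_trials=2000, seg_len=512, amp=4.0):
    """Estimate false-alarm and detection probabilities by simulation:
    noise-only segments for false alarms, segments with an injected
    single-sample transient of amplitude `amp` for detections."""
    false_alarms = detections = 0
    for _ in range(n_trials):
        noise = rng.normal(0, 1, seg_len)
        false_alarms += detector(noise, threshold)
        signal = noise.copy()
        signal[seg_len // 2] += amp
        detections += detector(signal, threshold)
    return false_alarms / n_trials, detections / n_trials

for thr in (3.0, 3.5, 4.0):
    pf, pd = monte_carlo_rates(thr)
    print(f"threshold={thr}: false-alarm rate={pf:.3f}, detection rate={pd:.3f}")
```

Sweeping the threshold traces out the trade-off between the two error rates; the open question above is how to extend such a calibration from one test to the whole pipeline.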
A Candidate Pipeline
• Design status: At the stage of a blueprint that can be implemented.
– Several new tools identified that need to be developed (e.g., a line removal method that is unaffected by transients).
– The blueprint is concrete enough to begin computational cost and statistical characterization studies.
• Origins
– The word “pipeline” has been used on several occasions (e.g., the LSC Data Analysis White Paper), but this is the first concrete design.
– 1999: SDM commissioned to design one as part of the 40m/TAMA coincidence analysis project.
• Important: A pipeline will affect planning for other data analysis components.
– Examples: software/hardware environment, user interfaces, a sophisticated database or simple sequential files, interfaces to DAQ, …
Data/Detector Characterization Pipeline
Requirements: Issues
• Computing.
– Should work online.
– Memory requirements might be non-trivial if database access overheads turn out to be large.
• Implementation Language and environment.
– Within LDAS (adapted to GEO)? Language: C++
– TRIANA? Java
– DMT? VEGA? C++
• Database. Not an issue confined to this pipeline alone.
– Need depends on what kind of data mining tasks will be required.
– Examples: (1) collect data with a particular type of transient; (2) store information about new types of features.
• Others.
– Lots of ideas and guidelines from users required for the design phase.
– Code writing and testing phase will be manpower intensive.
Proposal for a plan of work (fastest)
• Almost all components available in MATLAB.
• Use sequential files instead of relational database.
• Implement as a large MATLAB program.
• Come up with some metrics of performance.
• Test against simulated and some real data.
• If there is a coincidence run with LIGO, aim to produce X hours of characterized data using this MATLAB code.
• In the meantime, work on related issues and requirements definition.
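For the sequential-files option above, the data-mining query mentioned earlier ("collect data with a particular type of transient") reduces to a linear scan over per-segment summary records. A sketch of what such records and a query might look like (the field names are invented for illustration; nothing here reflects an agreed format):

```python
import csv, io

# Toy segment-summary records: one line per analyzed segment,
# written sequentially instead of into a relational database.
summary = io.StringIO()
writer = csv.DictWriter(summary,
                        fieldnames=["gps_start", "gps_end", "feature", "snr"])
writer.writeheader()
writer.writerows([
    {"gps_start": 100, "gps_end": 116, "feature": "transient",  "snr": 8.2},
    {"gps_start": 116, "gps_end": 132, "feature": "psd_change", "snr": 5.1},
    {"gps_start": 132, "gps_end": 148, "feature": "transient",  "snr": 12.7},
])

# Data-mining query: collect all segments with a particular feature type.
summary.seek(0)
hits = [row for row in csv.DictReader(summary) if row["feature"] == "transient"]
print([row["gps_start"] for row in hits])  # → ['100', '132']
```

Such scans are adequate for the fast-track plan; queries that combine many feature types over long stretches of data are where a real database would start to pay off.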
Conclusions
• The large amount of data makes a pipeline necessary in order to direct our attention to where it is really required.
• Pipeline design and characterization requires more than listing tools and studying them in isolation.
• Pipeline design can identify missing tools and features.
• A concrete design now exists.
• Several candidate pipelines must be generated and compared.
• What is interesting? Guidelines, ideas and experience with real data are required to evolve an answer.