Infrastructure, Data Cleansing and Mining for Scientific Simulations

Yingping Huang
11/26/03

Committee Members: Dr. Bowyer, Dr. Flynn, Dr. Madey, Dr. Uhran



Agenda

• Overview
• Background
• Multi-tier infrastructure
• Data cleansing algorithms
• Data mining applications
• Summary
• Timeframe

Overview

• Multi-tier infrastructure powers scientific simulations.
• Data cleansing algorithms result in better data quality.
• Data mining applications discover hidden knowledge in environmental and social science.

Motivation

[Diagram labels: Infrastructure; NOM; OSS]

• Data storage, data analysis, reports
• Collaboration, personalization, web-based
• Simulation anytime and anywhere


Background

• Projects under way
  - NOM
    · Research on natural organic matter (NOM)
    · Study evolution of NOM over time
    · Joint work of scientists across disciplines, including chemists, biochemists, and environmental scientists
  - OSS
    · Research on the open source software (OSS) development phenomenon
    · Study the behavior of OSS developers and their motivations
    · Joint work with social scientists

Simulation Models

• Standalone or traditional client-server
  - Software needs to be installed on clients
  - Incompatibility makes installation difficult
• Web-based using applets
  - Security: file permissions, firewalls
  - Inconvenience: plug-in downloads
  - Network traffic: download before executing
  - Incompatibility: Swarm
• What should be done?
  - Web-based server-side simulation models
  - Centralized simulation management
  - Collaboration and personalization

Data Cleansing

• Known approaches
  - Sorted neighborhood (Stolfo 1995/1998)
    · Domain-dependent keys for sorting
  - Record matching (Monge, 2000)
    · Edit distance only
  - String mapping (Li, 2003)
    · Potentially high-dimensional target space
• Our approaches
  - Sample database
  - Lipschitz mapping
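
The slides name edit distance and a Lipschitz mapping but do not define either, so the following is only a minimal sketch of the general idea, not the author's algorithm. It assumes the mapping sends each string to its vector of edit distances to a few reference strings (drawing them from a sample database is an assumption, as are the function names and the example strings). Because edit distance obeys the triangle inequality, the largest coordinate difference between two mapped strings never exceeds their true edit distance, so the cheap vector comparison can rule out pairs that cannot be near-duplicates.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lipschitz_embed(s: str, references: list) -> tuple:
    """Map a string to its distances from each reference string (hypothetical helper)."""
    return tuple(edit_distance(s, r) for r in references)


def chebyshev(u: tuple, v: tuple) -> int:
    """Largest coordinate-wise difference; a lower bound on the true edit distance."""
    return max(abs(x - y) for x, y in zip(u, v))


# Illustrative data only: reference strings standing in for a sample database.
references = ["Huang", "Madey", "Flynn"]
records = ["Yingping Huang", "Yinping Huang", "Greg Madey"]
embedded = {r: lipschitz_embed(r, references) for r in records}
```

A cleansing pass could compare `chebyshev(embedded[x], embedded[y])` against a threshold first and compute the full `edit_distance(x, y)` only for the pairs that survive.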

Data Mining

• Data mining in astronomy
  - SKYCAT: star/galaxy classification (Fayyad, 1996)
  - JARTool: detect volcanoes on Venus (Burl, 1998)
  - Sapphire: find galaxies (Kamath, 2001)
• Data mining in biology
  - Bioinformatics
  - SARS diagnosis (ehealth.org)
• What should be done?
  - Data mining for social science (OSS)
  - Data mining for environmental science (NOM)
  - Add intelligence to simulation models by applying data mining results


Physical Layout

[Diagram labels: Internet; network switch; the Simulation Manager; external servers and clients (129.74.xxx.yyy, 129.74.aaa.bbb); private network (10.10.0.1 through 10.10.0.10)]

Multi-tier Architecture

[Diagram labels, grouped by tier]

• HTTP client tier: Client 1, Client 2, …, Client N
• HTTP server tier: Server
• Application server tier: Server 1 through Server 5
• Database server tier: SD, SM, DW, STDBY

Two Features

• Load-balancing
  - Scalability achieved
  - Implementation using JMS, AQ & EJB
  - Implementation using shell scripts & PL/SQL
• Simulation-resuming
  - Reliability achieved
  - Checkpoint
  - Implementation using JTA/JTS

Load-balancing Using JMS & AQ

[Diagram labels: job queue (columns: status, checkpoint, resumed, job_id); Topic 1 with mdb1 through mdb5; loadavg queue; judge bean; Topic 2 with mdb6 through mdb10; "Invoke simulation & update job queue"]

Shell Scripts & PL/SQL

• Dispatcher (HTTP server)
  - Dispatch simulations
  - Send KEEPALIVE messages to running simulations
• Intelligent agent (application server)
  - Upload load averages
  - Check simulations
  - Send ACK to KEEPALIVE messages
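
The KEEPALIVE/ACK exchange can be pictured with a small sketch. The original uses shell scripts and PL/SQL; the Python below is only an illustration, and the class name, method names, and the 30-second timeout are all assumptions. The dispatcher remembers when each simulation last acknowledged a KEEPALIVE and reports the ones that have been silent for too long. Time is passed in explicitly so the behavior is easy to follow.

```python
class Dispatcher:
    """Tracks the last ACK seen for each dispatched simulation (hypothetical sketch)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_ack = {}  # job_id -> time of the most recent ACK

    def dispatch(self, job_id: int, now: float) -> None:
        # Treat the moment of dispatch as the first sign of life.
        self.last_ack[job_id] = now

    def on_ack(self, job_id: int, now: float) -> None:
        # Called when the agent answers a KEEPALIVE for this simulation.
        if job_id in self.last_ack:
            self.last_ack[job_id] = now

    def unresponsive(self, now: float) -> list:
        # Simulations whose last ACK is older than the timeout.
        return sorted(j for j, t in self.last_ack.items()
                      if now - t > self.timeout_s)


dispatcher = Dispatcher(timeout_s=30)
dispatcher.dispatch(1, now=0)
dispatcher.dispatch(2, now=0)
dispatcher.on_ack(1, now=25)           # simulation 1 answers, simulation 2 stays silent
stalled = dispatcher.unresponsive(now=40)
```

Here `stalled` contains only job 2, which would be a candidate for simulation-resuming.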

Load-balancing Algorithm

• Instance learning approach
  - Based on completion time prediction
• Two-step completion time prediction
  - Completion time estimation
    · Load average
    · Data amount
  - Completion time prediction
    · Nearest neighbor
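
A minimal sketch of the prediction step, assuming past runs are stored as (load average, data amount, completion time) instances and that a new simulation goes to the server with the smallest predicted completion time. The estimation formula itself is not reproduced in the slides, so it is not modeled here; the function names, the unscaled Euclidean distance, and the sample history are assumptions.

```python
import math


def predict_completion(history, load_avg, data_mb, k=1):
    """Average completion time of the k past runs nearest to (load_avg, data_mb)."""
    ranked = sorted(
        history,
        key=lambda run: math.dist((run[0], run[1]), (load_avg, data_mb)),
    )
    nearest = ranked[:k]
    return sum(run[2] for run in nearest) / len(nearest)


def pick_server(history, server_loads, data_mb, k=1):
    """Choose the server whose current load gives the smallest predicted time."""
    return min(
        server_loads,
        key=lambda name: predict_completion(history, server_loads[name], data_mb, k),
    )


# Illustrative instances: (load average, data amount in MB, completion time in seconds).
history = [(0.2, 100, 60), (1.5, 100, 140), (3.0, 100, 300)]
choice = pick_server(history, {"app1": 2.8, "app2": 0.4}, data_mb=100)
```

With this history, `choice` is `"app2"`. In practice the two features would need scaling to comparable ranges before distances are taken.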

Completion Time Estimation

• Completion time estimation formula

Checkpoint

[Diagram labels: JDBC; SD; SM; JTA/JTS; one transaction]
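
The point of "one transaction" is that the checkpoint data and the job-queue bookkeeping are either both saved or both discarded. The slides do this over JDBC with JTA/JTS; the sketch below is a simplification that uses a single in-memory SQLite database as a stand-in. The table names, column names, and the `fail_midway` switch are hypothetical and exist only to show the rollback.

```python
import sqlite3


def save_checkpoint(conn, job_id, state, fail_midway=False):
    """Write the checkpoint data and bump the job's checkpoint counter atomically."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "INSERT OR REPLACE INTO checkpoint_data(job_id, state) VALUES (?, ?)",
            (job_id, state),
        )
        if fail_midway:
            # Demo hook: simulate a crash between the two writes.
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "UPDATE job_queue SET checkpoint = checkpoint + 1 WHERE job_id = ?",
            (job_id,),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_queue(job_id INTEGER PRIMARY KEY, checkpoint INTEGER NOT NULL)")
conn.execute("CREATE TABLE checkpoint_data(job_id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
conn.execute("INSERT INTO job_queue VALUES (7, 0)")
conn.commit()
```

If `save_checkpoint` fails partway through, the previously committed checkpoint stays in place, so a resumed simulation never sees half-written state.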

Checkpoint Issues

• Checkpoint data
  - All data for restarting the simulation
  - Size depends on number of agents
• Checkpoint frequency
  - Checkpoint-interval
    · # of MB of data
  - Checkpoint-timeout
    · # of minutes
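
One way to read the two settings, sketched under the assumption that a checkpoint is taken as soon as either limit is reached (the slides do not say how the two combine). The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckpointPolicy:
    interval_mb: float           # checkpoint-interval: MB of data between checkpoints
    timeout_min: float           # checkpoint-timeout: minutes between checkpoints
    mb_since_last: float = 0.0
    min_since_last: float = 0.0

    def record(self, mb_produced: float, minutes_elapsed: float) -> bool:
        """Account for new progress; return True when a checkpoint is due."""
        self.mb_since_last += mb_produced
        self.min_since_last += minutes_elapsed
        due = (self.mb_since_last >= self.interval_mb
               or self.min_since_last >= self.timeout_min)
        if due:
            # A checkpoint resets both counters.
            self.mb_since_last = 0.0
            self.min_since_last = 0.0
        return due


policy = CheckpointPolicy(interval_mb=50, timeout_min=10)
decisions = [
    policy.record(20, 2),   # 20 MB, 2 min: not due
    policy.record(35, 2),   # 55 MB reached: due by interval
    policy.record(1, 9),    # 1 MB, 9 min: not due
    policy.record(1, 1),    # 10 min reached: due by timeout
]
```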

Simulation-resuming

• To restart a terminated simulation
  - A new simulation with the same job_id is inserted into the job queue
  - A terminated simulation has a smaller job_id than new simulations, and therefore higher priority
• In case of application server failure
  - The job_ids of all simulations are inserted into the job queue
  - All simulations will run on other application servers
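
The priority rule above can be sketched with a heap ordered by job_id: a resumed simulation keeps its old, smaller job_id and so is served before simulations submitted later. This is an illustration only; the original queue is built on JMS and AQ, and the class and method names here are assumptions.

```python
import heapq


class JobQueue:
    """Min-heap on job_id, so smaller (older) job_ids are dispatched first."""

    def __init__(self):
        self._heap = []
        self._next_id = 1

    def submit(self) -> int:
        """Queue a brand-new simulation and return its job_id."""
        job_id = self._next_id
        self._next_id += 1
        heapq.heappush(self._heap, (job_id, False))
        return job_id

    def resume(self, job_id: int) -> None:
        """Re-queue a terminated simulation under its original job_id."""
        heapq.heappush(self._heap, (job_id, True))

    def next_job(self):
        """Return (job_id, resumed) for the highest-priority entry."""
        return heapq.heappop(self._heap)


queue = JobQueue()
first = queue.submit()         # job_id 1
second = queue.submit()        # job_id 2
running = queue.next_job()     # job 1 starts
third = queue.submit()         # job_id 3 arrives while job 1 runs
queue.resume(first)            # job 1 terminated; re-queue it
order = [queue.next_job(), queue.next_job(), queue.next_job()]
```

The resumed job 1 comes out ahead of jobs 2 and 3 even though it was re-queued last.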

Collaboration Suite

Graphical Reports

XML Reports


Methodology

• Traditional approach
  - Form hypotheses
  - Verify hypotheses by finding patterns in data
• Data mining approach
  - Find patterns in data
  - Form hypotheses
  - Design simulation models
  - Verify hypotheses
• U. Fayyad, J. Gray at Microsoft Research

Technology & Software

• Data mining technology
  - Clustering
    · K-means
    · Orthogonal cluster
  - Classification
    · Decision tree
    · Naïve Bayes
  - Association rules
    · Apriori
• Data mining software
  - Oracle Data Mining Suite
  - DM4J
  - JDeveloper
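
As a reminder of what the K-means entry refers to, here is a small textbook sketch in plain Python. It is not the Oracle Data Mining implementation named above; the function names, the caller-supplied starting centroids, and the sample points are assumptions made for illustration.

```python
import math


def assign(points, centroids):
    """Group each point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        updated = [
            # Mean of the members on each axis; an empty cluster keeps its centroid.
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
```

On these six points the two clusters separate cleanly into the three points near the origin and the three near (8, 8).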

OSS

• Study behavior of open source software (OSS) developers
  - Agent-based
  - Stochastic
• Data mining involving
  - Clustering
  - Classification
    · Churn prediction
    · Acquisition prediction
  - Association rules

OSS Data Warehousing

• Data from sourceforge.com
  - Developers
  - Projects
• Data warehousing
  - Table partitioning
  - Aggregation
  - Star schema
  - Analysis SQL
  - ETL tools → Warehouse Builder
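
A star schema for this kind of data might look like the sketch below: one fact table in the middle, joined to a developer dimension and a project dimension, with an aggregation query on top. The actual warehouse design is not shown in the slides, so every table name, column name (including `commits`), and row here is hypothetical, and SQLite stands in for the Oracle database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_developer(developer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_project(project_id INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one row per developer/project pair, with a numeric measure.
CREATE TABLE fact_membership(
    developer_id INTEGER REFERENCES dim_developer(developer_id),
    project_id   INTEGER REFERENCES dim_project(project_id),
    commits      INTEGER
);
INSERT INTO dim_developer VALUES (1, 'dev_a'), (2, 'dev_b');
INSERT INTO dim_project VALUES (10, 'proj_x'), (20, 'proj_y');
INSERT INTO fact_membership VALUES (1, 10, 5), (2, 10, 7), (1, 20, 3);
""")


def project_summary(conn):
    """Per project: number of distinct developers and total of the measure."""
    return conn.execute("""
        SELECT p.name, COUNT(DISTINCT f.developer_id), SUM(f.commits)
        FROM fact_membership AS f
        JOIN dim_project AS p ON p.project_id = f.project_id
        GROUP BY p.name
        ORDER BY p.name
    """).fetchall()


summary = project_summary(conn)
```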

NOM

• Study behavior of natural organic matter (NOM)
  - Agent-based
  - Stochastic
• Data mining involving
  - Clustering
    · Micelle formation
  - Classification
    · Transportation prediction
    · Adsorption prediction
  - Association rules


Summary

• Multi-tier information system integrates
  - Application servers & reports server
  - Database servers
  - Data warehousing & data mining
  - Swarm
• Collaboration suite
• Data mining guided model design

Insights & Impacts

• Server-side simulation models
  - Centralized simulation management
  - Centralized data repository
• Collaboration suite
  - Simulation sharing
  - Knowledge sharing
• Data mining applications
  - Find patterns in data
  - Model deployment for simulation design


Timeframe: May 2003 ~ May 2004

[Gantt chart over months 0, 3, 6, 9, 12; tasks listed in order]

• Implement infrastructure
• Data collection & statistical analysis
• Data mining model design
• Data mining model evaluation
• Deployment
• Writing up

Expected Publications

• Information system design for scientific simulations
  - By August 2003
• Data warehousing for scientific simulations
  - By November 2003
• Data mining for OSS
  - By February 2004
• Data mining for NOM
  - By March 2004

Demo

Demonstration

Finally

Thank you!

Features

• Multi-tier information system
  - HTTP client tier → HTTP server tier → Application server tier → EIS tier
• Scalability at the application server tier
  - Load-balancing
• Reliability at the application server tier
  - Simulation-resuming
• Reliability at the database tier
  - Standby databases

Features (cont.)

• Data mining models
  - Stored in database
  - Stored Java procedures
  - PL/SQL procedure call using JDBC
• Simulation models
  - Agent-based
  - Stochastic
  - Data mining guided